Merge lp://staging/~jameinel/bzr/2.4-cheaper-iter-entries-by-dir into lp://staging/bzr

Proposed by John A Meinel
Status: Merged
Approved by: John A Meinel
Approved revision: no longer in the source branch.
Merged at revision: 5730
Proposed branch: lp://staging/~jameinel/bzr/2.4-cheaper-iter-entries-by-dir
Merge into: lp://staging/bzr
Diff against target: 224 lines (+158/-2)
3 files modified
bzrlib/inventory.py (+69/-2)
bzrlib/tests/test_inv.py (+82/-0)
doc/en/release-notes/bzr-2.4.txt (+7/-0)
To merge this branch: bzr merge lp://staging/~jameinel/bzr/2.4-cheaper-iter-entries-by-dir
Reviewer Review Type Date Requested Status
Jelmer Vernooij (community) code Approve
Vincent Ladeuil Approve
Review via email: mp+53994@code.staging.launchpad.net

Commit message

Fix bug #737234. When running iter_entries_by_dir preload all the data if we are going to be looking at it all.

Description of the change

This addresses bug #737234.

Basically, "iter_entries_by_dir" is a very non-optimal ordering for reading the chk pages in. CHK pages are hash(file_id) ordered (pretty much random). So going dir-by-dir means that at each directory, we end up reading a lot of CHK child pages. These often get re-read the next time we go through for a different directory.

With this patch "time list(iter_entries_by_dir())" goes from 4m30s down to 11s locally (on gcc-linaro, which has XXk files). It also changes the number of bytes transferred from 8.1GB down to around 128MB. That still seems a little big to me, but I don't see anything obvious at this point.

The change just switches it so that we always iterate in chk order, for both the id_to_entry map and the parent_id_basename_to_file_id map.

To post a comment you must log in.
Revision history for this message
John A Meinel (jameinel) wrote :

That was meant to be 70k entries.

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Somebody recently mentioned to me that we don't allow the use of "assert"... :-)

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/18/2011 5:36 PM, Jelmer Vernooij wrote:
> Somebody recently mentioned to me that we don't allow the use of "assert"... :-)

You're absolutely right, I meant to pull those out after I had it working.

I updated it to raise ValueError on any of the inconsistencies. None of
them should ever trigger, and I wonder if we really want to check them.
But since I did it, I'm ok leaving it in. It probably makes the code a
little bit harder to follow, though.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk2EZ50ACgkQJdeBCYSNAAOSuQCfQw0i9e8P/t+z05Cnj7GpxO2I
MywAnjaoZWUX50HQvui+OyYfGXoOA9R1
=nAEZ
-----END PGP SIGNATURE-----

Revision history for this message
Vincent Ladeuil (vila) wrote :

29 + if from_dir is None and specific_file_ids is None:
30 + # They are iterating from the root, and have not specified any
31 + # specific entries to look at. All current callers fully consume the
32 + # iterator, so we can safely assume we are accessing all entries
33 + self._preload_cache()
34 if from_dir is None:
35 if self.root is None:
36 return

You're pre-loading the cache even when from_dir and self.root are None, is there nothing to preload in this case or does it mean the shortcut is now ineffective and what are the consequences then ?

review: Needs Information
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/20/2011 12:18 PM, Vincent Ladeuil wrote:
> Review: Needs Information
> 29 + if from_dir is None and specific_file_ids is None:
> 30 + # They are iterating from the root, and have not specified any
> 31 + # specific entries to look at. All current callers fully consume the
> 32 + # iterator, so we can safely assume we are accessing all entries
> 33 + self._preload_cache()
> 34 if from_dir is None:
> 35 if self.root is None:
> 36 return
>
> You're pre-loading the cache even when from_dir and self.root are None, is there nothing to preload in this case or does it mean the shortcut is now ineffective and what are the consequences then ?
>

If self.root is None, there shouldn't be anything to preload, because if
you have no root, you better not have any children of root.

That said, I'm not 100% sure that CHKInventory can exist without a root.
If you had an object like that, it would probably have to be an
Inventory object.

Now, iter_entries_by_dir() exists only on CommonInventory, so it is
shared between both types. However the base implementation of _preload
is also a no-op.

So if you have a tree with no root, then preload is a no-op (one way or
the other).

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk2GTGAACgkQJdeBCYSNAANO1ACfWVNV/NeV9PqMmB7Gh6Hl0sxU
JckAoNFKKICCU0Zl6ku0VfGYSzPUHzWL
=KmmW
-----END PGP SIGNATURE-----

Revision history for this message
Vincent Ladeuil (vila) wrote :

It's rather weird to have an iterator turned into a full loading method even for a special case...

But since this is a significant progress I think we should land it. A further submission clarifying the different use cases would certainly be welcome.

review: Approve
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Is this a reasonable thing to backport? It seems like it would be very beneficial to have this at least in natty.

Are there perhaps other locations where preloading would make sense? E.g. Inventory.iter_entries(from_dir=None) ?

review: Approve (code)
Revision history for this message
John A Meinel (jameinel) wrote :

sent to pqm by email

Revision history for this message
John A Meinel (jameinel) wrote :

I think we could reasonable back port this to at least 2.3, what do other people think?

Revision history for this message
John A Meinel (jameinel) wrote :

sent to pqm by email

Revision history for this message
Andrew Bennetts (spiv) wrote :

John A Meinel wrote:
> I think we could reasonable back port this to at least 2.3, what do other people think?

Sounds reasonable to me, assuming you're confident that there's little
risk involved.

-Andrew.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
The diff is not available at this time. You can reload the page or download it.