Merge lp://staging/~jelmer/bzr/hpss-get-inventories into lp://staging/bzr

Proposed by Jelmer Vernooij
Status: Merged
Approved by: Jelmer Vernooij
Approved revision: no longer in the source branch.
Merged at revision: 6372
Proposed branch: lp://staging/~jelmer/bzr/hpss-get-inventories
Merge into: lp://staging/bzr
Prerequisite: lp://staging/~jelmer/bzr/hpss-_get-checkout-format
Diff against target: 728 lines (+372/-52)
17 files modified
bzrlib/remote.py (+171/-5)
bzrlib/smart/repository.py (+58/-1)
bzrlib/smart/request.py (+6/-3)
bzrlib/tests/blackbox/test_annotate.py (+1/-1)
bzrlib/tests/blackbox/test_branch.py (+3/-3)
bzrlib/tests/blackbox/test_cat.py (+2/-3)
bzrlib/tests/blackbox/test_checkout.py (+3/-8)
bzrlib/tests/blackbox/test_export.py (+2/-3)
bzrlib/tests/blackbox/test_log.py (+4/-6)
bzrlib/tests/blackbox/test_ls.py (+2/-3)
bzrlib/tests/blackbox/test_sign_my_commits.py (+4/-12)
bzrlib/tests/per_interbranch/test_push.py (+2/-2)
bzrlib/tests/per_repository_chk/test_supported.py (+11/-1)
bzrlib/tests/per_repository_vf/test_add_inventory_by_delta.py (+1/-1)
bzrlib/tests/test_remote.py (+41/-0)
bzrlib/tests/test_smart.py (+48/-0)
doc/en/release-notes/bzr-2.5.txt (+13/-0)
To merge this branch: bzr merge lp://staging/~jelmer/bzr/hpss-get-inventories
Reviewer Review Type Date Requested Status
Jelmer Vernooij (community) Approve
Vincent Ladeuil Approve
Andrew Bennetts Pending
Review via email: mp+85252@code.staging.launchpad.net

This proposal supersedes a proposal from 2011-11-29.

Commit message

Add HPSS call for retrieving inventories.

Description of the change

Add a HPSS call for ``Repository.iter_inventories``.

This massively reduces the number of roundtrips for various commands, and causes a fair number to no longer use VFS calls at all when used against a modern remote server.

Repository.get_deltas_for_revisions() now has its own implementation for remote repositories, which means the client won't use the VFS to get at the inventories to generate the deltas from.

Unlike my previous attempt at this, this now uses record streams with inventory deltas. It is still a separate verb though.

To post a comment you must log in.
Revision history for this message
Jelmer Vernooij (jelmer) wrote : Posted in a previous version of this proposal

Timings when checking out bzr.dev:

~/src/bzr/hpss-get-inventories/bzr co --lightweight bzr://people.samba.org/bzr.dev 2.25s user 0.46s system 15% cpu 17.277 total
~/src/bzr/hpss-get-inventories/bzr export /tmp/bzr.dev2 bzr://people.samba.org/bzr.dev 1.54s user 0.40s system 12% cpu 15.435 total

versus against an older bzr server:

~/src/bzr/hpss-get-inventories/bzr co --lightweight bzr://people.samba.org/bzr.dev 4.91s user 1.03s system 10% cpu 58.807 total
~/src/bzr/hpss-get-inventories/bzr export /tmp/bzr.dev5 bzr://people.samba.org/bzr.dev 4.35s user 0.98s system 9% cpu 58.357 total

Revision history for this message
Jelmer Vernooij (jelmer) wrote : Posted in a previous version of this proposal

For comparison (though I should it's a different server):

apt-get source bzr 2.79s user 0.54s system 27% cpu 12.338 total

so we're getting closer.

Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

Jelmer Vernooij wrote:
> Add a HPSS call for ``Repository.iter_inventories``.

I am a bit concerned that by adding verbs like this, with their own ad hoc
record stream-like wire format, that we're growing the maintenance burden
unnecessarily, and not reusing improvements everywhere we could.

I'm glad this is using inventory-deltas. But if it's worth zlib compressing
them here, shouldn't we do that for inventory-deltas from get_record_stream too?

Also, could you implement this via the get_record_stream RPC with a new
parameter that means “just inventories (rather than rev+inv+texts+chks+sigs for
given keys)”?

On one hand it might feel a bit ugly to make one RPC do so many things,
get_record_stream, do so many things, but on the other I think it provides a
useful pressure to keep the interface minimal and consistent (e.g. whether
records are zlib-compressed).

Adding this iter_inventories RPC might be the right thing (although putting
“iter” in the name of an RPC feels weird to me, that's surely a property of the
Python API rather than a property of the RPC), but I'd like hear what you think
about the tradeoffs of doing that vs. extending get_record_stream.

-Andrew.

Revision history for this message
Martin Pool (mbp) wrote : Posted in a previous version of this proposal

On 30 November 2011 09:37, Andrew Bennetts
<email address hidden> wrote:
> Jelmer Vernooij wrote:
>> Add a HPSS call for ``Repository.iter_inventories``.
>
> I am a bit concerned that by adding verbs like this, with their own ad hoc
> record stream-like wire format, that we're growing the maintenance burden
> unnecessarily, and not reusing improvements everywhere we could.

+1

--
Martin

Revision history for this message
Vincent Ladeuil (vila) wrote : Posted in a previous version of this proposal

>> I am a bit concerned that by adding verbs like this, with their own ad hoc
>> record stream-like wire format, that we're growing the maintenance burden
>> unnecessarily, and not reusing improvements everywhere we could.

> +1

I kind of had the same feeling when jelmer started adding a bunch of verbs that were basically replacing a vfs roundtrip by a smart request roundtrip.

BUT

If we stop maintaining these verbs on the server side, the clients will fallback to vfs.

So we introduce different/better verbs, we can remove the old ones in both the server and the client and all but the newest clients can still fallback to vfs.

The net effect is at least to get to a point where the most recent client/server do not use vfs at all.

This sounds like a good incremental step to me.

'reusing improvements everywhere we could' is still valuable but doesn't to block progress.

review: Approve
Revision history for this message
Jelmer Vernooij (jelmer) wrote : Posted in a previous version of this proposal

Hi Andrew,

> Jelmer Vernooij wrote:
> > Add a HPSS call for ``Repository.iter_inventories``.
> I am a bit concerned that by adding verbs like this, with their own ad hoc
> record stream-like wire format, that we're growing the maintenance burden
> unnecessarily, and not reusing improvements everywhere we could.
>
> I'm glad this is using inventory-deltas. But if it's worth zlib compressing
> them here, shouldn't we do that for inventory-deltas from get_record_stream
> too?
It's a pretty big inventory delta in this case - for something like gcc it matters for the delta between null: and the initial revision that was requested. I haven't investigated whether it would help for record streams too, but I imagine it would have less of an effect.

> Also, could you implement this via the get_record_stream RPC with a new
> parameter that means “just inventories (rather than rev+inv+texts+chks+sigs
> for
> given keys)”?
>
> On one hand it might feel a bit ugly to make one RPC do so many things,
> get_record_stream, do so many things, but on the other I think it provides a
> useful pressure to keep the interface minimal and consistent (e.g. whether
> records are zlib-compressed).
>
> Adding this iter_inventories RPC might be the right thing (although putting
> “iter” in the name of an RPC feels weird to me, that's surely a property of
> the Python API rather than a property of the RPC), but I'd like hear what you
> think about the tradeoffs of doing that vs. extending get_record_stream.

With regard to the name: Most of the other calls seem to be named after the equivalent method in the client / server, that's why I went with Repository.iter_inventories. I'm not particularly tied to that name, something like "Repository.get_inventories" or "Repository.stream_inventories" seems reasonable to me too.

I'm not sure whether this should be part of get_record_stream. I have a hard time understanding the get_record_stream code as it is, so I'd rather not make it even more complex by adding more parameters - for example, get_record_streams seem to be dependent on the repository on-disk format to an extent - the inventory stream is not, as it's using inventory deltas. In other words, adding another verb was simpler, while keeping it all understandable for a mere mortal like me. :-)

Either way, I agree we should be reusing more code between the work I've done recently and the existing record stream
 calls. But it seems to me that would best be done by refactoring so they e.g. use the same code for sending a stream of blobs of indeterminate length, rather than by all being a part of the same verb.

What do you think?

Cheers,

Jelmer

Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal
Download full text (4.5 KiB)

Jelmer Vernooij wrote:

> > I'm glad this is using inventory-deltas. But if it's worth zlib compressing
> > them here, shouldn't we do that for inventory-deltas from get_record_stream
> > too?
> It's a pretty big inventory delta in this case - for something like gcc it
> matters for the delta between null: and the initial revision that was
> requested. I haven't investigated whether it would help for record streams
> too, but I imagine it would have less of an effect.

Well, I know that there are already cases that send deltas from null: —
something to do with stacking perhaps? So reusing this improvement globally
would be nice.

> > Also, could you implement this via the get_record_stream RPC with a new
> > parameter that means “just inventories (rather than rev+inv+texts+chks+sigs
> > for
> > given keys)”?
> >
> > On one hand it might feel a bit ugly to make one RPC do so many things,
> > get_record_stream, do so many things, but on the other I think it provides a
> > useful pressure to keep the interface minimal and consistent (e.g. whether
> > records are zlib-compressed).
> >
> > Adding this iter_inventories RPC might be the right thing (although putting
> > “iter” in the name of an RPC feels weird to me, that's surely a property of
> > the Python API rather than a property of the RPC), but I'd like hear what you
> > think about the tradeoffs of doing that vs. extending get_record_stream.
>
> With regard to the name: Most of the other calls seem to be named after the
> equivalent method in the client / server, that's why I went with
> Repository.iter_inventories. I'm not particularly tied to that name, something
> like "Repository.get_inventories" or "Repository.stream_inventories" seems
> reasonable to me too.

I think of using the name of the Python API as a good guide, not a strict rule.
Certainly there's no insert_stream_locked method on Repository…

“Repository.get_inventories” sounds fine to me.

> I'm not sure whether this should be part of get_record_stream. I have a hard
> time understanding the get_record_stream code as it is, so I'd rather not make
> it even more complex by adding more parameters - for example,
> get_record_streams seem to be dependent on the repository on-disk format to an
> extent - the inventory stream is not, as it's using inventory deltas. In other
> words, adding another verb was simpler, while keeping it all understandable
> for a mere mortal like me. :-)
>
> Either way, I agree we should be reusing more code between the work I've done
> recently and the existing record stream calls. But it seems to me that would
> best be done by refactoring so they e.g. use the same code for sending a
> stream of blobs of indeterminate length, rather than by all being a part of
> the same verb.

Well, the path to reuse we've developed *is* record streams. I'm quite willing
to believe that get_record_stream's parameters aren't a convenient way to
express all needs. But I'd really like it the thing that was returned by a new
verb was a true record stream — something you could pass directly to
insert_record_stream. The idea is that record streams should be the One True
Way we have to say “here is a stream of da...

Read more...

Revision history for this message
Jelmer Vernooij (jelmer) wrote : Posted in a previous version of this proposal
Download full text (6.7 KiB)

Am 05/12/11 11:31, schrieb Andrew Bennetts:
> Jelmer Vernooij wrote:
> …
>>> I'm glad this is using inventory-deltas. But if it's worth zlib compressing
>>> them here, shouldn't we do that for inventory-deltas from get_record_stream
>>> too?
>> It's a pretty big inventory delta in this case - for something like gcc it
>> matters for the delta between null: and the initial revision that was
>> requested. I haven't investigated whether it would help for record streams
>> too, but I imagine it would have less of an effect.
> Well, I know that there are already cases that send deltas from null: —
> something to do with stacking perhaps? So reusing this improvement globally
> would be nice.
I guess the best way to do this would be to add a zlib pack record kind
for network streams? I.e. a new entry in
NetworkRecordStream._kind_factory, and something equivalent on the
server side?
>>> Also, could you implement this via the get_record_stream RPC with a new
>>> parameter that means “just inventories (rather than rev+inv+texts+chks+sigs
>>> for
>>> given keys)”?
>>>
>>> On one hand it might feel a bit ugly to make one RPC do so many things,
>>> get_record_stream, do so many things, but on the other I think it provides a
>>> useful pressure to keep the interface minimal and consistent (e.g. whether
>>> records are zlib-compressed).
>>>
>>> Adding this iter_inventories RPC might be the right thing (although putting
>>> “iter” in the name of an RPC feels weird to me, that's surely a property of
>>> the Python API rather than a property of the RPC), but I'd like hear what you
>>> think about the tradeoffs of doing that vs. extending get_record_stream.
>> With regard to the name: Most of the other calls seem to be named after the
>> equivalent method in the client / server, that's why I went with
>> Repository.iter_inventories. I'm not particularly tied to that name, something
>> like "Repository.get_inventories" or "Repository.stream_inventories" seems
>> reasonable to me too.
> I think of using the name of the Python API as a good guide, not a strict rule.
> Certainly there's no insert_stream_locked method on Repository…
>
> “Repository.get_inventories” sounds fine to me.
I've renamed it.
>
>> I'm not sure whether this should be part of get_record_stream. I have a hard
>> time understanding the get_record_stream code as it is, so I'd rather not make
>> it even more complex by adding more parameters - for example,
>> get_record_streams seem to be dependent on the repository on-disk format to an
>> extent - the inventory stream is not, as it's using inventory deltas. In other
>> words, adding another verb was simpler, while keeping it all understandable
>> for a mere mortal like me. :-)
>>
>> Either way, I agree we should be reusing more code between the work I've done
>> recently and the existing record stream calls. But it seems to me that would
>> best be done by refactoring so they e.g. use the same code for sending a
>> stream of blobs of indeterminate length, rather than by all being a part of
>> the same verb.
> Well, the path to reuse we've developed *is* record streams. I'm quite willing
> to believe that get_record_stream's parameters aren'...

Read more...

Revision history for this message
Vincent Ladeuil (vila) wrote :

Based on the tests and the decrease of roundtrips: approved !

16 + def _vfs_checkout_metadir(self):
17 + self._ensure_real()
18 + return self._real_bzrdir.checkout_metadir()
19 +
20 + def checkout_metadir(self):
21 + medium = self._client._medium
22 + if medium._is_remote_before((2, 5)):
23 + return self._vfs_checkout_metadir()

_vfs_checkout_metadir -> checkout_metadir -> _vfs_checkout_metadir ?

Is that a loop or do I miss something obvious ?

Up to you to wait for spiv's review.

review: Approve
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Am 13/12/11 11:14, schrieb Vincent Ladeuil:
> Review: Approve
>
> Based on the tests and the decrease of roundtrips: approved !
>
> 16 + def _vfs_checkout_metadir(self):
> 17 + self._ensure_real()
> 18 + return self._real_bzrdir.checkout_metadir()
> 19 +
> 20 + def checkout_metadir(self):
> 21 + medium = self._client._medium
> 22 + if medium._is_remote_before((2, 5)):
> 23 + return self._vfs_checkout_metadir()
>
> _vfs_checkout_metadir -> checkout_metadir -> _vfs_checkout_metadir ?
It's not checkout_metadir, but _real_bzrdir.checkout_metadir. :-)
>
> Is that a loop or do I miss something obvious ?
>
> Up to you to wait for spiv's review.
Cheers,

Jelmer

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

> Up to you to wait for spiv's review.
Thanks.

I haven't heard from spiv in the last week, and I'd really like to get this landed so it gets some decent testing for 2.5.0. My recent changes should have addressed his concerns regarding the use of record streams, and I'm happy to tweak whatever is necessary later based on post-commit reviews.

review: Approve
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

sent to pqm by email

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
The diff is not available at this time. You can reload the page or download it.