Merge lp://staging/~xaav/bzr/tarball-generator into lp://staging/bzr

Proposed by xaav
Status: Superseded
Proposed branch: lp://staging/~xaav/bzr/tarball-generator
Merge into: lp://staging/bzr
Diff against target: 536 lines (+248/-100)
4 files modified
bzrlib/export/__init__.py (+55/-21)
bzrlib/export/dir_exporter.py (+16/-2)
bzrlib/export/tar_exporter.py (+163/-75)
bzrlib/export/zip_exporter.py (+14/-2)
To merge this branch: bzr merge lp://staging/~xaav/bzr/tarball-generator
Reviewer Review Type Date Requested Status
xaav (community) Abstain
Martin Pool Pending
Robert Collins Pending
Review via email: mp+63319@code.staging.launchpad.net

This proposal supersedes a proposal from 2011-06-01.

This proposal has been superseded by a proposal from 2011-06-06.

Description of the change

This change fixes the related bug:
- yield after ball.addfile is called.
- export_tarball function wrapper.

Revision history for this message
Jelmer Vernooij (jelmer) wrote : Posted in a previous version of this proposal

Hi Geoff,

On Wed, 2011-06-01 at 22:36 +0000, Geoff wrote:
> Geoff has proposed merging lp:~xaav/bzr/tarball-generator into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
> Related bugs:
> Bug #791005 in Bazaar: "Impossible to get data from export during progress."
> https://bugs.launchpad.net/bzr/+bug/791005
>
> For more details, see:
> https://code.launchpad.net/~xaav/bzr/tarball-generator/+merge/63180
>
> This change fixes the related bug:
> - yeild after ball.addfile is called.
> - export_tarball function wrapper.
Thanks for working on this... it'd be really nice to have export support
in Loggerhead.

I think rather than making export_tarball a generator, it would make
more sense to factor out the body of the loop in export_tarball, which
could then be used by your function and the existing export_tarball
function. I.e. an export_tarball_item function that just returns an (item,
fileobj) tuple.

Can you add a trivial test function for the new method you expose,
whatever it is? That way we won't accidentally change it, and break
loggerhead.

Cheers,

Jelmer

Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal

On Thu, Jun 2, 2011 at 11:01 AM, Jelmer Vernooij <email address hidden> wrote:
> Hi Geoff,
>
> On Wed, 2011-06-01 at 22:36 +0000, Geoff wrote:
>> Geoff has proposed merging lp:~xaav/bzr/tarball-generator into lp:bzr.
>>
>> Requested reviews:
>>   bzr-core (bzr-core)
>> Related bugs:
>>   Bug #791005 in Bazaar: "Impossible to get data from export during progress."
>>   https://bugs.launchpad.net/bzr/+bug/791005
>>
>> For more details, see:
>> https://code.launchpad.net/~xaav/bzr/tarball-generator/+merge/63180
>>
>> This change fixes the related bug:
>> - yeild after ball.addfile is called.
>> - export_tarball function wrapper.
> Thanks for working on this... it'd be really nice to have export support
> in Loggerhead.
>
> I think rather than making export_tarball a generator, it would make
> more sense to factor out the body of the loop in export_tarball, which
> could then be used by your function and the existing export_tarball
> function. I.e. a export_tarball_item function that just returns a (item,
> fileobj) tuple.
>
> Can you add a trivial test function for the new method you expose,
> whatever it is? That way we won't accidentally change it, and break
> loggerhead.

That would still mean duplicating code in loggerhead vs bzrlib, which
frankly sucks.

What's the concern over this being a generator in bzrlib?

-Rob

Revision history for this message
Jelmer Vernooij (jelmer) wrote : Posted in a previous version of this proposal

On Wed, 2011-06-01 at 23:07 +0000, Robert Collins wrote:
> On Thu, Jun 2, 2011 at 11:01 AM, Jelmer Vernooij <email address hidden> wrote:
> > Hi Geoff,
> >
> > On Wed, 2011-06-01 at 22:36 +0000, Geoff wrote:
> >> Geoff has proposed merging lp:~xaav/bzr/tarball-generator into lp:bzr.
> >>
> >> Requested reviews:
> >> bzr-core (bzr-core)
> >> Related bugs:
> >> Bug #791005 in Bazaar: "Impossible to get data from export during progress."
> >> https://bugs.launchpad.net/bzr/+bug/791005
> >>
> >> For more details, see:
> >> https://code.launchpad.net/~xaav/bzr/tarball-generator/+merge/63180
> >>
> >> This change fixes the related bug:
> >> - yeild after ball.addfile is called.
> >> - export_tarball function wrapper.
> > Thanks for working on this... it'd be really nice to have export support
> > in Loggerhead.
> >
> > I think rather than making export_tarball a generator, it would make
> > more sense to factor out the body of the loop in export_tarball, which
> > could then be used by your function and the existing export_tarball
> > function. I.e. a export_tarball_item function that just returns a (item,
> > fileobj) tuple.
> >
> > Can you add a trivial test function for the new method you expose,
> > whatever it is? That way we won't accidentally change it, and break
> > loggerhead.
> That would still mean duplicating code in loggerhead vs bzrlib, which
> frankly sucks.
>
> Whats the concern over this being a generator in bzrlib?
It's a generator, but it still takes a tarball object that it will add
entries to even though it also yields them.

export_tarball would roughly be:

def export_tarball(ball, tree, ...):
   for ie, path in tree.iter_entries_by_dir():
      (item, fileobj) = export_tarball_item(tree, file_id, ...)
      ball.add(item, fileobj)

and loggerhead could do roughly the same thing, except it wouldn't have
to add the items returned to the tarfile object but it could stream them
directly.

Having a separate export_tarball_item function also gives us freedom to
select a different set of files to export - for example, there is a bug
open to support a --exclude option to "bzr export", and there is a bug
about exporting .bzrignore.

Cheers,

Jelmer
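
A rough sketch of the refactoring Jelmer describes above (the helper name and signature are illustrative, not existing bzrlib API; regular files only, for brevity):

import tarfile
from cStringIO import StringIO

def export_tarball_item(tree, root, path, entry):
    # Build one archive member without touching the tarball itself.
    item = tarfile.TarInfo('%s/%s' % (root, path))
    content = tree.get_file_text(entry.file_id)
    item.size = len(content)
    return item, StringIO(content)

def export_tarball(ball, tree, root):
    # The existing exporter keeps its behaviour by adding each member itself;
    # loggerhead could call export_tarball_item and stream each pair instead.
    for path, entry in tree.iter_entries_by_dir():
        if entry.kind != 'file':
            continue
        item, fileobj = export_tarball_item(tree, root, path, entry)
        ball.addfile(item, fileobj)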

Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal

On Thu, Jun 2, 2011 at 11:40 AM, Jelmer Vernooij <email address hidden> wrote:
>> Whats the concern over this being a generator in bzrlib?
> It's a generator, but it still takes a tarball object that it will add
> entries to even though it also yields them.
>
> export_tarball would roughly be:
>
> def export_tarball(ball, tree, ...):
>   for ie, path in tree.iter_entries_by_dir():
>      (item, fileobj) = export_tarball_item(tree, file_id, ...)
>      ball.add(item, fileobj)
>
> and loggerhead could do roughly the same thing, except it wouldn't have
> to add the items returned to the tarfile object but it could stream them
> directly.

That means loggerhead is duplicating code. I think the yield of the
ball was incidental and not the point of the patch.

> Having a separate export_tarball_item function also gives us freedom to
> select a different set of files to export - for example, there is a bug
> open to support a --exclude option to "bzr export", and there is a bug
> about exporting .bzrignore.

Likewise here - a separate inner function may be a useful thing to do
but it doesn't help loggerhead.

Revision history for this message
xaav (xaav) wrote : Posted in a previous version of this proposal

The yield of the ball is not really helpful to anything - I figured it would make more sense to yield the tarball than yield nothing.

Alternatively, we could move the for loop into the tgz, tbz, plain tar, etc. exporters.

All I really care about here is fixing the bug in loggerhead. Please make up your mind about what I'm supposed to do - duplicate code in loggerhead or factor it out here - and inform me of that decision.

Revision history for this message
xaav (xaav) wrote : Posted in a previous version of this proposal

Okay, I've updated the code to use a different method. Please tell me which method you prefer.

Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal

Hi Geoff, sorry you are getting torn in different directions. I've chatted with Jelmer on IRC to avoid this for you.

I think this is pretty close. We may want a test that it works in bzrlib's test suite, to show the contract loggerhead needs.

I think what we want is the following:

* in loggerhead a *single* code path that can handle:
 - tar.gz
 - zip
 - tar.lzma
 - etc

exports. It would be something like you wrote.

output_stream = LoggerheadFilelikeBuffer()
exporter = bzrlib.export.get_export_generator(output_stream, '.tar.gz', tree, revision_id)
for _ in exporter:
    yield output_stream.get_buffer()

* in bzrlib get_export_generator takes care of wrapping the output stream with the right GZipFile etc and raises an exception if it can't stream.

That way we can fixup bzrlib to support zip etc in the future and only have to make cosmetic changes to loggerhead.

What do you think?
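
A sketch of what the bzrlib side of that could look like (get_export_generator and export_tarball_generator are the names being discussed here, not an existing API; the dispatch and the error are illustrative):

import gzip

def get_export_generator(fileobj, format, tree):
    if format in ('.tar.gz', '.tgz'):
        # Wrap the caller's stream so compressed bytes land in it as we go.
        stream = gzip.GzipFile(mode='wb', fileobj=fileobj)
        return export_tarball_generator(tree, stream)
    # Formats that cannot be streamed yet should fail loudly up front.
    raise ValueError('cannot stream export format: %r' % (format,))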

Revision history for this message
xaav (xaav) wrote : Posted in a previous version of this proposal

> Hi Geoff, sorry you are getting torn in different directions. I've chatted
> with Jelmer on IRC to avoid this for you.
>
> I think this is pretty close. We may want a test that it works in bzrlib's
> test suite, to show the contract loggerhead needs.
>
> I think what we want is the following:
>
> * in loggerhead a *single* code path that can handle:
> - tar.gz
> - zip
> - tar.lzma
> - etc
>
> exports. It would be something like you wrote.
>
> output_stream = LoggerheadFilelikeBuffer()
> exporter = bzrlib.export.get_export_generator(output_stream, '.tar.gz', tree,
> revision_id)
> for _ in exporter:
> yield output_stream.get_buffer()
>
> * in bzrlib get_export_generator takes care of wrapping the output stream with
> the right GZipFile etc and raises an exception if it can't stream.
>
> That way we can fixup bzrlib to support zip etc in the future and only have to
> make cosmetic changes to loggerhead.
>
> What do you think?

Sounds good. I should have it coded up by tomorrow.

Revision history for this message
xaav (xaav) wrote : Posted in a previous version of this proposal

Okay, I've reworked this to meet the specifications, but it may need some cleaning-up.

You are welcome to commend in the meantime.

Revision history for this message
Martin Pool (mbp) wrote : Posted in a previous version of this proposal

I commend this patch ;-)

[question] It seems to me that you can get streaming by just observing the chunks being written into the file object, and having a generator as well is redundant. I see this is basically needed because PEP 333 strongly deprecates having streaming data sent to a write function rather than returned by a generator.

Perhaps it would at least be good to explain that it is done this way for the sake of PEP 333.

[question] Given that wsgi does actually support having write() called repeatedly for streaming data, why not use it? I realize it's not the recommended interface, but it is the interface that does fit most naturally with the lower level tarball code we're using. It seems a bit like we're reimplementing something that could be better done in the wsgi framework -- specifically, turning push commands into pull. That would have the concrete benefit of allowing streaming just exactly as data is produced by the compressor, on finer granularity than exporting a single file.
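
For context, a minimal WSGI sketch (not code from this branch; get_export_generator and tree are assumed from the surrounding discussion) of the style PEP 333 prefers, returning an iterable of chunks rather than pushing data through the write() callable:

from cStringIO import StringIO

def export_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'application/x-tar')])
    buf = StringIO()
    for _ in get_export_generator(buf, '.tar.gz', tree):
        data = buf.getvalue()
        buf.seek(0)
        buf.truncate()
        yield data      # each chunk is handed to the server as it is produced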

+
+def get_export_generator(tree, dest=None, format=None, root=None, subdir=None, filtered=False,
+ per_file_timestamps=False, fileobj=None):
+
+ """Returns a generator that exports the given Tree to the specific destination.
...
+ :param fileobj: Optional file object to use

Factoring out this function makes sense.

[fix] Please make the docstring explain what the generator actually produces. I am confused whether the content is written to the file object or returned in the generator; also how fileobj interacts with dest.

[idea] Actually, perhaps it would be better to do away with 'dest' entirely at this layer? I think it is only there for the sake of heuristics about choosing the format, and then for opening the file, and

[fix] I see all the actual 'exporters', and also 'export' itself now return generators which must be consumed for them to work. Generally if we change an api in such a way that existing callers will be wrong (rather than getting an error), we ought to rename the api too, and of course add a news entry. Another option would be to keep the existing function names doing the whole job, and add new ones that are generators.

+def export_tarball_item(tree, ball, root, dp, ie, subdir=None, filtered=False,
+ force_mtime=None):

Lifting this out is good.

[tweak] Per pep8, we indent continuation lines either by 4 spaces, or to be just after the preceding parenthesis.

review: Needs Fixing
Revision history for this message
Martin Pool (mbp) wrote : Posted in a previous version of this proposal

Just for the sake of being clear: as the comment tags are meant to indicate, I'd _like_ to see if there's a simpler way, but I'd be ok with just fixing the docs and the API stability.

Revision history for this message
Jelmer Vernooij (jelmer) wrote : Posted in a previous version of this proposal

per_file_timestamps determines whether or not force_mtime is set; passing down per_file_timestamps to the actual exporters is not necessary since it won't be used.

Revision history for this message
xaav (xaav) wrote : Posted in a previous version of this proposal

> [question] Given that wsgi does actually support having write() called repeatedly for streaming data, why not use it? I realize it's not the recommended interface, but it is the interface that does fit most naturally with the lower level tarball code we're using. It seems a bit like we're reimplementing something that could be better done in the wsgi framework -- specifically, turning push commands into pull. That would have the concrete benefit of allowing streaming just exactly as data is produced by the compressor, on finer granularity than exporting a single file.

It seemed like calling that method wouldn't support chunked encoding - it would instead store the output in a buffer and calculate the content length. Also, the documentation regarding chunked encoding isn't good, and this seemed like the most appropriate way to do it.

Revision history for this message
xaav (xaav) wrote : Posted in a previous version of this proposal

> [fix] Please make the docstring explain what the generator actually produces. I am confused whether the content is written to the file object or returned in the generator; also how fileobj interacts with dest.

The generator does the same thing that had previously been done, just in chunks.

Revision history for this message
xaav (xaav) wrote : Posted in a previous version of this proposal

> [fix] I see all the actual 'exporters', and also 'export' itself now return generators which must be consumed for them to work. Generally if we change an api in such a way that existing callers will be wrong (rather than getting an error), we ought to rename the api too, and of course add a news entry. Another option would be to keep the existing function names doing the whole job, and add new ones that are generators.

I changed how this was done, because I didn't think the code in the finally statement would be executed properly. Please re-review this.

[tweak] Per pep8, we indent continuation lines either by 4 spaces, or to be just after the preceding parenthesis.

Where is this a problem?

Revision history for this message
xaav (xaav) : Posted in a previous version of this proposal
review: Needs Resubmitting
Revision history for this message
Martin Pool (mbp) wrote : Posted in a previous version of this proposal

On 3 June 2011 10:38, Geoff <email address hidden> wrote:
> [fix] Please make the docstring explain what the generator actually produces.  I am confused whether the content is written to the file object or returned in the generator; also how fileobj interacts with dest.
>
> The generator does the same thing that had been previously been done, just in chunks.

Normally Python generators return something on each iteration;
typically the main result of a generator function will be produced
through the generator. This one returns None repeatedly, and the
caller is required to consume the whole thing. This is unusual enough
that it should be clearly documented.
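
A tiny standalone illustration of that pattern (not bzrlib code): the generator yields None, so all of its work happens as a side effect and nothing is written until the caller iterates it to the end.

import sys

def write_numbers(out):
    for i in range(3):
        out.write('%d\n' % i)
        yield                     # yields None; the values carry no information

gen = write_numbers(sys.stdout)   # creating the generator writes nothing yet
for _ in gen:                     # the side effects happen only as it is consumed
    pass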

Revision history for this message
Martin Pool (mbp) wrote : Posted in a previous version of this proposal

On 3 June 2011 10:41, Geoff <email address hidden> wrote:
> [fix] I see all the actual 'exporters', and also 'export' itself now return generators which must be consumed for them to work.  Generally if we change an api in such a way that existing callers will be wrong (rather than getting an error), we ought to rename the api too, and of course add a news entry.  Another option would be to keep the existing function names doing the whole job, and add new ones that are generators.
>
> I changed how this was done, because I didn't think the code in the finally statement would be executed properly. Please re-review this.

It looks like you did change export() back into being a non-generator
function, but the others still are. I am concerned there will be
functions that count on the current behaviour so we ought to fix this.

I see::

-    try:
-        return _exporters[format](tree, dest, root, subdir, filtered=filtered,
-            force_mtime=force_mtime)
-    finally:
-        tree.unlock()
-
+
+    for _ in _exporters[format](tree, dest, root, subdir, filtered,
+            force_mtime, per_file_timestamps, fileobj):
+
+        yield
+
+    tree.unlock()
+
+

[fix] It's almost always a bug, and it is in this case, to have the
unlock code not in either a finally block, or addCleanup, or some
other equivalent. If the export is interrupted or errors, the tree
won't be unlocked.
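
A minimal sketch of the fix being asked for (simplified signature, not the branch's exact code): the unlock belongs in a finally block so the tree is released even if an export step raises or the generator is abandoned.

def get_export_generator(tree, dest, format, root, subdir, filtered,
                         force_mtime, per_file_timestamps, fileobj):
    tree.lock_read()
    try:
        for _ in _exporters[format](tree, dest, root, subdir, filtered,
                                    force_mtime, per_file_timestamps, fileobj):
            yield
    finally:
        # Runs on normal exhaustion, on errors, and when the generator is closed.
        tree.unlock()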

+def export_tarball_item(tree, root, final_path, entry,
+                        filtered=False, force_mtime=None):

[tweak] So this doesn't actually export it, it just packs it up ready
to be exported. Please consider renaming it, and document what is
returned.

[aside] We could probably make this faster by avoiding using a
StringIO in the common case, but that's a separate existing issue.

> [tweak] Per pep8, we indent continuation lines either by 4 spaces, or to be just after the preceding parenthesis.
>
> Where is this a problem?

[tweak] Actually, that specific problem isn't there, it just looked
like it to me because some lines are very long. Our coding
convention, following <http://www.python.org/dev/peps/pep-0008/> is
that they should be wrapped before 80 columns.

hth, thanks

Revision history for this message
xaav (xaav) wrote :

> On 3 June 2011 10:41, Geoff <email address hidden> wrote:
> [fix] I see all the actual 'exporters', and also 'export' itself now return generators which must be consumed for them to work. Generally if we change an api in such a way that existing callers will be wrong (rather than getting an error), we ought to rename the api too, and of course add a news entry. Another option would be to keep the existing function names doing the whole job, and add new ones that are generators.
>
> I changed how this was done, because I didn't think the code in the finally statement would be executed properly. Please re-review this.
>
> It looks like you did change export() back into being a non-generator
> function, but the others still are. I am concerned there will be
> functions that count on the current behaviour so we ought to fix this.

The way the code base is currently set up, there is no need to use the other functions. They should just be passing in the information to the export() function.

> I see::
>
> -    try:
> -        return _exporters[format](tree, dest, root, subdir, filtered=filtered,
> -            force_mtime=force_mtime)
> -    finally:
> -        tree.unlock()
> -
> +
> +    for _ in _exporters[format](tree, dest, root, subdir, filtered,
> +            force_mtime, per_file_timestamps, fileobj):
> +
> +        yield
> +
> +    tree.unlock()
> +
> +

> [fix] It's almost always a bug, and it is in this case, to have the
> unlock code not in either a finally block, or addCleanup, or some
> other equivalent. If the export is interrupted or errors, the tree
> won't be unlocked.

Fixed and will show up on the next push.

5968. By xaav

Moved unlock function into finally block.

Revision history for this message
xaav (xaav) wrote :

> +def export_tarball_item(tree, root, final_path, entry,
> +                        filtered=False, force_mtime=None):
>
> [tweak] So this doesn't actually export it, it just packs it up ready
> to be exported. Please consider renaming it, and document what is
> returned.

Fixed and will show up on the next push.

5969. By xaav

Renamed export item function in tar exporter.

Revision history for this message
xaav (xaav) wrote :

> [aside] We could probably make this faster by avoiding using a
> StringIO in the common case, but that's a separate existing issue.

> [tweak] Per pep8, we indent continuation lines either by 4 spaces, or to be just after the preceding parenthesis.
>
> Where is this a problem?

> [tweak] Actually, that specific problem isn't there, it just looked
> like it to me because some lines are very long. Our coding
> convention, following <http://www.python.org/dev/peps/pep-0008/> is
> that they should be wrapped before 80 columns.

Okay, I'll be sure and fix this.

5970. By xaav

Fixed line ending problems.

Revision history for this message
xaav (xaav) wrote :

Fixed. Are there any more issues?

review: Abstain
Revision history for this message
Martin Pool (mbp) wrote :

On 3 June 2011 13:40, Geoff <email address hidden> wrote:
> On 3 June 2011 10:41, Geoff <email address hidden> wrote:
>> [fix] I see all the actual 'exporters', and also 'export' itself now return generators which must be consumed for them to work.  Generally if we change an api in such a way that existing callers will be wrong (rather than getting an error), we ought to rename the api too, and of course add a news entry.  Another option would be to keep the existing function names doing the whole job, and add new ones that are generators.
>>
>> I changed how this was done, because I didn't think the code in the finally statement would be executed properly. Please re-review this.
>>
>> It looks like you did change export() back into being a non-generator
>> function, but the others still are.  I am concerned there will be
>> functions that count on the current behaviour so we ought to fix this.
>
> The way the code base is currently set up, there is no need to use the other functions. They should just be passing in the information to the export() function.

If you're reasoning that therefore nobody actually will be using them,
you're probably right, but I wouldn't bet on it. Renaming them is
cheap.

Thanks

Revision history for this message
Martin Pool (mbp) wrote :

... so the only remaining thing I would like is to rename the per-format exporters to say tar_lzma_export_generator etc.

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

On Fri, 2011-06-03 at 03:56 +0000, Geoff wrote:
> Review: Abstain
> Fixed. Are there any more issues?
per_file_timestamps and force_mtime do the same thing, no function
should take both.

Cheers,

Jelmer
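
A sketch of the relationship Jelmer is pointing at (the helper name is made up; the real code may differ): per_file_timestamps is only consulted once, to decide force_mtime, so the per-format exporters never need to see it.

import time

def _choose_mtime(per_file_timestamps):
    if per_file_timestamps:
        return None          # exporters fall back to each file's own timestamp
    return time.time()       # otherwise stamp every entry with the export time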

Revision history for this message
xaav (xaav) wrote :

Okay, I'll rename them and then create wrapper functions that return the original contents.
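
A sketch of what such a wrapper could look like (names and signature are illustrative, not the branch's final code):

def tgz_exporter(tree, dest, root, subdir, filtered=False, force_mtime=None):
    # Backwards-compatible wrapper: consuming the generator performs the whole
    # export, so existing callers of the old name keep working unchanged.
    for _ in tgz_export_generator(tree, dest, root, subdir, filtered,
                                  force_mtime):
        pass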

Revision history for this message
xaav (xaav) wrote :

> per_file_timestamps and force_mtime do the same thing, no function
> should take both.

Oops! I'll look into this.

5971. By xaav

Fixed force_mtime problem in tar exporter and created wrapper functions.

5972. By xaav

Fixed compatibility issues with directory exporter.

5973. By xaav

Fixed compatibility issues in zip_exporter.

5974. By xaav

Updated function references.

Revision history for this message
xaav (xaav) wrote :

Okay, this should fix all the issues pointed out so far.

Revision history for this message
xaav (xaav) wrote :

If there are no further issues then I would appreciate if someone would merge this.

5975. By xaav

Updated docstring.

5976. By xaav <email address hidden>

Merged branch

5977. By xaav

Updated release notes & What's new.

5978. By xaav

Fixed clumsy typing.

5979. By xaav

Fixed export.

Unmerged revisions

5979. By xaav

Fixed export.

5978. By xaav

Fixed clumsy typing.
