Merge lp://staging/~jameinel/bzr-fastimport/less-sticky into lp://staging/bzr-fastimport
Status: Merged
Approved by: Ian Clatworthy
Approved revision: not available
Merged at revision: not available
Proposed branch: lp://staging/~jameinel/bzr-fastimport/less-sticky
Merge into: lp://staging/bzr-fastimport
Diff against target: 595 lines (+414/-45), 6 files modified:
bzr_commit_handler.py (+20/-16), cache_manager.py (+28/-0), revision_store.py (+157/-19), tests/__init__.py (+11/-10), tests/test_generic_processor.py (+51/-0), tests/test_revision_store.py (+147/-0)
To merge this branch: bzr merge lp://staging/~jameinel/bzr-fastimport/less-sticky
Related bugs: none listed
Reviewer: Ian Clatworthy (Approve)
Review via email:
Commit message
Description of the change
John A Meinel (jameinel) wrote:
John A Meinel (jameinel) wrote:
The old merge request is:
https:/
Ian Clatworthy (ian-clatworthy) wrote:
Thanks a bucket-load for these changes. I'll take a detailed look next Monday when my brain is a little more awake.
FWIW, my main concern is bzr version compatibility. Until now, fastimport has supported early versions of bzr: 1.1 according to README.txt. I'm happy to bump that to 2.0.0 (say) but I don't want to depend on any APIs only in the 2.1.0 branch yet. Off the top of your head, do we need to tweak any of this patch accordingly?
John A Meinel (jameinel) wrote:
Ian Clatworthy wrote:
> Review: Needs Information
> Thanks a bucket-load for these changes. I'll take a detailed look next Monday when my brain is a little more awake.
>
> FWIW, my main concern is bzr version compatibility. Until now, fastimport has supported early versions of bzr: 1.1 according to README.txt. I'm happy to bump that to 2.0.0 (say) but I don't want to depend on any APIs only in the 2.1.0 branch yet. Off the top of your head, do we need to tweak any of this patch accordingly?
I don't know of anything, but it is best to run an import with an older
bzr to make sure :).
There is at least one bugfix that landed in trunk, but the code already
checks for that:
if isinstance(
...
KnownGraph was introduced in bzr 1.17, and though 'add_node' is new, it
also has a 'getattr()' check around that (though, come to think of it, I
think that is in a different patch).
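A minimal sketch of the getattr() guard being described, using stand-in classes rather than real bzrlib objects (the real check wraps KnownGraph.add_node, which only exists in bzr 1.17 and later):

```python
# Feature-detect a method that only newer library versions provide.
# These classes are illustrative stand-ins, not bzrlib code.

class OldGraph(object):
    """Mimics an older release without add_node()."""
    def __init__(self):
        self.nodes = {}

class NewGraph(OldGraph):
    """Mimics a newer release that grew add_node()."""
    def add_node(self, key, parent_keys):
        self.nodes[key] = parent_keys

def register(graph, key, parent_keys):
    # getattr() with a default returns None instead of raising
    # AttributeError, so older versions fall through cleanly.
    add_node = getattr(graph, 'add_node', None)
    if add_node is not None:
        add_node(key, parent_keys)
        return 'new-api'
    return 'fallback'
```

With this shape, `register(NewGraph(), 'rev-2', ['rev-1'])` takes the fast path while `register(OldGraph(), 'rev-2', ['rev-1'])` falls back, which is why the plugin can keep supporting older bzr releases.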
I think that if the switch to CommitBuilder is finalized, then we should
look into cleaning up the code, as you have a lot of duplicate logic in
there.
John
=:->
Ian Clatworthy (ian-clatworthy) wrote:
I've been experimenting with this branch on numerous different inputs and the results need some further digging. On bzr itself for example, fastimport time increases from 24 minutes to 33 minutes and the memory consumption increases a little as well. On Firefox, fastimport time increases from 52 minutes to 79 minutes (I didn't check memory). Hmm.
My initial reaction was that it might be the disk blob caching. However, disabling that only drops the time for bzr to 32 minutes so I'm suspecting it's more related to using CommitBuilder. At least in that case.
I'm certainly happy with most of the code I've checked and I fully trust you to do a better job of writing it than me. Even so, I might proceed by merging this in a few different pieces.
John A Meinel (jameinel) wrote:
Ian Clatworthy wrote:
> Review: Approve
> I've been experimenting with this branch on numerous different inputs and the results need some further digging. On bzr itself for example, fastimport time increases from 24 minutes to 33 minutes and the memory consumption increases a little as well. On Firefox, fastimport time increases from 52 minutes to 79 minutes (I didn't check memory). Hmm.
>
> My initial reaction was that it might be the disk blob caching. However, disabling that only drops the time for bzr to 32 minutes so I'm suspecting it's more related to using CommitBuilder. At least in that case.
>
> I'm certainly happy with most of the code I've checked and I fully trust you to do a better job of writing it than me. Even so, I might proceed by merging this in a few different pieces.
>
>
Does this include the KnownGraph code changes? I'm wondering what sort
of result you would get from that. I would guess that this code is,
indeed, slower than what you had, as it is doing more work. (At merge
time it takes the diff against all parents not just the left-hand one.)
However, if we can make parts faster, then 'bzr commit' will also
benefit. Adding KnownGraph to handle heads() should be quite a bit
better, though there are some small things like how many actual merges
there are in the codebase.
xserver is probably not a great example, as it is quite linear for most
of the history. I should probably try a different one going forward.
John
=:->
Ian Clatworthy (ian-clatworthy) wrote:
John A Meinel wrote:
> Does this include the KnownGraph code changes?
Not yet. That's a separate review.
> xserver is probably not a great example, as it is quite linear for most
> of the history. I should probably try a different one going forward.
It's worth testing on a few projects from different sources, e.g.
xserver from git, bzr from bzr and Thunderbird from hg, say. All of
those are medium size but should complete quickly enough. Every now and
then, try the really big stuff like the kernel, mysql and OOo.
Some questions about TreeShim now that I'm looking at the code harder:
1. id2path() looks wrong. It should return newpath if it's not None?
2. What value does get_reference_
Ian C.
John A Meinel (jameinel) wrote:
Ian Clatworthy wrote:
> John A Meinel wrote:
>
>> Does this include the KnownGraph code changes?
>
> Not yet. That's a separate review.
>
>> xserver is probably not a great example, as it is quite linear for most
>> of the history. I should probably try a different one going forward.
>
> It's worth testing on a few projects from different sources, e.g.
> xserver from git, bzr from bzr and Thunderbird from hg, say. All of
> those are medium size but should complete quickly enough. Every now and
> then, try the really big stuff like the kernel, mysql and OOo.
>
> Some questions about TreeShim now that I'm looking at the code harder:
>
> 1. id2path() looks wrong. It should return newpath if it's not None?
I think you are correct. I certainly did the lookup, it would make sense
to return that value.
The code as written seems to only trap for when an object is deleted.
That said, I would hope we don't use id2path very often in the commit
code. I'm pretty sure it is done by some code paths (as I added it
because I was auditing the commit code). It seems to be called in the
"unchanged_merged" code path.
Which is 'things different from a right-hand parent'. So I *think* that
for it to fail, you would need to add a file relative to both parents,
which means that at merge time you 'bzr add' a new file. That is
probably a rather rare condition, since making a lot of "unrelated"
changes during a merge is not recommended behavior.
However, it certainly sounds like something worth adding a test case
for. :)
>
> 2. What value does get_reference_
Not much yet, but it is one of the functions that CommitBuilder would
look at if we had trees with references. I don't have any idea what that
would look like for a fast-import stream, but since git has submodule
support, I would assume there is *something* it can put in the
fast-import stream, which we could then map to a tree_reference.
If you want to take it out, that is fine. But I figured a
NotImplementedError would probably be better than an AttributeError.
>
> Ian C.
John
=:->
John A Meinel (jameinel) wrote:
...
>> 1. id2path() looks wrong. It should return newpath if it's not None?
>
> I think you are correct. I certainly did the lookup, it would make sense
> to return that value.
>
> The code as written seems to only trap for when an object is deleted.
>
> That said, I would hope we don't use id2path very often in the commit
> code. I'm pretty sure it is done by some code paths (as I added it
> because I was auditing the commit code). It seems to be called in the
> "unchanged_merged" code path.
>
> Which is 'things different from a right-hand parent'. So I *think* for
> it to fail, you would need to add a file versus both parents. Which
> means that at merge time, you 'bzr add' a new file. Which is probably a
> rather rare condition. Since it is unrecommended behavior to do a lot of
> "unrelated" changes during a merge.
>
> However, it certainly sounds like something worth adding a test case. :)
>
So I think I'm wrong about how to trigger this. I'll try to track it
down. As it stands today, only one test case triggers id2path, and it is
the one I added for merge+revert. The conditions are:
1) The file must be considered modified in a merged revision relative to
BASE. This puts it into the 'merged_ids' list.
2) The file must not be considered modified in THIS versus BASE. This
puts it into the 'unchanged_merged' set.
If (1) and (2) aren't true, then id2path is not called for the file.
However, as an interesting side effect:
3) It must be in the THIS vs BASE delta for it to show up in
_new_info_by_id. And this negates (2).
Which sounds like what we really need is just an assertion that
file_id is not in _new_info_by_id.
Anyway, so it doesn't actually matter that it is broken. We could add
some direct tests for it, but I'm inclined to keep _TreeShim as minimal
as possible, and only implement stuff that can be tested during the
import. I suppose commit builder could change in the future...
John
=:->
272. By John A Meinel: Add a bunch of direct tests for the _TreeShim interface.
John A Meinel (jameinel) wrote:
...
> Some questions about TreeShim now that I'm looking at the code harder:
>
> 1. id2path() looks wrong. It should return newpath if it's not None?
I did end up adding some direct tests to _TreeShim, and correcting that code path.
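The corrected behavior can be condensed into a standalone sketch (simplified from the full _TreeShim in the preview diff below; bzrlib's NoSuchId is replaced by KeyError so the snippet runs on its own):

```python
# Sketch of the fixed id2path(): consult the delta first, return the
# new path when one exists, treat a None new_path as a deletion, and
# otherwise fall back to the basis tree.

class MiniTreeShim(object):
    def __init__(self, basis_paths, inv_delta):
        # basis_paths: {file_id: path} in the basis tree
        self._basis_paths = basis_paths
        self._new_info_by_id = dict(
            (file_id, (new_path, ie))
            for _, new_path, file_id, ie in inv_delta)

    def id2path(self, file_id):
        if file_id in self._new_info_by_id:
            new_path = self._new_info_by_id[file_id][0]
            if new_path is None:        # removed by this delta
                raise KeyError(file_id)
            return new_path             # the fix: actually return it
        return self._basis_paths[file_id]

shim = MiniTreeShim(
    {'foo-id': 'foo', 'baz-id': 'bar/baz'},
    [('foo', 'foo2', 'foo-id', 'entry-foo'),   # rename foo -> foo2
     ('bar/baz', None, 'baz-id', None)])       # delete bar/baz
```

Here `shim.id2path('foo-id')` yields the renamed path `'foo2'`, and looking up the deleted `'baz-id'` raises, mirroring the direct tests added in revision 272.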
>
> 2. What value does get_reference_
>
> Ian C.
As mentioned, it means that if we ever have tree references, you'll get a NotImplementedError rather than an AttributeError. Otherwise, it doesn't add much of anything.
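The difference being argued for can be seen in a small sketch (both classes here are hypothetical stand-ins, not code from the patch):

```python
# A stub that raises NotImplementedError signals "deliberately
# unsupported"; a missing method raises AttributeError, which
# usually reads as a typo or bug. Illustrative shims only.

class ShimWithStub(object):
    def get_reference_revision(self, file_id, path=None):
        raise NotImplementedError(ShimWithStub.get_reference_revision)

class ShimWithoutStub(object):
    pass

def probe(shim):
    try:
        shim.get_reference_revision('subtree-id')
    except NotImplementedError:
        return 'unsupported feature'   # clearly intentional
    except AttributeError:
        return 'possible bug'          # indistinguishable from a typo
```

`probe(ShimWithStub())` reports the feature as deliberately unsupported, while `probe(ShimWithoutStub())` looks like a programming error, which is the distinction the stub buys.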
Preview Diff
=== modified file 'bzr_commit_handler.py'
--- bzr_commit_handler.py 2009-11-30 21:31:31 +0000
+++ bzr_commit_handler.py 2009-12-09 20:15:27 +0000
@@ -71,9 +71,9 @@
         """Prepare for committing."""
         self.revision_id = self.gen_revision_id()
         # cache of texts for this commit, indexed by file-id
-        self.lines_for_commit = {}
+        self.data_for_commit = {}
         #if self.rev_store.expects_rich_root():
-        self.lines_for_commit[inventory.ROOT_ID] = []
+        self.data_for_commit[inventory.ROOT_ID] = []

         # Track the heads and get the real parent list
         parents = self.cache_mgr.track_heads(self.command)
@@ -126,9 +126,13 @@
         self.cache_mgr.inventories[revision_id] = inv
         return inv

+    def _get_data(self, file_id):
+        """Get the data bytes for a file-id."""
+        return self.data_for_commit[file_id]
+
     def _get_lines(self, file_id):
         """Get the lines for a file-id."""
-        return self.lines_for_commit[file_id]
+        return osutils.split_lines(self._get_data(file_id))

     def _get_per_file_parents(self, file_id):
         """Get the lines for a file-id."""
@@ -288,20 +292,20 @@
         ie.revision = self.revision_id
         if kind == 'file':
             ie.executable = is_executable
-            lines = osutils.split_lines(data)
-            ie.text_sha1 = osutils.sha_strings(lines)
-            ie.text_size = sum(map(len, lines))
-            self.lines_for_commit[file_id] = lines
+            # lines = osutils.split_lines(data)
+            ie.text_sha1 = osutils.sha_string(data)
+            ie.text_size = len(data)
+            self.data_for_commit[file_id] = data
         elif kind == 'directory':
             self.directory_entries[path] = ie
             # There are no lines stored for a directory so
             # make sure the cache used by get_lines knows that
-            self.lines_for_commit[file_id] = []
+            self.data_for_commit[file_id] = ''
         elif kind == 'symlink':
             ie.symlink_target = data.encode('utf8')
             # There are no lines stored for a symlink so
             # make sure the cache used by get_lines knows that
-            self.lines_for_commit[file_id] = []
+            self.data_for_commit[file_id] = ''
         else:
             self.warning("Cannot import items of kind '%s' yet - ignoring '%s'"
                 % (kind, path))
@@ -345,7 +349,7 @@
             self.directory_entries[dirname] = ie
             # There are no lines stored for a directory so
             # make sure the cache used by get_lines knows that
-            self.lines_for_commit[dir_file_id] = []
+            self.data_for_commit[dir_file_id] = ''

         # It's possible that a file or symlink with that file-id
         # already exists. If it does, we need to delete it.
@@ -415,7 +419,7 @@
         kind = ie.kind
         if kind == 'file':
             if newly_changed:
-                content = ''.join(self.lines_for_commit[file_id])
+                content = self.data_for_commit[file_id]
             else:
                 content = self.rev_store.get_file_text(self.parents[0], file_id)
             self._modify_item(dest_path, kind, ie.executable, content, inv)
@@ -451,7 +455,7 @@
         # that means the loader then needs to know what the "new" text is.
         # We therefore must go back to the revision store to get it.
         lines = self.rev_store.get_file_lines(rev_id, file_id)
-        self.lines_for_commit[file_id] = lines
+        self.data_for_commit[file_id] = ''.join(lines)

     def _delete_all_items(self, inv):
         for name, root_item in inv.root.children.iteritems():
@@ -499,7 +503,7 @@
         """Save the revision."""
         self.cache_mgr.inventories[self.revision_id] = self.inventory
         self.rev_store.load(self.revision, self.inventory, None,
-            lambda file_id: self._get_lines(file_id),
+            lambda file_id: self._get_data(file_id),
             lambda file_id: self._get_per_file_parents(file_id),
             lambda revision_ids: self._get_inventories(revision_ids))

@@ -598,9 +602,9 @@
         delta = self._get_final_delta()
         inv = self.rev_store.load_using_delta(self.revision,
             self.basis_inventory, delta, None,
-            lambda file_id: self._get_lines(file_id),
-            lambda file_id: self._get_per_file_parents(file_id),
-            lambda revision_ids: self._get_inventories(revision_ids))
+            self._get_data,
+            self._get_per_file_parents,
+            self._get_inventories)
         self.cache_mgr.inventories[self.revision_id] = inv
         #print "committed %s" % self.revision_id


=== modified file 'cache_manager.py'
--- cache_manager.py 2009-12-08 06:26:34 +0000
+++ cache_manager.py 2009-12-09 20:15:26 +0000
@@ -54,6 +54,34 @@
         shutils.rmtree(self.tempdir)


+
+class _Cleanup(object):
+    """This class makes sure we clean up when CacheManager goes away.
+
+    We use a helper class to ensure that we are never in a refcycle.
+    """
+
+    def __init__(self, disk_blobs):
+        self.disk_blobs = disk_blobs
+        self.tempdir = None
+        self.small_blobs = None
+
+    def __del__(self):
+        self.finalize()
+
+    def finalize(self):
+        if self.disk_blobs is not None:
+            for info in self.disk_blobs.itervalues():
+                if info[-1] is not None:
+                    os.unlink(info[-1])
+            self.disk_blobs = None
+        if self.small_blobs is not None:
+            self.small_blobs.close()
+            self.small_blobs = None
+        if self.tempdir is not None:
+            shutils.rmtree(self.tempdir)
+
+
 class CacheManager(object):

     _small_blob_threshold = 25*1024

=== modified file 'revision_store.py'
--- revision_store.py 2009-10-25 22:05:48 +0000
+++ revision_store.py 2009-12-09 20:15:27 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2008 Canonical Ltd
+# Copyright (C) 2008, 2009 Canonical Ltd
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -16,11 +16,137 @@

 """An abstraction of a repository providing just the bits importing needs."""

+import cStringIO

 from bzrlib import errors, inventory, knit, lru_cache, osutils, trace
 from bzrlib import revision as _mod_revision


+class _TreeShim(object):
+    """Fake a Tree implementation.
+
+    This implements just enough of the tree api to make commit builder happy.
+    """
+
+    def __init__(self, repo, basis_inv, inv_delta, content_provider):
+        self._repo = repo
+        self._content_provider = content_provider
+        self._basis_inv = basis_inv
+        self._inv_delta = inv_delta
+        self._new_info_by_id = dict([(file_id, (new_path, ie))
+                                    for _, new_path, file_id, ie in inv_delta])
+
+    def id2path(self, file_id):
+        if file_id in self._new_info_by_id:
+            new_path = self._new_info_by_id[file_id][0]
+            if new_path is None:
+                raise errors.NoSuchId(self, file_id)
+            return new_path
+        return self._basis_inv.id2path(file_id)
+
+    def path2id(self, path):
+        # CommitBuilder currently only requires access to the root id. We don't
+        # build a map of renamed files, etc. One possibility if we ever *do*
+        # need more than just root, is to defer to basis_inv.path2id() and then
+        # check if the file_id is in our _new_info_by_id dict. And in that
+        # case, return _new_info_by_id[file_id][0]
+        if path != '':
+            raise NotImplementedError(_TreeShim.path2id)
+        # TODO: Handle root renames?
+        return self._basis_inv.root.file_id
+
+    def get_file_with_stat(self, file_id, path=None):
+        try:
+            content = self._content_provider(file_id)
+        except KeyError:
+            # The content wasn't shown as 'new'. Just validate this fact
+            assert file_id not in self._new_info_by_id
+            old_ie = self._basis_inv[file_id]
+            old_text_key = (file_id, old_ie.revision)
+            stream = self._repo.texts.get_record_stream([old_text_key],
+                                                        'unordered', True)
+            content = stream.next().get_bytes_as('fulltext')
+        sio = cStringIO.StringIO(content)
+        return sio, None
+
+    def get_symlink_target(self, file_id):
+        if file_id in self._new_info_by_id:
+            ie = self._new_info_by_id[file_id][1]
+            return ie.symlink_target
+        return self._basis_inv[file_id].symlink_target
+
+    def get_reference_revision(self, file_id, path=None):
+        raise NotImplementedError(_TreeShim.get_reference_revision)
+
+    def _delta_to_iter_changes(self):
+        """Convert the inv_delta into an iter_changes repr."""
+        # iter_changes is:
+        #    (file_id,
+        #     (old_path, new_path),
+        #     content_changed,
+        #     (old_versioned, new_versioned),
+        #     (old_parent_id, new_parent_id),
+        #     (old_name, new_name),
+        #     (old_kind, new_kind),
+        #     (old_exec, new_exec),
+        #    )
+        basis_inv = self._basis_inv
+        for old_path, new_path, file_id, ie in self._inv_delta:
+            # Perf: Would this be faster if we did 'if file_id in basis_inv'?
+            # Since the *very* common case is that the file already exists, it
+            # probably is better to optimize for that
+            try:
+                old_ie = basis_inv[file_id]
+            except errors.NoSuchId:
+                old_ie = None
+                if ie is None:
+                    raise AssertionError('How is both old and new None?')
+                    change = (file_id,
+                        (old_path, new_path),
+                        False,
+                        (False, False),
+                        (None, None),
+                        (None, None),
+                        (None, None),
+                        (None, None),
+                        )
+                change = (file_id,
+                    (old_path, new_path),
+                    True,
+                    (False, True),
+                    (None, ie.parent_id),
+                    (None, ie.name),
+                    (None, ie.kind),
+                    (None, ie.executable),
+                    )
+            else:
+                if ie is None:
+                    change = (file_id,
+                        (old_path, new_path),
+                        True,
+                        (True, False),
+                        (old_ie.parent_id, None),
+                        (old_ie.name, None),
+                        (old_ie.kind, None),
+                        (old_ie.executable, None),
+                        )
                else:
+                    content_modified = (ie.text_sha1 != old_ie.text_sha1
+                                        or ie.text_size != old_ie.text_size)
+                    # TODO: ie.kind != old_ie.kind
+                    # TODO: symlinks changing targets, content_modified?
+                    change = (file_id,
+                              (old_path, new_path),
+                              content_modified,
+                              (True, True),
+                              (old_ie.parent_id, ie.parent_id),
+                              (old_ie.name, ie.name),
+                              (old_ie.kind, ie.kind),
+                              (old_ie.executable, ie.executable),
+                              )
+            yield change
+
+
 class AbstractRevisionStore(object):

     def __init__(self, repo):
@@ -224,29 +350,41 @@
           including an empty inventory for the missing revisions
           If None, a default implementation is provided.
         """
-        # Get the non-ghost parents and their inventories
-        if inventories_provider is None:
-            inventories_provider = self._default_inventories_provider
-        present_parents, parent_invs = inventories_provider(rev.parent_ids)
+        # TODO: set revision_id = rev.revision_id
+        builder = self.repo._commit_builder_class(self.repo,
+            parents=rev.parent_ids, config=None, timestamp=rev.timestamp,
+            timezone=rev.timezone, committer=rev.committer,
+            revprops=rev.properties, revision_id=rev.revision_id)

-        # Load the inventory
-        try:
-            rev_id = rev.revision_id
-            rev.inventory_sha1, inv = self._add_inventory_by_delta(
-                rev_id, basis_inv, inv_delta, present_parents, parent_invs)
-        except errors.RevisionAlreadyPresent:
+        if rev.parent_ids:
+            basis_rev_id = rev.parent_ids[0]
+        else:
+            basis_rev_id = _mod_revision.NULL_REVISION
+        tree = _TreeShim(self.repo, basis_inv, inv_delta, text_provider)
+        changes = tree._delta_to_iter_changes()
+        for (file_id, path, fs_hash) in builder.record_iter_changes(
+                tree, basis_rev_id, changes):
+            # So far, we don't *do* anything with the result
             pass
+        builder.finish_inventory()
+        # TODO: This is working around a bug in the bzrlib code base.
+        #       'builder.finish_inventory()' ends up doing:
+        #       self.inv_sha1 = self.repository.add_inventory_by_delta(...)
+        #       However, add_inventory_by_delta returns (sha1, inv)
+        #       And we *want* to keep a handle on both of those objects
+        if isinstance(builder.inv_sha1, tuple):
+            builder.inv_sha1, builder.new_inventory = builder.inv_sha1
+        # This is a duplicate of Builder.commit() since we already have the
+        # Revision object, and we *don't* want to call commit_write_group()
+        rev.inv_sha1 = builder.inv_sha1
+        builder.repository.add_revision(builder._new_revision_id, rev,
+                                        builder.new_inventory, builder._config)

-        # Load the texts, signature and revision
-        file_rev_ids_needing_texts = [(id, ie.revision)
-            for _, n, id, ie in inv_delta
-            if n is not None and ie.revision == rev_id]
-        self._load_texts_for_file_rev_ids(file_rev_ids_needing_texts,
-            text_provider, parents_provider)
         if signature is not None:
+            raise AssertionError('signatures not guaranteed yet')
             self.repo.add_signature_text(rev_id, signature)
-        self._add_revision(rev, inv)
-        return inv
+        # self._add_revision(rev, inv)
+        return builder.revision_tree().inventory

     def _non_root_entries_iter(self, inv, revision_id):
         if hasattr(inv, 'iter_non_root_entries'):

=== modified file 'tests/__init__.py'
--- tests/__init__.py 2009-02-25 11:33:28 +0000
+++ tests/__init__.py 2009-12-09 20:15:27 +0000
@@ -21,15 +21,16 @@


 def test_suite():
-    module_names = [
-        'bzrlib.plugins.fastimport.tests.test_branch_mapper',
-        'bzrlib.plugins.fastimport.tests.test_commands',
-        'bzrlib.plugins.fastimport.tests.test_errors',
-        'bzrlib.plugins.fastimport.tests.test_filter_processor',
-        'bzrlib.plugins.fastimport.tests.test_generic_processor',
-        'bzrlib.plugins.fastimport.tests.test_head_tracking',
-        'bzrlib.plugins.fastimport.tests.test_helpers',
-        'bzrlib.plugins.fastimport.tests.test_parser',
-        ]
+    module_names = [__name__ + '.' + x for x in [
+        'test_branch_mapper',
+        'test_commands',
+        'test_errors',
+        'test_filter_processor',
+        'test_generic_processor',
+        'test_head_tracking',
+        'test_helpers',
+        'test_parser',
+        'test_revision_store',
+        ]]
     loader = TestLoader()
     return loader.loadTestsFromModuleNames(module_names)

=== modified file 'tests/test_generic_processor.py'
--- tests/test_generic_processor.py 2009-11-12 07:51:21 +0000
+++ tests/test_generic_processor.py 2009-12-09 20:15:26 +0000
@@ -1832,3 +1832,54 @@
     def test_import_symlink(self):
         handler, branch = self.get_handler()
         handler.process(self.get_command_iter('foo', 'symlink', 'bar'))
+
+
+class TestModifyRevertInBranch(TestCaseForGenericProcessor):
+
+    def file_command_iter(self):
+        # A     add 'foo'
+        # |\
+        # | B   modify 'foo'
+        # | |
+        # | C   revert 'foo' back to A
+        # |/
+        # D     merge 'foo'
+        def command_list():
+            committer_a = ['', 'a@elmer.com', time.time(), time.timezone]
+            committer_b = ['', 'b@elmer.com', time.time(), time.timezone]
+            committer_c = ['', 'c@elmer.com', time.time(), time.timezone]
+            committer_d = ['', 'd@elmer.com', time.time(), time.timezone]
+            def files_one():
+                yield commands.FileModifyCommand('foo', 'file', False,
+                        None, "content A\n")
+            yield commands.CommitCommand('head', '1', None,
+                committer_a, "commit 1", None, [], files_one)
+            def files_two():
+                yield commands.FileModifyCommand('foo', 'file', False,
+                        None, "content B\n")
+            yield commands.CommitCommand('head', '2', None,
+                committer_b, "commit 2", ":1", [], files_two)
+            def files_three():
+                yield commands.FileModifyCommand('foo', 'file', False,
+                        None, "content A\n")
+            yield commands.CommitCommand('head', '3', None,
+                committer_c, "commit 3", ":2", [], files_three)
+            yield commands.CommitCommand('head', '4', None,
+                committer_d, "commit 4", ":1", [':3'], lambda: [])
+        return command_list
+
+    def test_modify_revert(self):
+        handler, branch = self.get_handler()
+        handler.process(self.file_command_iter())
+        branch.lock_read()
+        self.addCleanup(branch.unlock)
+        rev_d = branch.last_revision()
+        rev_a, rev_c = branch.repository.get_parent_map([rev_d])[rev_d]
+        rev_b = branch.repository.get_parent_map([rev_c])[rev_c][0]
+        rtree_a, rtree_b, rtree_c, rtree_d = branch.repository.revision_trees([
+            rev_a, rev_b, rev_c, rev_d])
+        foo_id = rtree_a.path2id('foo')
+        self.assertEqual(rev_a, rtree_a.inventory[foo_id].revision)
+        self.assertEqual(rev_b, rtree_b.inventory[foo_id].revision)
+        self.assertEqual(rev_c, rtree_c.inventory[foo_id].revision)
+        self.assertEqual(rev_c, rtree_d.inventory[foo_id].revision)

=== added file 'tests/test_revision_store.py'
--- tests/test_revision_store.py 1970-01-01 00:00:00 +0000
+++ tests/test_revision_store.py 2009-12-09 20:15:27 +0000
@@ -0,0 +1,147 @@
+# Copyright (C) 2008, 2009 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+"""Direct tests of the revision_store classes."""
+
+from bzrlib import (
+    branch,
+    errors,
+    inventory,
+    osutils,
+    tests,
+    )
+
+from bzrlib.plugins.fastimport import (
+    revision_store,
+    )
+
+
+class Test_TreeShim(tests.TestCase):
+
+    def invAddEntry(self, inv, path, file_id=None):
+        if path.endswith('/'):
+            path = path[:-1]
+            kind = 'directory'
+        else:
+            kind = 'file'
+        parent_path, basename = osutils.split(path)
+        parent_id = inv.path2id(parent_path)
+        inv.add(inventory.make_entry(kind, basename, parent_id, file_id))
+
+    def make_trivial_basis_inv(self):
+        basis_inv = inventory.Inventory('TREE_ROOT')
+        self.invAddEntry(basis_inv, 'foo', 'foo-id')
+        self.invAddEntry(basis_inv, 'bar/', 'bar-id')
+        self.invAddEntry(basis_inv, 'bar/baz', 'baz-id')
+        return basis_inv
+
+    def test_id2path_no_delta(self):
+        basis_inv = self.make_trivial_basis_inv()
+        shim = revision_store._TreeShim(repo=None, basis_inv=basis_inv,
+                                        inv_delta=[], content_provider=None)
+        self.assertEqual('', shim.id2path('TREE_ROOT'))
+        self.assertEqual('foo', shim.id2path('foo-id'))
+        self.assertEqual('bar', shim.id2path('bar-id'))
+        self.assertEqual('bar/baz', shim.id2path('baz-id'))
+        self.assertRaises(errors.NoSuchId, shim.id2path, 'qux-id')
+
+    def test_id2path_with_delta(self):
+        basis_inv = self.make_trivial_basis_inv()
+        foo_entry = inventory.make_entry('file', 'foo2', 'TREE_ROOT', 'foo-id')
+        inv_delta = [('foo', 'foo2', 'foo-id', foo_entry),
+                     ('bar/baz', None, 'baz-id', None),
+                    ]
+
+        shim = revision_store._TreeShim(repo=None, basis_inv=basis_inv,
+                                        inv_delta=inv_delta,
+                                        content_provider=None)
+        self.assertEqual('', shim.id2path('TREE_ROOT'))
+        self.assertEqual('foo2', shim.id2path('foo-id'))
+        self.assertEqual('bar', shim.id2path('bar-id'))
+        self.assertRaises(errors.NoSuchId, shim.id2path, 'baz-id')
+
+    def test_path2id(self):
+        basis_inv = self.make_trivial_basis_inv()
+        shim = revision_store._TreeShim(repo=None, basis_inv=basis_inv,
+                                        inv_delta=[], content_provider=None)
+        self.assertEqual('TREE_ROOT', shim.path2id(''))
+        # We don't want to ever give a wrong value, so for now we just raise
+        # NotImplementedError
+        self.assertRaises(NotImplementedError, shim.path2id, 'bar')
+
+    def test_get_file_with_stat_content_in_stream(self):
+        basis_inv = self.make_trivial_basis_inv()
+
+        def content_provider(file_id):
+            return 'content of\n' + file_id + '\n'
+
+        shim = revision_store._TreeShim(repo=None, basis_inv=basis_inv,
540 | + inv_delta=[], |
541 | + content_provider=content_provider) |
542 | + f_obj, stat_val = shim.get_file_with_stat('baz-id') |
543 | + self.assertIs(None, stat_val) |
544 | + self.assertEqualDiff('content of\nbaz-id\n', f_obj.read()) |
545 | + |
546 | + # TODO: Test when the content isn't in the stream, and we fall back to the |
547 | + # repository that was passed in |
548 | + |
549 | + def test_get_symlink_target(self): |
550 | + basis_inv = self.make_trivial_basis_inv() |
551 | + ie = inventory.make_entry('symlink', 'link', 'TREE_ROOT', 'link-id') |
552 | + ie.symlink_target = u'link-target' |
553 | + basis_inv.add(ie) |
554 | + shim = revision_store._TreeShim(repo=None, basis_inv=basis_inv, |
555 | + inv_delta=[], content_provider=None) |
556 | + self.assertEqual(u'link-target', shim.get_symlink_target('link-id')) |
557 | + |
558 | + def test_get_symlink_target_from_delta(self): |
559 | + basis_inv = self.make_trivial_basis_inv() |
560 | + ie = inventory.make_entry('symlink', 'link', 'TREE_ROOT', 'link-id') |
561 | + ie.symlink_target = u'link-target' |
562 | + inv_delta = [(None, 'link', 'link-id', ie)] |
563 | + shim = revision_store._TreeShim(repo=None, basis_inv=basis_inv, |
564 | + inv_delta=inv_delta, |
565 | + content_provider=None) |
566 | + self.assertEqual(u'link-target', shim.get_symlink_target('link-id')) |
567 | + |
568 | + def test__delta_to_iter_changes(self): |
569 | + basis_inv = self.make_trivial_basis_inv() |
570 | + foo_entry = inventory.make_entry('file', 'foo2', 'bar-id', 'foo-id') |
571 | + link_entry = inventory.make_entry('symlink', 'link', 'TREE_ROOT', |
572 | + 'link-id') |
573 | + link_entry.symlink_target = u'link-target' |
574 | + inv_delta = [('foo', 'bar/foo2', 'foo-id', foo_entry), |
575 | + ('bar/baz', None, 'baz-id', None), |
576 | + (None, 'link', 'link-id', link_entry), |
577 | + ] |
578 | + shim = revision_store._TreeShim(repo=None, basis_inv=basis_inv, |
579 | + inv_delta=inv_delta, |
580 | + content_provider=None) |
581 | + changes = list(shim._delta_to_iter_changes()) |
582 | + expected = [('foo-id', ('foo', 'bar/foo2'), False, (True, True), |
583 | + ('TREE_ROOT', 'bar-id'), ('foo', 'foo2'), |
584 | + ('file', 'file'), (False, False)), |
585 | + ('baz-id', ('bar/baz', None), True, (True, False), |
586 | + ('bar-id', None), ('baz', None), |
587 | + ('file', None), (False, None)), |
588 | + ('link-id', (None, 'link'), True, (False, True), |
589 | + (None, 'TREE_ROOT'), (None, 'link'), |
590 | + (None, 'symlink'), (None, False)), |
591 | + ] |
592 | + # from pprint import pformat |
593 | + # self.assertEqualDiff(pformat(expected), pformat(changes)) |
594 | + self.assertEqual(expected, changes) |
595 | + |
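For readers unfamiliar with the inventory-delta format these tests exercise: a delta is a list of (old_path, new_path, file_id, new_entry) tuples, where old_path is None for an add and new_path is None for a delete. The sketch below is a hypothetical, simplified model of the id2path resolution shown in test_id2path_with_delta — DeltaShim and its plain-dict basis mapping are stand-ins, not bzrlib's _TreeShim, which also walks the real basis inventory and must cope with harder cases such as children of a renamed directory:

```python
# Hypothetical minimal model of id2path over a basis mapping plus an
# inventory delta. A delta row is (old_path, new_path, file_id, new_entry);
# new_path=None marks a deletion, old_path=None marks an addition.

class DeltaShim(object):
    def __init__(self, basis_paths, inv_delta):
        # basis_paths: {file_id: path}, as extracted from the basis inventory
        self._paths = dict(basis_paths)
        for old_path, new_path, file_id, new_entry in inv_delta:
            if new_path is None:
                del self._paths[file_id]         # entry deleted
            else:
                self._paths[file_id] = new_path  # entry added or renamed

    def id2path(self, file_id):
        try:
            return self._paths[file_id]
        except KeyError:
            raise KeyError('No such id: %r' % (file_id,))


# Mirrors make_trivial_basis_inv() and the delta in test_id2path_with_delta:
basis = {'TREE_ROOT': '', 'foo-id': 'foo', 'bar-id': 'bar',
         'baz-id': 'bar/baz'}
delta = [('foo', 'foo2', 'foo-id', object()),   # rename foo -> foo2
         ('bar/baz', None, 'baz-id', None)]     # delete bar/baz
shim = DeltaShim(basis, delta)
print(shim.id2path('foo-id'))  # foo2
print(shim.id2path('bar-id'))  # bar
```

As in the tests above, looking up the deleted 'baz-id' raises an error rather than returning a stale path.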
This is my earlier branch that I proposed. I just wasn't sure if it was properly put up for review.
There is also one very small tweak, because the way the inv_sha1 fix landed in bzr is slightly different from how I originally envisioned it.