Merge lp://staging/~jameinel/bzr/1.17-gc-single-mem into lp://staging/~bzr/bzr/trunk-old

Proposed by John A Meinel
Status: Merged
Approved by: Ian Clatworthy
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp://staging/~jameinel/bzr/1.17-gc-single-mem
Merge into: lp://staging/~bzr/bzr/trunk-old
Diff against target: 136 lines
To merge this branch: bzr merge lp://staging/~jameinel/bzr/1.17-gc-single-mem
Reviewer: Andrew Bennetts (Approve)
Review via email: mp+7768@code.staging.launchpad.net
John A Meinel (jameinel) wrote:

This patch finishes up the work I started with "commit" and memory consumption.

After this patch, when doing 'bzr commit' in a --2a format repository, we no longer hold more than one copy of the file text in memory.

So --2a now only holds:
  1 file text
  2 copies of the compressed bytes
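
The joined text never shows up as an extra copy because the compressed bytes are now built chunk by chunk (see _create_z_content_from_chunks in the diff below). A minimal sketch of that idea, in the Python 2 style bzrlib uses; compress_chunks is just an illustrative name, not a bzrlib function:

  import zlib

  def compress_chunks(chunks):
      # Feed each chunk to a single compressobj instead of joining the
      # chunks into one big string and calling zlib.compress() on it.
      # Peak memory stays around the chunk data plus the compressed
      # output, rather than an extra full copy of the raw content.
      compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
      compressed = [compressor.compress(chunk) for chunk in chunks]
      compressed.append(compressor.flush())
      return ''.join(compressed)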

For committing a single large file with lots of small lines, this is:

  bzr.dev (last week)   457324 KB   6.164s
  bzr.dev (_add_text)   215996 KB   3.980s
  this code             123904 KB   3.863s

So this code at least isn't slower, and it cuts the memory for a large file commit by a good amount.

https://bugs.edge.launchpad.net/bzr/+bug/109114

I also checked 'bzr pack', and I couldn't see any major change in its runtime, so this doesn't seem to make anything slower while cutting peak memory consumption for commit by quite a bit.

(Note that the memory consumed during 'bzr pack' is unaffected, and is still considerably higher (536MiB). I'm guessing it is dominated by 'create_delta_index', which peaks at roughly 2x the size of the content while building and re-shaping the index.)
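
For reference, the new block API is exercised roughly like the test added in the diff below; the chunk values here are just illustrative:

  from bzrlib import groupcompress

  chunks = ['this is some content\n', 'this content will be compressed\n']
  gcb = groupcompress.GroupCompressBlock()
  # The block keeps the chunks as-is; they are only joined, or compressed
  # chunk by chunk, when the serialised bytes are actually needed.
  gcb.set_chunked_content(chunks, sum(map(len, chunks)))
  data = gcb.to_bytes()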

Andrew Bennetts (spiv) wrote:

John A Meinel wrote:
[...]
> After this patch, when doing 'bzr commit' in a --2a format repository, we no longer hold more than one copy of the file text in memory.
>
> So --2a now only holds:
> 1 file text
> 2 copies of the compressed bytes
>
> For committing a single large file with lots of small lines, this is:
>
>   bzr.dev (last week)   457324 KB   6.164s
>   bzr.dev (_add_text)   215996 KB   3.980s
>   this code             123904 KB   3.863s
>
> So this code at least isn't slower, and it cuts the memory for a large file commit by a good amount.

Nice!

Simple patch, too.

 review approve

It would be nice to have some sort of automated test for memory consumption,
but perhaps usertest is a better place for that than bzr selftest.
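
For illustration only, such a check could boil down to something like the following, using the stdlib resource module; the helper names and the limit are made up:

  import resource

  def peak_rss_kib():
      # ru_maxrss is reported in KiB on Linux (bytes on OS X)
      return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

  def check_peak_memory(operation, limit_kib):
      operation()
      peak = peak_rss_kib()
      if peak > limit_kib:
          raise AssertionError('peak RSS %d KiB exceeded limit %d KiB'
                               % (peak, limit_kib))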

-Andrew.

review: Approve

Preview Diff

=== modified file 'NEWS'
--- NEWS 2009-06-22 21:55:37 +0000
+++ NEWS 2009-06-23 03:35:22 +0000
@@ -43,9 +43,10 @@
   (Martin Pool, #339385)
 
 * Reduced memory consumption during ``bzr commit`` of large files. For
-  pre 2a formats, should be down to ~3x the size of a file, and for
-  ``--2a`` formats should be down to exactly 2x the size. Related to bug
-  #109114. (John Arbash Meinel)
+  pre 2a formats, should be down to ~3x the size of a file.
+  For ``--2a`` format repositories, it is down to the size of the file
+  content plus the size of the compressed text. Related to bug #109114.
+  (John Arbash Meinel)
 
 * Repositories using CHK pages (which includes the new 2a format) will no
   longer error during commit or push operations when an autopack operation

=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py 2009-06-22 15:47:25 +0000
+++ bzrlib/groupcompress.py 2009-06-23 03:35:23 +0000
@@ -108,6 +108,7 @@
         self._z_content_length = None
         self._content_length = None
         self._content = None
+        self._content_chunks = None
 
     def __len__(self):
         # This is the maximum number of bytes this object will reference if
@@ -137,6 +138,10 @@
                 % (num_bytes, self._content_length))
         # Expand the content if required
         if self._content is None:
+            if self._content_chunks is not None:
+                self._content = ''.join(self._content_chunks)
+                self._content_chunks = None
+        if self._content is None:
             if self._z_content is None:
                 raise AssertionError('No content to decompress')
             if self._z_content == '':
@@ -273,22 +278,55 @@
             bytes = apply_delta_to_source(self._content, content_start, end)
         return bytes
 
+    def set_chunked_content(self, content_chunks, length):
+        """Set the content of this block to the given chunks."""
+        # If we have lots of short lines, it is may be more efficient to join
+        # the content ahead of time. If the content is <10MiB, we don't really
+        # care about the extra memory consumption, so we can just pack it and
+        # be done. However, timing showed 18s => 17.9s for repacking 1k revs of
+        # mysql, which is below the noise margin
+        self._content_length = length
+        self._content_chunks = content_chunks
+        self._content = None
+        self._z_content = None
+
     def set_content(self, content):
         """Set the content of this block."""
         self._content_length = len(content)
         self._content = content
         self._z_content = None
 
+    def _create_z_content_using_lzma(self):
+        if self._content_chunks is not None:
+            self._content = ''.join(self._content_chunks)
+            self._content_chunks = None
+        if self._content is None:
+            raise AssertionError('Nothing to compress')
+        self._z_content = pylzma.compress(self._content)
+        self._z_content_length = len(self._z_content)
+
+    def _create_z_content_from_chunks(self):
+        compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
+        compressed_chunks = map(compressor.compress, self._content_chunks)
+        compressed_chunks.append(compressor.flush())
+        self._z_content = ''.join(compressed_chunks)
+        self._z_content_length = len(self._z_content)
+
+    def _create_z_content(self):
+        if self._z_content is not None:
+            return
+        if _USE_LZMA:
+            self._create_z_content_using_lzma()
+            return
+        if self._content_chunks is not None:
+            self._create_z_content_from_chunks()
+            return
+        self._z_content = zlib.compress(self._content)
+        self._z_content_length = len(self._z_content)
+
     def to_bytes(self):
         """Encode the information into a byte stream."""
-        compress = zlib.compress
-        if _USE_LZMA:
-            compress = pylzma.compress
-        if self._z_content is None:
-            if self._content is None:
-                raise AssertionError('Nothing to compress')
-            self._z_content = compress(self._content)
-            self._z_content_length = len(self._z_content)
+        self._create_z_content()
         if _USE_LZMA:
             header = self.GCB_LZ_HEADER
         else:
@@ -762,10 +800,9 @@
         # for 'commit' down to ~1x the size of the largest file, at a
         # cost of increased complexity within this code. 2x is still <<
         # 3x the size of the largest file, so we are doing ok.
-        content = ''.join(self.chunks)
+        self._block.set_chunked_content(self.chunks, self.endpoint)
         self.chunks = None
         self._delta_index = None
-        self._block.set_content(content)
         return self._block
 
     def pop_last(self):

=== modified file 'bzrlib/tests/test_groupcompress.py'
--- bzrlib/tests/test_groupcompress.py 2009-06-10 03:56:49 +0000
+++ bzrlib/tests/test_groupcompress.py 2009-06-23 03:35:23 +0000
@@ -363,6 +363,15 @@
         raw_bytes = zlib.decompress(remaining_bytes)
         self.assertEqual(content, raw_bytes)
 
+        # we should get the same results if using the chunked version
+        gcb = groupcompress.GroupCompressBlock()
+        gcb.set_chunked_content(['this is some content\n'
+                                 'this content will be compressed\n'],
+                                len(content))
+        old_bytes = bytes
+        bytes = gcb.to_bytes()
+        self.assertEqual(old_bytes, bytes)
+
     def test_partial_decomp(self):
         content_chunks = []
         # We need a sufficient amount of data so that zlib.decompress has