Merge lp://staging/~leonardr/lazr.restfulclient/shorten-cache-filename into lp://staging/lazr.restfulclient

Proposed by Leonard Richardson
Status: Merged
Approved by: Brad Crittenden
Approved revision: 86
Merged at revision: not available
Proposed branch: lp://staging/~leonardr/lazr.restfulclient/shorten-cache-filename
Merge into: lp://staging/lazr.restfulclient
Diff against target: 173 lines (+147/-1)
2 files modified
src/lazr/restfulclient/_browser.py (+40/-1)
src/lazr/restfulclient/docs/caching.txt (+107/-0)
To merge this branch: bzr merge lp://staging/~leonardr/lazr.restfulclient/shorten-cache-filename
Reviewer                     Review Type  Date Requested  Status
Brad Crittenden (community)  code                         Approve
Review via email: mp+18951@code.staging.launchpad.net
Leonard Richardson (leonardr) wrote :

This branch fixes bug 512832 by ensuring that the filename of a cached representation is never longer than 150 characters. The filename always ends with an MD5 sum derived from the resource's full URL, so truncated filenames won't collide unless there's also a hash collision.

This branch copies and pastes code from httplib2. I filed an httplib2 bug (http://code.google.com/p/httplib2/issues/detail?id=92) so that the underlying problem can be fixed upstream and I can eventually get rid of the copied code.

Tests make up most of this branch.
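
As a rough illustration of the scheme described above (not the code in this branch), the following minimal Python 3 sketch caps the readable part of the name at 117 characters, so that the prefix, a comma, and a 32-character MD5 hex digest total at most 150. The function name `short_safename`, the constants, and the two regular expressions are assumptions made for this example; the patterns are modeled on httplib2's `re_url_scheme` and `re_slash`, and the IDNA handling done by the real `safename` is omitted.

```python
import hashlib
import re

# Assumed patterns, modeled on httplib2's re_url_scheme and re_slash.
RE_URL_SCHEME = re.compile(r"^\w+://")
RE_SLASH = re.compile(r"[?/:|]+")

MAX_FILENAME = 150    # rough safe filename length on eCryptfs
MD5_HEX_LENGTH = 32   # length of an MD5 hex digest
# Leave room for the digest and the comma that joins it to the prefix.
PREFIX_LIMIT = MAX_FILENAME - MD5_HEX_LENGTH - 1  # 117


def short_safename(url):
    """Return a cache filename of at most 150 characters for `url`.

    Hypothetical helper: the digest is computed from the full URL
    before truncation, so two URLs that share a long prefix still get
    different filenames unless their MD5 sums collide.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    name = RE_URL_SCHEME.sub("", url)   # drop "http://" and the like
    name = RE_SLASH.sub(",", name)      # replace path separators
    return ",".join((name[:PREFIX_LIMIT], digest))


# Two URLs that differ only after far more than 150 characters.
base = "http://cookbooks.dev/1.0/cookbooks/" + "a" * 300
first = short_safename(base)
second = short_safename(base + "-sequel")
```

Here `first` and `second` share the same truncated prefix and differ only in the trailing digest, which is the property the doctest in this branch checks against the real cache directory.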

Brad Crittenden (bac) wrote :

Nice fix, Leonard.

review: Approve (code)

Preview Diff

=== modified file 'src/lazr/restfulclient/_browser.py'
--- src/lazr/restfulclient/_browser.py 2009-10-22 16:25:34 +0000
+++ src/lazr/restfulclient/_browser.py 2010-02-09 20:07:14 +0000
@@ -35,7 +35,7 @@
 import shutil
 import tempfile
 from httplib2 import (
-    FailedToDecompressContent, FileCache, Http, safename, urlnorm)
+    FailedToDecompressContent, FileCache, Http, urlnorm)
 import simplejson
 from cStringIO import StringIO
 import zlib
@@ -68,6 +68,45 @@
                 response, content)
     return content
 
+# A drop-in replacement for httplib2's safename.
+from httplib2 import _md5, re_url_scheme, re_slash
+def safename(filename):
+    """Return a filename suitable for the cache.
+
+    Strips dangerous and common characters to create a filename we
+    can use to store the cache in.
+    """
+
+    try:
+        if re_url_scheme.match(filename):
+            if isinstance(filename,str):
+                filename = filename.decode('utf-8')
+                filename = filename.encode('idna')
+            else:
+                filename = filename.encode('idna')
+    except UnicodeError:
+        pass
+    if isinstance(filename,unicode):
+        filename=filename.encode('utf-8')
+    filemd5 = _md5(filename).hexdigest()
+    filename = re_url_scheme.sub("", filename)
+    filename = re_slash.sub(",", filename)
+
+    # This is the part that we changed. In stock httplib2, the
+    # filename is trimmed if it's longer than 200 characters, and then
+    # a comma and a 32-character md5 sum are appended. This causes
+    # problems on eCryptfs filesystems, where the maximum safe
+    # filename length is closer to 150 characters. So we take 117 as
+    # our limit (150-32-1) instead of 200.
+    #
+    # See:
+    # http://code.google.com/p/httplib2/issues/detail?id=92
+    # https://bugs.launchpad.net/bugs/344878
+    # https://bugs.launchpad.net/bugs/512832
+    if len(filename)>117:
+        filename=filename[:117]
+    return ",".join((filename, filemd5))
+
 
 class RestfulHttp(Http):
     """An Http subclass with some custom behavior.
 
=== modified file 'src/lazr/restfulclient/docs/caching.txt'
--- src/lazr/restfulclient/docs/caching.txt 2009-10-22 16:25:34 +0000
+++ src/lazr/restfulclient/docs/caching.txt 2010-02-09 20:07:14 +0000
@@ -112,3 +112,110 @@
     >>> httplib2.debuglevel = 0
     >>> import shutil
     >>> shutil.rmtree(tempdir)
+
+Cache filenames
+---------------
+
+lazr.restfulclient caches HTTP repsonses in individual files named
+after the URL accessed. This is behavior derived from httplib2, but
+lazr.restfulclient does two things differently from httplib2.
+
+To see these two things, let's set up a client that uses a temporary
+directory as a cache file. The directory starts out empty.
+
+    >>> from os import listdir
+    >>> tempdir = tempfile.mkdtemp()
+    >>> len(listdir(tempdir))
+    0
+
+As soon as we create a client object, though, lazr.restfulclient
+fetches a JSON and a WADL representation of the service root, and
+caches them individually.
+
+    >>> service = CookbookWebServiceClient(cache=tempdir)
+    >>> cache_contents = listdir(tempdir)
+    >>> for file in sorted(cache_contents):
+    ...     print file
+    cookbooks.dev...application,json...
+    cookbooks.dev...vnd.sun.wadl+xml...
+
+This is the first difference between lazr.restfulclient's caching and
+httplib2's. httplib2 would store all requests for the service root in
+a filename based solely on the URL. This effectively limits httplib2
+to a single representation of a given resource: the WADL
+representation would be overwritten with the JSON
+representation. lazr.restfulclient incorporates the media type in the
+cache filename, so that WADL and JSON representations are stored
+separately.
+
+The second difference has to do with filename length limits. httplib2
+caps filenames at about 240 characters so that cache files can be
+stored on filesystems with 255-character filename length limits. For
+compatibility with eCryptfs filesystems, lazr.restfulclient goes
+further, and caps filenames at 150 characters.
+
+To test out the limit, let's create a cookbook with an incredibly
+long name.
+
+    >>> long_name = (
+    ...     "This cookbook name is amazingly long; so long that it will "
+    ...     "surely be truncated when it is incorporated into a file "
+    ...     "name for the cache. The cache file will contain a cached "
+    ...     "HTTP respone containing a JSON representation of of this "
+    ...     "cookbook, whose name, I repeat, is very long indeed.")
+    >>> len(long_name)
+    281
+
+    >>> import datetime
+    >>> date = datetime.datetime(1994, 1, 1)
+    >>> book = service.cookbooks.create(
+    ...     name=long_name, cuisine="General", copyright_date=date,
+    ...     price=10.22, last_printing=date)
+
+lazr.restfulclient automatically fetched a JSON representation of the
+new cookbook, so it's already present in the cache. Because a
+cookbook's URL incorporates its name, and this cookbook's name is
+incredibly long, it must have been truncated to fit on disk.
+
+    >>> [cookbook_cache_filename] = [file for file in listdir(tempdir)
+    ...                              if 'amazingly' in file]
+
+Indeed, the filename has been truncated to fit in the rough
+150-character safety limit for eCryptfs filesystems.
+
+    >>> len(cookbook_cache_filename)
+    150
+
+Despite the truncation, some of the useful information from the
+cookbook's name makes it into the filename, making it easy to find when
+manually crawling through the cache directory.
+
+    >>> print cookbook_cache_filename
+    cookbooks.dev...This%20cookbook%20name%20is%20amazingly%20long...
+
+To avoid conflicts caused by truncation, the filename always ends with
+an MD5 sum derived from the untruncated URL. Let's create a second
+cookbook whose name differs from the first cookbook only at the end.
+
+    >>> longer_name = long_name + ": The Sequel"
+    >>> book = service.cookbooks.create(
+    ...     name=longer_name, cuisine="General", copyright_date=date,
+    ...     price=10.22, last_printing=date)
+
+This cookbook's URL is identical to the first cookbook's URL for far
+longer than 150 characters. But since the truncated filename
+incorporates an MD5 sum based on the full URL, the two cookbooks are
+cached in separate files.
+
+    >>> [file1, file2] = [file for file in listdir(tempdir)
+    ...                   if 'amazingly' in file]
+
+The filenames are identical up to the last 32 characters, which is
+where the MD5 sum begins. But because the MD5 sums are different, they
+are not completely identical.
+
+    >>> file1[:-32] == file2[:-32]
+    True
+
+    >>> file1 == file2
+    False
