Merge lp://staging/~leonardr/lazr.restfulclient/shorten-cache-filename into lp://staging/lazr.restfulclient

Proposed by Leonard Richardson
Status: Merged
Approved by: Brad Crittenden
Approved revision: 86
Merged at revision: not available
Proposed branch: lp://staging/~leonardr/lazr.restfulclient/shorten-cache-filename
Merge into: lp://staging/lazr.restfulclient
Diff against target: 173 lines (+147/-1)
2 files modified
src/lazr/restfulclient/_browser.py (+40/-1)
src/lazr/restfulclient/docs/caching.txt (+107/-0)
To merge this branch: bzr merge lp://staging/~leonardr/lazr.restfulclient/shorten-cache-filename
Reviewer                     Review Type  Date Requested  Status
Brad Crittenden (community)  code                         Approve
Review via email: mp+18951@code.staging.launchpad.net
Leonard Richardson (leonardr) wrote :

This branch fixes bug 512832 by ensuring that the filename of a cached representation is never longer than 150 characters. The filename always ends with an MD5 sum derived from the resource's full URL, so truncated filenames won't collide unless there's also a hash collision.

This branch copies code from httplib2 and modifies it. I filed an httplib2 bug (http://code.google.com/p/httplib2/issues/detail?id=92) to deal with the underlying problem in such a way that I can eventually get rid of the copied code.

Tests make up most of this branch.
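
To try the truncation scheme described above outside the branch, here is a minimal Python 3 sketch. It is not the code in this branch: the name short_safename, the constants, and the two regular expressions are assumptions made for this example (the patterns are meant to approximate httplib2's re_url_scheme and re_slash), and the IDNA handling of the real function is left out.

```python
import hashlib
import re

# Assumed approximations of httplib2's re_url_scheme and re_slash.
RE_URL_SCHEME = re.compile(r"^\w+://")
RE_SLASH = re.compile(r"[?/:|]+")

MD5_HEX_LENGTH = 32
MAX_FILENAME = 150
# 150 total - 32 for the digest - 1 for the separating comma = 117.
MAX_PREFIX = MAX_FILENAME - MD5_HEX_LENGTH - 1


def short_safename(url):
    """Hypothetical Python 3 approximation of the patched safename()."""
    # The digest is computed from the full, untruncated URL.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    name = RE_URL_SCHEME.sub("", url)
    name = RE_SLASH.sub(",", name)
    # Only the readable prefix is truncated; the digest is always kept.
    return ",".join((name[:MAX_PREFIX], digest))


long_url = "http://cookbooks.dev/1.0/cookbooks/" + "x" * 300
print(len(short_safename(long_url)))
```

Because the digest is appended after truncation, the total length never exceeds 117 + 1 + 32 = 150 characters, however long the URL is.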

Brad Crittenden (bac) wrote :

Nice fix, Leonard.

review: Approve (code)

Preview Diff

=== modified file 'src/lazr/restfulclient/_browser.py'
--- src/lazr/restfulclient/_browser.py 2009-10-22 16:25:34 +0000
+++ src/lazr/restfulclient/_browser.py 2010-02-09 20:07:14 +0000
@@ -35,7 +35,7 @@
 import shutil
 import tempfile
 from httplib2 import (
-    FailedToDecompressContent, FileCache, Http, safename, urlnorm)
+    FailedToDecompressContent, FileCache, Http, urlnorm)
 import simplejson
 from cStringIO import StringIO
 import zlib
@@ -68,6 +68,45 @@
             response, content)
     return content

+# A drop-in replacement for httplib2's safename.
+from httplib2 import _md5, re_url_scheme, re_slash
+def safename(filename):
+    """Return a filename suitable for the cache.
+
+    Strips dangerous and common characters to create a filename we
+    can use to store the cache in.
+    """
+
+    try:
+        if re_url_scheme.match(filename):
+            if isinstance(filename,str):
+                filename = filename.decode('utf-8')
+                filename = filename.encode('idna')
+            else:
+                filename = filename.encode('idna')
+    except UnicodeError:
+        pass
+    if isinstance(filename,unicode):
+        filename=filename.encode('utf-8')
+    filemd5 = _md5(filename).hexdigest()
+    filename = re_url_scheme.sub("", filename)
+    filename = re_slash.sub(",", filename)
+
+    # This is the part that we changed. In stock httplib2, the
+    # filename is trimmed if it's longer than 200 characters, and then
+    # a comma and a 32-character md5 sum are appended. This causes
+    # problems on eCryptfs filesystems, where the maximum safe
+    # filename length is closer to 150 characters. So we take 117 as
+    # our limit (150-32-1) instead of 200.
+    #
+    # See:
+    #   http://code.google.com/p/httplib2/issues/detail?id=92
+    #   https://bugs.launchpad.net/bugs/344878
+    #   https://bugs.launchpad.net/bugs/512832
+    if len(filename)>117:
+        filename=filename[:117]
+    return ",".join((filename, filemd5))
+

 class RestfulHttp(Http):
     """An Http subclass with some custom behavior.

=== modified file 'src/lazr/restfulclient/docs/caching.txt'
--- src/lazr/restfulclient/docs/caching.txt 2009-10-22 16:25:34 +0000
+++ src/lazr/restfulclient/docs/caching.txt 2010-02-09 20:07:14 +0000
@@ -112,3 +112,110 @@
     >>> httplib2.debuglevel = 0
     >>> import shutil
     >>> shutil.rmtree(tempdir)
+
+Cache filenames
+---------------
+
+lazr.restfulclient caches HTTP responses in individual files named
+after the URL accessed. This is behavior derived from httplib2, but
+lazr.restfulclient does two things differently from httplib2.
+
+To see these two things, let's set up a client that uses a temporary
+directory as a cache file. The directory starts out empty.
+
+    >>> from os import listdir
+    >>> tempdir = tempfile.mkdtemp()
+    >>> len(listdir(tempdir))
+    0
+
+As soon as we create a client object, though, lazr.restfulclient
+fetches a JSON and a WADL representation of the service root, and
+caches them individually.
+
+    >>> service = CookbookWebServiceClient(cache=tempdir)
+    >>> cache_contents = listdir(tempdir)
+    >>> for file in sorted(cache_contents):
+    ...     print file
+    cookbooks.dev...application,json...
+    cookbooks.dev...vnd.sun.wadl+xml...
+
+This is the first difference between lazr.restfulclient's caching and
+httplib2's. httplib2 would store all requests for the service root in
+a filename based solely on the URL. This effectively limits httplib2
+to a single representation of a given resource: the WADL
+representation would be overwritten with the JSON
+representation. lazr.restfulclient incorporates the media type in the
+cache filename, so that WADL and JSON representations are stored
+separately.
+
+The second difference has to do with filename length limits. httplib2
+caps filenames at about 240 characters so that cache files can be
+stored on filesystems with 255-character filename length limits. For
+compatibility with eCryptfs filesystems, lazr.restfulclient goes
+further, and caps filenames at 150 characters.
+
+To test out the limit, let's create a cookbook with an incredibly
+long name.
+
+    >>> long_name = (
+    ...     "This cookbook name is amazingly long; so long that it will "
+    ...     "surely be truncated when it is incorporated into a file "
+    ...     "name for the cache. The cache file will contain a cached "
+    ...     "HTTP respone containing a JSON representation of of this "
+    ...     "cookbook, whose name, I repeat, is very long indeed.")
+    >>> len(long_name)
+    281
+
+    >>> import datetime
+    >>> date = datetime.datetime(1994, 1, 1)
+    >>> book = service.cookbooks.create(
+    ...     name=long_name, cuisine="General", copyright_date=date,
+    ...     price=10.22, last_printing=date)
+
+lazr.restfulclient automatically fetched a JSON representation of the
+new cookbook, so it's already present in the cache. Because a
+cookbook's URL incorporates its name, and this cookbook's name is
+incredibly long, it must have been truncated to fit on disk.
+
+    >>> [cookbook_cache_filename] = [file for file in listdir(tempdir)
+    ...     if 'amazingly' in file]
+
+Indeed, the filename has been truncated to fit in the rough
+150-character safety limit for eCryptfs filesystems.
+
+    >>> len(cookbook_cache_filename)
+    150
+
+Despite the truncation, some of the useful information from the
+cookbook's name makes it into the filename, making it easy to find when
+manually crawling through the cache directory.
+
+    >>> print cookbook_cache_filename
+    cookbooks.dev...This%20cookbook%20name%20is%20amazingly%20long...
+
+To avoid conflicts caused by truncation, the filename always ends with
+an MD5 sum derived from the untruncated URL. Let's create a second
+cookbook whose name differs from the first cookbook only at the end.
+
+    >>> longer_name = long_name + ": The Sequel"
+    >>> book = service.cookbooks.create(
+    ...     name=longer_name, cuisine="General", copyright_date=date,
+    ...     price=10.22, last_printing=date)
+
+This cookbook's URL is identical to the first cookbook's URL for far
+longer than 150 characters. But since the truncated filename
+incorporates an MD5 sum based on the full URL, the two cookbooks are
+cached in separate files.
+
+    >>> [file1, file2] = [file for file in listdir(tempdir)
+    ...     if 'amazingly' in file]
+
+The filenames are identical up to the last 32 characters, which is
+where the MD5 sum begins. But because the MD5 sums are different, they
+are not completely identical.
+
+    >>> file1[:-32] == file2[:-32]
+    True
+
+    >>> file1 == file2
+    False
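
The collision check at the end of the doctest can be reproduced without a running web service. What follows is a rough Python 3 sketch, not code from this branch: the helper cache_name and the sample URLs are hypothetical and only mimic the shape of the real cache filenames (scheme stripped, slashes turned into commas, prefix cut at 117 characters, MD5 of the full URL appended).

```python
import hashlib


def cache_name(url, limit=117):
    """Hypothetical stand-in for the patched safename()."""
    # Hash the complete URL before anything is cut off.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # Drop the scheme and flatten path separators into commas.
    flattened = url.split("://", 1)[-1].replace("/", ",")
    return flattened[:limit] + "," + digest


# Two made-up URLs that share far more than 117 leading characters.
base = "http://cookbooks.dev/1.0/cookbooks/" + "long%20name%20" * 20
first = cache_name(base)
second = cache_name(base + "%3A%20The%20Sequel")

# Same readable prefix, different MD5 suffix, so two distinct files.
print(first[:-32] == second[:-32])
print(first == second)
```

The two names can only coincide if the full URLs produce the same MD5 digest, which matches the claim in the branch description.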