lazr.restfulclient

Merge lp://staging/~leonardr/lazr.restfulclient/shorten-cache-filename into lp://staging/lazr.restfulclient

Proposed by Leonard Richardson on 2010-02-09

Status: Merged

Approved by: Brad Crittenden on 2010-02-09

Approved revision:

Merged at revision:

not available

Proposed branch: lp://staging/~leonardr/lazr.restfulclient/shorten-cache-filename

Merge into: lp://staging/lazr.restfulclient

Diff against target: 173 lines (+147/-1)

2 files modified

src/lazr/restfulclient/_browser.py (+40/-1)
src/lazr/restfulclient/docs/caching.txt (+107/-0)

To merge this branch:

bzr merge lp://staging/~leonardr/lazr.restfulclient/shorten-cache-filename

High

Fix Released

Reviewer	Review Type	Date Requested	Status
Brad Crittenden (community)	code	2010-02-09	Approve on 2010-02-09
Review via email: mp+18951@code.staging.launchpad.net

Leonard Richardson (leonardr) wrote on 2010-02-09:

This branch fixes bug 512832 by ensuring that the filename of a cached representation is never longer than 150 characters. The filename always ends with an MD5 sum derived from the resource's full URL, so truncated filenames won't collide unless there's also a hash collision.

This code copies-and-pastes in code from httplib2. I filed an httplib2 bug (http://code.google.com/p/httplib2/issues/detail?id=92) to deal with the underlying problem in such a way that I can eventually get rid of the copy-and-pasted code.

Tests make up most of this branch.
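
The naming scheme described above can be sketched roughly as follows. This is a simplified Python 3 illustration, not the branch's code: `capped_cache_name`, the constants, and the example URLs are hypothetical names chosen here, the IDNA handling is omitted, and the two regular expressions are stand-ins for httplib2's `re_url_scheme` and `re_slash`.

```python
import hashlib
import re

MD5_HEX_LENGTH = 32   # length of an MD5 hex digest
MAX_TOTAL = 150       # rough safe filename length cited for eCryptfs
MAX_PREFIX = MAX_TOTAL - MD5_HEX_LENGTH - 1   # 117; the 1 is the joining comma


def capped_cache_name(url):
    """Return a cache filename of at most MAX_TOTAL characters."""
    # Hash the full, untruncated URL so truncated names stay distinct.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    name = re.sub(r"^\w+://", "", url)      # drop the URL scheme (assumed pattern)
    name = re.sub(r"[?/:|]+", ",", name)    # replace separators (assumed pattern)
    return ",".join((name[:MAX_PREFIX], digest))


# Two hypothetical URLs that share a prefix far longer than the limit.
base = "http://cookbooks.dev/1.0/cookbooks/" + "long%20name%20" * 30
first = capped_cache_name(base)
second = capped_cache_name(base + "The%20Sequel")

print(len(first), len(second))                        # 150 150
print(first[:-32] == second[:-32], first == second)   # True False
```

Both names hit the 150-character cap and share everything up to the digest, so only the MD5 suffix keeps them apart.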

Brad Crittenden (bac) wrote on 2010-02-09:

Nice fix Leonard.

review: Approve (code)

Preview Diff

 === modified file 'src/lazr/restfulclient/_browser.py'
 --- src/lazr/restfulclient/_browser.py	2009-10-22 16:25:34 +0000
 +++ src/lazr/restfulclient/_browser.py	2010-02-09 20:07:14 +0000
@@ -35,7 +35,7 @@
  import shutil
  import tempfile
  from httplib2 import (
--    FailedToDecompressContent, FileCache, Http, safename, urlnorm)
++    FailedToDecompressContent, FileCache, Http, urlnorm)
  import simplejson
  from cStringIO import StringIO
  import zlib
@@ -68,6 +68,45 @@
              response, content)
      return content
++# A drop-in replacement for httplib2's safename.
++from httplib2 import _md5, re_url_scheme, re_slash
++def safename(filename):
++    """Return a filename suitable for the cache.
++
++    Strips dangerous and common characters to create a filename we
++    can use to store the cache in.
++    """
++
++    try:
++        if re_url_scheme.match(filename):
++            if isinstance(filename,str):
++                filename = filename.decode('utf-8')
++                filename = filename.encode('idna')
++            else:
++                filename = filename.encode('idna')
++    except UnicodeError:
++        pass
++    if isinstance(filename,unicode):
++        filename=filename.encode('utf-8')
++    filemd5 = _md5(filename).hexdigest()
++    filename = re_url_scheme.sub("", filename)
++    filename = re_slash.sub(",", filename)
++
++    # This is the part that we changed. In stock httplib2, the
++    # filename is trimmed if it's longer than 200 characters, and then
++    # a comma and a 32-character md5 sum are appended. This causes
++    # problems on eCryptfs filesystems, where the maximum safe
++    # filename length is closer to 150 characters. So we take 117 as
++    # our limit (150-32-1) instead of 200.
++    #
++    # See:
++    #  http://code.google.com/p/httplib2/issues/detail?id=92
++    #  https://bugs.launchpad.net/bugs/344878
++    #  https://bugs.launchpad.net/bugs/512832
++    if len(filename)>117:
++        filename=filename[:117]
++    return ",".join((filename, filemd5))
++
  class RestfulHttp(Http):
      """An Http subclass with some custom behavior.
 === modified file 'src/lazr/restfulclient/docs/caching.txt'
 --- src/lazr/restfulclient/docs/caching.txt	2009-10-22 16:25:34 +0000
 +++ src/lazr/restfulclient/docs/caching.txt	2010-02-09 20:07:14 +0000
@@ -112,3 +112,110 @@
      >>> httplib2.debuglevel = 0
      >>> import shutil
      >>> shutil.rmtree(tempdir)
++
++Cache filenames
++---------------
++
++lazr.restfulclient caches HTTP responses in individual files named
++after the URL accessed. This is behavior derived from httplib2, but
++lazr.restfulclient does two things differently from httplib2.
++
++To see these two things, let's set up a client that uses a temporary
++directory as its cache. The directory starts out empty.
++
++    >>> from os import listdir
++    >>> tempdir = tempfile.mkdtemp()
++    >>> len(listdir(tempdir))
++    0
++
++As soon as we create a client object, though, lazr.restfulclient
++fetches a JSON and a WADL representation of the service root, and
++caches them individually.
++
++    >>> service = CookbookWebServiceClient(cache=tempdir)
++    >>> cache_contents = listdir(tempdir)
++    >>> for file in sorted(cache_contents):
++    ...     print file
++    cookbooks.dev...application,json...
++    cookbooks.dev...vnd.sun.wadl+xml...
++
++This is the first difference between lazr.restfulclient's caching and
++httplib2's. httplib2 would store all requests for the service root in
++a filename based solely on the URL. This effectively limits httplib2
++to a single representation of a given resource: the WADL
++representation would be overwritten with the JSON
++representation. lazr.restfulclient incorporates the media type in the
++cache filename, so that WADL and JSON representations are stored
++separately.
++
++The second difference has to do with filename length limits. httplib2
++caps filenames at 233 characters (a 200-character prefix, a comma, and
++a 32-character MD5 sum) so that cache files fit on filesystems with
++255-character filename limits. For compatibility with eCryptfs,
++lazr.restfulclient goes further and caps filenames at 150 characters.
++
++To test out the limit, let's create a cookbook with an incredibly
++long name.
++
++    >>> long_name = (
++    ...     "This cookbook name is amazingly long; so long that it will "
++    ...     "surely be truncated when it is incorporated into a file "
++    ...     "name for the cache. The cache file will contain a cached "
++    ...     "HTTP respone containing a JSON representation of of this "
++    ...     "cookbook, whose name, I repeat, is very long indeed.")
++    >>> len(long_name)
++    281
++
++    >>> import datetime
++    >>> date = datetime.datetime(1994, 1, 1)
++    >>> book = service.cookbooks.create(
++    ...     name=long_name, cuisine="General", copyright_date=date,
++    ...     price=10.22, last_printing=date)
++
++lazr.restfulclient automatically fetched a JSON representation of the
++new cookbook, so it's already present in the cache. Because a
++cookbook's URL incorporates its name, and this cookbook's name is
++incredibly long, it must have been truncated to fit on disk.
++
++    >>> [cookbook_cache_filename] = [file for file in listdir(tempdir)
++    ...                              if 'amazingly' in file]
++
++Indeed, the filename has been truncated to fit in the rough
++150-character safety limit for eCryptfs filesystems.
++
++    >>> len(cookbook_cache_filename)
++    150
++
++Despite the truncation, some of the useful information from the
++cookbook's name makes it into the filename, making it easy to find when
++manually crawling through the cache directory.
++
++    >>> print cookbook_cache_filename
++    cookbooks.dev...This%20cookbook%20name%20is%20amazingly%20long...
++
++To avoid conflicts caused by truncation, the filename always ends with
++an MD5 sum derived from the untruncated URL. Let's create a second
++cookbook whose name differs from the first cookbook only at the end.
++
++    >>> longer_name = long_name + ": The Sequel"
++    >>> book = service.cookbooks.create(
++    ...     name=longer_name, cuisine="General", copyright_date=date,
++    ...     price=10.22, last_printing=date)
++
++This cookbook's URL is identical to the first cookbook's URL for far
++longer than 150 characters. But since the truncated filename
++incorporates an MD5 sum based on the full URL, the two cookbooks are
++cached in separate files.
++
++    >>> [file1, file2] = [file for file in listdir(tempdir)
++    ...                   if 'amazingly' in file]
++
++The filenames are identical up to the last 32 characters, which is
++where the MD5 sum begins. But because the MD5 sums are different, they
++are not completely identical.
++
++    >>> file1[:-32] == file2[:-32]
++    True
++
++    >>> file1 == file2
++    False
