With regard to case--the test case is HTTP://www.google.com. Isn't caught with your current regex without specifying re.IGNORECASE. Otherwise I agree that the rest of the regex is not case sensitive.
WRT the brackets. The code following regex says:
10 for url in urls:
11 if (url.startswith('[') and url.endswith(']')) or load:
12 ignore.append(cleanurl(url))
So the url needs to include incorporate the [] in [http://www.google.com] so this test will work and the URL will be ignored when desired.
You asked about cleanurl:
def cleanurl(url):
if url.startswith('['):
url = url[1:]
if any([url.endswith(c) for c in '.;:]']):
url = url[:-1]
return url
Clean url, as I think you surmise, then takes the brackets off the url before appending it to the ignore list. cleanurl takes a url found by the regex and strips not only the []'s but also some other trailing punctuation, '.;:'
url = url[1:] is a slice operation that says make url equal to the old url but without the first character. url = url[:-1] says make the new url the old url minus the very last character. [url.endswith(c) for c in '.;:]' ] is a list comprehension that, for 'http://www.google.com;' returns [False, True, False, False] . Any ([False, True, False, False]) returns True because one of the elements is True (the ';' was found as the last character). I'm not sure I've seen the any function or the endswith string method used before, perhaps they looked odd to you. List comprehensions are cool shortcuts.
Let me know if something's still not clear.
There shouldn't be any problems with your other merges--unless I've missed one they should all be in that lernid-proposed branch already (it would be great if you'd double check me on that). I think we are down to this merge.
With regard to case--the test case is HTTP:// www.google. com. Isn't caught with your current regex without specifying re.IGNORECASE. Otherwise I agree that the rest of the regex is not case sensitive.
HTTP:// www.something. something is valid, though not the canonical way to put it. See http:// tools.ietf. org/html/ rfc3986# section- 3.1
WRT the brackets. The code following regex says: ('[') and url.endswith(']')) or load: append( cleanurl( url))
10 for url in urls:
11 if (url.startswith
12 ignore.
So the url needs to include incorporate the [] in [http:// www.google. com] so this test will work and the URL will be ignored when desired.
You asked about cleanurl:
def cleanurl(url): '['): endswith( c) for c in '.;:]']):
if url.startswith(
url = url[1:]
if any([url.
url = url[:-1]
return url
Clean url, as I think you surmise, then takes the brackets off the url before appending it to the ignore list. cleanurl takes a url found by the regex and strips not only the []'s but also some other trailing punctuation, '.;:'
url = url[1:] is a slice operation that says make url equal to the old url but without the first character. url = url[:-1] says make the new url the old url minus the very last character. [url.endswith(c) for c in '.;:]' ] is a list comprehension that, for 'http:// www.google. com;' returns [False, True, False, False] . Any ([False, True, False, False]) returns True because one of the elements is True (the ';' was found as the last character). I'm not sure I've seen the any function or the endswith string method used before, perhaps they looked odd to you. List comprehensions are cool shortcuts.
Let me know if something's still not clear.
There shouldn't be any problems with your other merges--unless I've missed one they should all be in that lernid-proposed branch already (it would be great if you'd double check me on that). I think we are down to this merge.