I think I've addressed all outstanding feedback so far. Some specifics:
On Tue, Jan 26, 2021 at 10:27:16PM -0000, Bryce Harrington wrote:
> For the URL regex, I ran it over Vcs-Git lines in the control files of
> packages I have checked out locally:
>
> for file in $(find . -name "control"); do grep Vcs-Git ${file} | xargs ~/check-re.py; done
>
> Here's a few examples of types of things that didn't match:
>
> https://git.launchpad.net/~ubuntu-server-dev/ubuntu/+source/rabbitmq-server
> git://anonscm.debian.org/openstack/python-seamicroclient.git
> <email address hidden>:sosreport-team/sosreport.git
> https://salsa.debian.org/vdr-team/#PACKAGE#.git
> https://salsa.debian.org/ruby-team/<%=
>
> The last two look more like errors than something that should be
> supported. + and ^git might be worth considering.
I'm only accepting https:// URLs for now. I've incorporated all the
above patterns into existing tests, including the ones that should be
rejected for now.
> Are you aware of Vcs-Git strings with ? in them? That char may be
> pretty rare for git urls.
Good shout. Added into the regexp and added a test.
> I suppose the more fundamental question than how git-ubuntu handles
> these cases, is how they'll be handled by dput->launchpad. Presumably
> they'll be filtering the links to an allowed subset? Or
> quoting/escaping invalid chars? If we can assume LP is filtering the
> weird stuff, that of course simplifies what needs to be tested for
> here.
Actually dput/Launchpad won't filter anything. Whatever the uploader
puts in, we'll get out at the other end. I care from the importer's end
that we aren't subverted by malicious input, and that we are able to
fetch the URL if we decide it's acceptable (barring network issues).
> I notice you're including * as an allowed character for git refs; is
> that intended, like for supporting wildcarding in refs?
It was intended, but I've now removed that following Colin's feedback,
and changed the header name from Vcs-Git-Refs to Vcs-Git-Ref. I've
adjusted tests accordingly.
> Could you do a sampling of refs from the imported packages to see what
> range of characters this'll likely be seeing in practice? (Might also
> be useful to run the dump through to test it.) Old projects and
> projects that have been migrated from one VCS to another sometimes have
> weird bits in their ancient history.
I'm not expecting to get refs like from existing Vcs-Git repositories.
These aren't normally based on git-ubuntu branches, and probably will
continue not to be. Instead I expect people to branch from git-ubuntu
branches, make their changes, push to a new ref somewhere in Launchpad,
and then provide those in Vcs-Git/Vcs-Git-Ref/Vcs-Git-Commit in their
changes file. This might well mean that Vcs-Git in changes files for
git-ubuntu will typically not match the Vcs-Git in debian/control.
My goal in testing my validation in tests for parsing/validating these
fields is therefore slightly different. What URLs and refs will
git-ubuntu users push to, _on Launchpad_ (as this will be an initial
requirement), and will we correctly validate all those?
My thinking for a belt and braces approach here is that I will also
provide tooling to write the headers into the changes file in a separate
branch (to follow). I intend to validate what I'm writing by calling the
same functions. This way, if it's going to fail, the user will be told
before upload and have an opportunity to seek help / adjust URLs and
proposed refs, rather than the rich history silently failing to make it
into git-ubuntu later from the user's perspective.
> Minor cosmetic, but in two of the regexes you list letters then numbers,
> and in the third the reverse. This would be more consistent:
>
> +VCS_GIT_COMMIT_VALIDATION = re.compile(r'[A-Za-z0-9]{40}')
Fixed, thanks.
> Technically I suppose you really only need [a-f0-9]{40} for SHAs?
Yes, but I figured that capitals were unambiguous and won't cause a
problem, so I added them to ensure that there isn't something somewhere
that generates capitals that would then fail.
> I don't think your patch is validating LP user or project names, but saw
> it came up in the IRC discussion. Colin's advice matches my own
> experiences. I had tried implementing some validators on bug reports
> way back when but Launchpad's acceptance algorithms are complicated with
> lots of special cases. If you go to the page to create a project, ppa,
> or team, there is text explaining allowed characters. E.g. "A short
> unique name, beginning with a lower-case letter or number, and
> containing only letters, numbers, dots, hyphens, or plus signs." Where
> I ran into trouble is that while these rules may hold for new names,
> there may be old ones or ones from imports that are grandfathered in or
> whatever, and all matter of weirdness when unicode gets involved. So I
> get his point about relying more on quoting such strings for your tool's
> internal use and letting Launchpad decide validity. But for git refs,
> shas, and urls it's a different situation.
Agreed. My question about usernames and project names was really about
what might end up in a Launchpad git URL, to make sure that my URL
validation wasn't too narrow. Ideally I'd not rely on that at all, but
I'm reluctant to trust that until I'm no longer calling git's CLI. For
that I need a pygit2 API, but we don't have that yet
(https://github.com/libgit2/pygit2/issues/1060).
> > I think the spec explains the requirement? I'm reluctant to have it
> > explained in two places - I'd rather that one refers to the other.
>
> I dunno, it strikes me as similar to the argument against having code
> docs because "the docs will get out of date when the code changes,
> better that people just read the code." But I was not suggesting to
> copy the spec, just a brief explanation for those of us too lazy to
> check references.
OK, adjusted.
>
> > > - "to that upload tag" -> "to the upload tag"
> >
> > Will fix - thanks!
Fixed.
> > > db9c906fb0cc4cbb4ea9706999dfff304d5f7970
> > > - If you plan to add pydocs for the parse_* routines, it could handy to have an example of the line(s) the routine will parse as valid. Or even just having that present in a comment would be useful.
> >
> > Agreed. I'm expecting to add parameterised tests for known good and bad
> > cases.
Examples added to the docstring, and tests added.
> > > - Presumably this relies on code changes to dput and/or launchpad, so links to appropriate references (bugs or commits) in the changelog message or comments might be useful for us to check on status of those external development tasks later.
> >
> > As it happens, we don't need to change Launchpad or dput at all. We just
> > need to provide a mechanism for users to easily insert the correct
> > headers into their changes file, which would mean extra arguments to
> > dpkg-genchanges (or dpkg-buildpackage or debuild, which will pass those
> > arguments through to dpkg-genchanges). I don't know exactly what form
> > this will take yet, but I can file a bug against usd-importer to track
> > that we need a UI.
>
> Do you expect any pushback on the feature from the dpkg-genchanges
> maintainers? If there's a chance of that, might be worth some initial
> discussion with them.
It won't need any dpkg-genchanges changes. It'll just need us to call
dpkg-genchanges (actually dpkg-buildpackage, which passes through) with
some extra parameters.
I think I've addressed all outstanding feedback so far. Some specifics:
On Tue, Jan 26, 2021 at 10:27:16PM -0000, Bryce Harrington wrote: /git.launchpad. net/~ubuntu- server- dev/ubuntu/ +source/ rabbitmq- server debian. org/openstack/ python- seamicroclient. git :sosreport- team/sosreport. git /salsa. debian. org/vdr- team/#PACKAGE#.git /salsa. debian. org/ruby- team/<%=
> For the URL regex, I ran it over Vcs-Git lines in the control files of
> packages I have checked out locally:
>
> for file in $(find . -name "control"); do grep Vcs-Git ${file} | xargs ~/check-re.py; done
>
> Here's a few examples of types of things that didn't match:
>
> https:/
> git://anonscm.
> <email address hidden>
> https:/
> https:/
>
> The last two look more like errors than something that should be
> supported. + and ^git might be worth considering.
I'm only accepting https:// URLs for now. I've incorporated all the
above patterns into existing tests, including the ones that should be
rejected for now.
> Are you aware of Vcs-Git strings with ? in them? That char may be
> pretty rare for git urls.
Good shout. Added into the regexp and added a test.
> I suppose the more fundamental question than how git-ubuntu handles
> these cases, is how they'll be handled by dput->launchpad. Presumably
> they'll be filtering the links to an allowed subset? Or
> quoting/escaping invalid chars? If we can assume LP is filtering the
> weird stuff, that of course simplifies what needs to be tested for
> here.
Actually dput/Launchpad won't filter anything. Whatever the uploader
puts in, we'll get out at the other end. I care from the importer's end
that we aren't subverted by malicious input, and that we are able to
fetch the URL if we decide it's acceptable (barring network issues).
> I notice you're including * as an allowed character for git refs; is
> that intended, like for supporting wildcarding in refs?
It was intended, but I've now removed that following Colin's feedback,
and changed the header name from Vcs-Git-Refs to Vcs-Git-Ref. I've
adjusted tests accordingly.
> Could you do a sampling of refs from the imported packages to see what
> range of characters this'll likely be seeing in practice? (Might also
> be useful to run the dump through to test it.) Old projects and
> projects that have been migrated from one VCS to another sometimes have
> weird bits in their ancient history.
I'm not expecting to get refs like from existing Vcs-Git repositories. Vcs-Git- Ref/Vcs- Git-Commit in their
These aren't normally based on git-ubuntu branches, and probably will
continue not to be. Instead I expect people to branch from git-ubuntu
branches, make their changes, push to a new ref somewhere in Launchpad,
and then provide those in Vcs-Git/
changes file. This might well mean that Vcs-Git in changes files for
git-ubuntu will typically not match the Vcs-Git in debian/control.
My goal in testing my validation in tests for parsing/validating these
fields is therefore slightly different. What URLs and refs will
git-ubuntu users push to, _on Launchpad_ (as this will be an initial
requirement), and will we correctly validate all those?
My thinking for a belt and braces approach here is that I will also
provide tooling to write the headers into the changes file in a separate
branch (to follow). I intend to validate what I'm writing by calling the
same functions. This way, if it's going to fail, the user will be told
before upload and have an opportunity to seek help / adjust URLs and
proposed refs, rather than the rich history silently failing to make it
into git-ubuntu later from the user's perspective.
> Minor cosmetic, but in two of the regexes you list letters then numbers, COMMIT_ VALIDATION = re.compile( r'[A-Za- z0-9]{40} ')
> and in the third the reverse. This would be more consistent:
>
> +VCS_GIT_
Fixed, thanks.
> Technically I suppose you really only need [a-f0-9]{40} for SHAs?
Yes, but I figured that capitals were unambiguous and won't cause a
problem, so I added them to ensure that there isn't something somewhere
that generates capitals that would then fail.
> I don't think your patch is validating LP user or project names, but saw
> it came up in the IRC discussion. Colin's advice matches my own
> experiences. I had tried implementing some validators on bug reports
> way back when but Launchpad's acceptance algorithms are complicated with
> lots of special cases. If you go to the page to create a project, ppa,
> or team, there is text explaining allowed characters. E.g. "A short
> unique name, beginning with a lower-case letter or number, and
> containing only letters, numbers, dots, hyphens, or plus signs." Where
> I ran into trouble is that while these rules may hold for new names,
> there may be old ones or ones from imports that are grandfathered in or
> whatever, and all matter of weirdness when unicode gets involved. So I
> get his point about relying more on quoting such strings for your tool's
> internal use and letting Launchpad decide validity. But for git refs,
> shas, and urls it's a different situation.
Agreed. My question about usernames and project names was really about /github. com/libgit2/ pygit2/ issues/ 1060).
what might end up in a Launchpad git URL, to make sure that my URL
validation wasn't too narrow. Ideally I'd not rely on that at all, but
I'm reluctant to trust that until I'm no longer calling git's CLI. For
that I need a pygit2 API, but we don't have that yet
(https:/
> > I think the spec explains the requirement? I'm reluctant to have it
> > explained in two places - I'd rather that one refers to the other.
>
> I dunno, it strikes me as similar to the argument against having code
> docs because "the docs will get out of date when the code changes,
> better that people just read the code." But I was not suggesting to
> copy the spec, just a brief explanation for those of us too lazy to
> check references.
OK, adjusted.
>
> > > - "to that upload tag" -> "to the upload tag"
> >
> > Will fix - thanks!
Fixed.
> > > db9c906fb0cc4cb b4ea9706999dfff 304d5f7970
> > > - If you plan to add pydocs for the parse_* routines, it could handy to have an example of the line(s) the routine will parse as valid. Or even just having that present in a comment would be useful.
> >
> > Agreed. I'm expecting to add parameterised tests for known good and bad
> > cases.
Examples added to the docstring, and tests added.
> > > - Presumably this relies on code changes to dput and/or launchpad, so links to appropriate references (bugs or commits) in the changelog message or comments might be useful for us to check on status of those external development tasks later.
> >
> > As it happens, we don't need to change Launchpad or dput at all. We just
> > need to provide a mechanism for users to easily insert the correct
> > headers into their changes file, which would mean extra arguments to
> > dpkg-genchanges (or dpkg-buildpackage or debuild, which will pass those
> > arguments through to dpkg-genchanges). I don't know exactly what form
> > this will take yet, but I can file a bug against usd-importer to track
> > that we need a UI.
>
> Do you expect any pushback on the feature from the dpkg-genchanges
> maintainers? If there's a chance of that, might be worth some initial
> discussion with them.
It won't need any dpkg-genchanges changes. It'll just need us to call
dpkg-genchanges (actually dpkg-buildpackage, which passes through) with
some extra parameters.
Robie