Merge ~racb/git-ubuntu:rich-history-from-changes-file into git-ubuntu:master

Proposed by Robie Basak
Status: Merged
Merged at revision: be7fa10f2ef0d8d7e8437c61593400d6c0a4a37b
Proposed branch: ~racb/git-ubuntu:rich-history-from-changes-file
Merge into: git-ubuntu:master
Diff against target: 1138 lines (+858/-56)
6 files modified
debian/control (+1/-0)
gitubuntu/importer.py (+369/-34)
gitubuntu/importer_test.py (+472/-22)
gitubuntu/source_information.py (+13/-0)
setup.py (+1/-0)
snap/snapcraft.yaml (+2/-0)
Reviewer Review Type Date Requested Status
Bryce Harrington Approve
Server Team CI bot continuous-integration Approve
git-ubuntu developers Pending
Review via email: mp+396756@code.staging.launchpad.net

Commit message

Make Jenkins happy

To post a comment you must log in.
Revision history for this message
Robie Basak (racb) wrote :

Request for some input from Bryce please. I suggest you start from the fetch_rich_history_from_changes_file() docstring which I think should be good enough to give you an intro to what I'm up to. See also: https://irclogs.ubuntu.com/2021/01/22/%23launchpad.html

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:fd6bc2c91e956dcc5b5c759518e44e7bbabd5057

No commit message was specified in the merge proposal. Click on the following link and set the commit message (if you want jenkins to rebuild you need to trigger it yourself):
https://code.launchpad.net/~racb/usd-importer/+git/usd-importer/+merge/396756/+edit-commit-message

https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/4/
Executed test runs:
    SUCCESS: VM Setup
    SUCCESS: Build
    SUCCESS: Unit Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/4//rebuild

review: Needs Fixing (continuous-integration)
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:db9c906fb0cc4cbb4ea9706999dfff304d5f7970

No commit message was specified in the merge proposal. Click on the following link and set the commit message (if you want jenkins to rebuild you need to trigger it yourself):
https://code.launchpad.net/~racb/usd-importer/+git/usd-importer/+merge/396756/+edit-commit-message

https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/5/
Executed test runs:
    SUCCESS: VM Setup
    SUCCESS: Build
    SUCCESS: Unit Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/5//rebuild

review: Needs Fixing (continuous-integration)
Revision history for this message
Bryce Harrington (bryce) wrote :

Overall from first pass review looks pretty solid. I really like the approach of relying on the .changes file for carrying the Vcs-* stuff. Not having to do upload tagging will be a welcome usability improvement.

I suppose the main thing you're looking for feedback on is about using git cli vs pygit2 for retrieving arbitrary urls. I see what you mean that the api as-is requires creating a local repository object first; all the examples online I see for this are taking that approach. I suppose you could define a namespace to import the temporary repositories into, however I also don't see too much problem just calling the git cli directly in this case.

Taking a step back, git-ubuntu seems to end up needing to use this type of workaround often with pygit2, which makes me wonder if it would be worth the effort to introduce a wrapper class with a convenience API that hides the details of whether pygit2 or the git cli is used? Seems like it would be cleaner, although not sure it'd be worth the effort. I vaguely recall we discussed something like this before... Anyway just a thought.

Beyond that, a few small notes fwiw:

84e0a21bc1ad00497893a9e89b58484d48daef6a
  - Good refactoring to move warnings up a level
  - test_validate_upload_tag() - In codedocs the type of expected_result is variable since it could be either a bool or an exception class. (Might be clearer to have expected_exception as a separate parameter for clarity, rather than overloading expected_result?)

9ae83e0d9a6e80e2160531b6023b48fb82c04fc6
  - I wonder if validate_rich_history()'s code description would benefit from a bit of elaboration now that it's got an extra level of indirection
  - "to that upload tag" -> "to the upload tag"

ae13bed802f8c4626c2c41ce8eb910c50253b03d
  - LGTM

db9c906fb0cc4cbb4ea9706999dfff304d5f7970
  - If you plan to add pydocs for the parse_* routines, it could handy to have an example of the line(s) the routine will parse as valid. Or even just having that present in a comment would be useful.
  - Presumably this relies on code changes to dput and/or launchpad, so links to appropriate references (bugs or commits) in the changelog message or comments might be useful for us to check on status of those external development tasks later.
  - Is the output of get_changes_file_url() reliable/dependable enough that we don't need to validate url? (I'm imagining validating would be checking it's a valid url and that the host is launchpad.net, etc.)
  - Should urllib.request.urlopen try and catch 404's, et al.? Assuming the url is always a launchpad url, most cases of fetch errors will likely be launchpad service outages or similar.
  - Alternate way to check for keys:
- has_required_fields = all(
- field in changes
- for field in ['Vcs-Git', 'Vcs-Git-Commit', 'Vcs-Git-Refs']
- )
- if not has_required_fields:
- return None
+ if not changes.keys() >= ['Vcs-Git', 'Vcs-Git-Commit', 'Vcs-Git-Refs']:
+ return None
  - Test cases are a bit hard to grok

review: Needs Information
Revision history for this message
Robie Basak (racb) wrote :
Download full text (6.4 KiB)

On Tue, Jan 26, 2021 at 12:25:16AM -0000, Bryce Harrington wrote:
> I suppose the main thing you're looking for feedback on is about using git cli vs pygit2 for retrieving arbitrary urls. I see what you mean that the api as-is requires creating a local repository object first; all the examples online I see for this are taking that approach. I suppose you could define a namespace to import the temporary repositories into, however I also don't see too much problem just calling the git cli directly in this case.

Yes, and thanks. But also, are you happy with my approach to input
validation, given Colin's comments? A couple of pitfalls, for example,
are ensuring that Vcs-Git-Refs: does not contain ':'s, since these could
be used to overwrite arbitrary local refs, and that Vcs-Git-Url: doesn't
use a scheme like ext::, which could be used to execute arbitrary code.

Though these would need Launchpad to give me a bad/malicious changes
file, and Launchpad is already limiting what it accepts to changes files
that have been signed by uploaders. Still, it seemed appropriate to
check these for ourselves, rather than relying on uploaders, since the
importer shouldn't be able to be subverted by arbitrary uploaders
either.

> Taking a step back, git-ubuntu seems to end up needing to use this type of workaround often with pygit2, which makes me wonder if it would be worth the effort to introduce a wrapper class with a convenience API that hides the details of whether pygit2 or the git cli is used? Seems like it would be cleaner, although not sure it'd be worth the effort. I vaguely recall we discussed something like this before... Anyway just a thought.

This does exist - the GitUbuntuRepository class. However I'm reluctant
to add more to that class when there's just one case where a particular
feature is needed. That class provides the git_run method, which we are
using - so it's not any harder to audit "exceptional" git calls this
way.

> Beyond that, a few small notes fwiw:
>
> 84e0a21bc1ad00497893a9e89b58484d48daef6a
> - Good refactoring to move warnings up a level
> - test_validate_upload_tag() - In codedocs the type of expected_result is variable since it could be either a bool or an exception class. (Might be clearer to have expected_exception as a separate parameter for clarity, rather than overloading expected_result?)

I see what you're saying. I'm on the fence. I'm not sure it'd make
things clearer, since we'd have to provide a None (or equivalent) for
when an exception is raised, which will then make it look like a return
value is expected. I think this is an either-way thing so I'd like to
leave it as it is.

> 9ae83e0d9a6e80e2160531b6023b48fb82c04fc6
> - I wonder if validate_rich_history()'s code description would benefit from a bit of elaboration now that it's got an extra level of indirection

I think the spec explains the requirement? I'm reluctant to have it
explained in two places - I'd rather that one refers to the other.

> - "to that upload tag" -> "to the upload tag"

Will fix - thanks!

> ae13bed802f8c4626c2c41ce8eb910c50253b03d
> - LGTM
>
> db9c906fb0cc4cbb4ea9706999dfff304d5f7970
> - If you plan to add pydocs for...

Read more...

Revision history for this message
Bryce Harrington (bryce) wrote :
Download full text (10.3 KiB)

On Tue, Jan 26, 2021 at 03:18:22PM -0000, Robie Basak wrote:
> On Tue, Jan 26, 2021 at 12:25:16AM -0000, Bryce Harrington wrote:
> > I suppose the main thing you're looking for feedback on is about using git cli vs pygit2 for retrieving arbitrary urls. I see what you mean that the api as-is requires creating a local repository object first; all the examples online I see for this are taking that approach. I suppose you could define a namespace to import the temporary repositories into, however I also don't see too much problem just calling the git cli directly in this case.
>
> Yes, and thanks. But also, are you happy with my approach to input
> validation, given Colin's comments? A couple of pitfalls, for example,
> are ensuring that Vcs-Git-Refs: does not contain ':'s, since these could
> be used to overwrite arbitrary local refs, and that Vcs-Git-Url: doesn't
> use a scheme like ext::, which could be used to execute arbitrary code.

Yeah I like the approach you're taking of testing positively for allowed
characters. Then, the restrictions can be relaxed if/when new chars or
special cases are encountered.

For the URL regex, I ran it over Vcs-Git lines in the control files of
packages I have checked out locally:

  for file in $(find . -name "control"); do grep Vcs-Git ${file} | xargs ~/check-re.py; done

Here's a few examples of types of things that didn't match:

  https://git.launchpad.net/~ubuntu-server-dev/ubuntu/+source/rabbitmq-server
  git://anonscm.debian.org/openstack/python-seamicroclient.git
  <email address hidden>:sosreport-team/sosreport.git
  https://salsa.debian.org/vdr-team/#PACKAGE#.git
  https://salsa.debian.org/ruby-team/<%=

The last two look more like errors than something that should be
supported. + and ^git might be worth considering.

Are you aware of Vcs-Git strings with ? in them? That char may be
pretty rare for git urls.

I suppose the more fundamental question than how git-ubuntu handles
these cases, is how they'll be handled by dput->launchpad. Presumably
they'll be filtering the links to an allowed subset? Or
quoting/escaping invalid chars? If we can assume LP is filtering the
weird stuff, that of course simplifies what needs to be tested for
here.

I notice you're including * as an allowed character for git refs; is
that intended, like for supporting wildcarding in refs?

Could you do a sampling of refs from the imported packages to see what
range of characters this'll likely be seeing in practice? (Might also
be useful to run the dump through to test it.) Old projects and
projects that have been migrated from one VCS to another sometimes have
weird bits in their ancient history.

Minor cosmetic, but in two of the regexes you list letters then numbers,
and in the third the reverse. This would be more consistent:

+VCS_GIT_COMMIT_VALIDATION = re.compile(r'[A-Za-z0-9]{40}')

Technically I suppose you really only need [a-f0-9]{40} for SHAs?

I don't think your patch is validating LP user or project names, but saw
it came up in the IRC discussion. Colin's advice matches my own
experiences. I had tried implementing some validators on bug reports
way back when but Launchpad's acceptance algorithms are...

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:0a97e045ccea42a1980362453c79f6a7bf69ec55
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/6/
Executed test runs:
    SUCCESS: VM Setup
    FAILED: Build

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/6//rebuild

review: Needs Fixing (continuous-integration)
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:0a97e045ccea42a1980362453c79f6a7bf69ec55
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/7/
Executed test runs:
    SUCCESS: VM Setup
    SUCCESS: Build
    SUCCESS: Unit Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/7//rebuild

review: Approve (continuous-integration)
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:efad8227a098cd46cce3b59d853c3f8a6cb84c55
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/38/
Executed test runs:
    SUCCESS: VM Setup
    SUCCESS: Build
    SUCCESS: VM Reset
    SUCCESS: Unit Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/38//rebuild

review: Approve (continuous-integration)
Revision history for this message
Robie Basak (racb) wrote :
Download full text (7.3 KiB)

I think I've addressed all outstanding feedback so far. Some specifics:

On Tue, Jan 26, 2021 at 10:27:16PM -0000, Bryce Harrington wrote:
> For the URL regex, I ran it over Vcs-Git lines in the control files of
> packages I have checked out locally:
>
> for file in $(find . -name "control"); do grep Vcs-Git ${file} | xargs ~/check-re.py; done
>
> Here's a few examples of types of things that didn't match:
>
> https://git.launchpad.net/~ubuntu-server-dev/ubuntu/+source/rabbitmq-server
> git://anonscm.debian.org/openstack/python-seamicroclient.git
> <email address hidden>:sosreport-team/sosreport.git
> https://salsa.debian.org/vdr-team/#PACKAGE#.git
> https://salsa.debian.org/ruby-team/<%=
>
> The last two look more like errors than something that should be
> supported. + and ^git might be worth considering.

I'm only accepting https:// URLs for now. I've incorporated all the
above patterns into existing tests, including the ones that should be
rejected for now.

> Are you aware of Vcs-Git strings with ? in them? That char may be
> pretty rare for git urls.

Good shout. Added into the regexp and added a test.

> I suppose the more fundamental question than how git-ubuntu handles
> these cases, is how they'll be handled by dput->launchpad. Presumably
> they'll be filtering the links to an allowed subset? Or
> quoting/escaping invalid chars? If we can assume LP is filtering the
> weird stuff, that of course simplifies what needs to be tested for
> here.

Actually dput/Launchpad won't filter anything. Whatever the uploader
puts in, we'll get out at the other end. I care from the importer's end
that we aren't subverted by malicious input, and that we are able to
fetch the URL if we decide it's acceptable (barring network issues).

> I notice you're including * as an allowed character for git refs; is
> that intended, like for supporting wildcarding in refs?

It was intended, but I've now removed that following Colin's feedback,
and changed the header name from Vcs-Git-Refs to Vcs-Git-Ref. I've
adjusted tests accordingly.

> Could you do a sampling of refs from the imported packages to see what
> range of characters this'll likely be seeing in practice? (Might also
> be useful to run the dump through to test it.) Old projects and
> projects that have been migrated from one VCS to another sometimes have
> weird bits in their ancient history.

I'm not expecting to get refs like from existing Vcs-Git repositories.
These aren't normally based on git-ubuntu branches, and probably will
continue not to be. Instead I expect people to branch from git-ubuntu
branches, make their changes, push to a new ref somewhere in Launchpad,
and then provide those in Vcs-Git/Vcs-Git-Ref/Vcs-Git-Commit in their
changes file. This might well mean that Vcs-Git in changes files for
git-ubuntu will typically not match the Vcs-Git in debian/control.

My goal in testing my validation in tests for parsing/validating these
fields is therefore slightly different. What URLs and refs will
git-ubuntu users push to, _on Launchpad_ (as this will be an initial
requirement), and will we correctly validate all those?

My thinking for a belt and braces approach here i...

Read more...

Revision history for this message
Bryce Harrington (bryce) wrote :

Thanks, the updates look good and answers to all the questions sound fine.
I did one more code review pass and didn't spot anything to mention.

They critical piece for this patch is the filtration of potential malicious input. It looks like more than enough thought has gone into this versus the (probably low) level of risk we're likely to see in practice.

LGTM, +1.

review: Approve

There was an error fetching revisions from git servers. Please try again in a few minutes. If the problem persists, contact Launchpad support.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
The diff is not available at this time. You can reload the page or download it.

Subscribers

People subscribed via source and target branches