On Tue, Jan 26, 2021 at 12:25:16AM -0000, Bryce Harrington wrote:
> I suppose the main thing you're looking for feedback on is about using git cli vs pygit2 for retrieving arbitrary urls.  I see what you mean that the api as-is requires creating a local repository object first; all the examples online I see for this are taking that approach.  I suppose you could define a namespace to import the temporary repositories into, however I also don't see too much problem just calling the git cli directly in this case.

Yes, and thanks. But also, are you happy with my approach to input
validation, given Colin's comments? A couple of pitfalls, for example,
are ensuring that Vcs-Git-Refs: does not contain ':'s, since these could
be used to overwrite arbitrary local refs, and that Vcs-Git-Url: doesn't
use a scheme like ext::, which could be used to execute arbitrary code.

Though these would need Launchpad to give me a bad/malicious changes
file, and Launchpad is already limiting what it accepts to changes files
that have been signed by uploaders. Still, it seemed appropriate to
check these for ourselves, rather than relying on uploaders, since the
importer shouldn't be able to be subverted by arbitrary uploaders
either.

> Taking a step back, git-ubuntu seems to end up needing to use this type of workaround often with pygit2, which makes me wonder if it would be worth the effort to introduce a wrapper class with a convenience API that hides the details of whether pygit2 or the git cli is used?  Seems like it would be cleaner, although not sure it'd be worth the effort.  I vaguely recall we discussed something like this before...  Anyway just a thought.

This does exist - the GitUbuntuRepository class. However I'm reluctant
to add more to that class when there's just one case where a particular
feature is needed. That class provides the git_run method, which we are
using - so it's not any harder to audit "exceptional" git calls this
way.

> Beyond that, a few small notes fwiw:
> 
> 84e0a21bc1ad00497893a9e89b58484d48daef6a
>   - Good refactoring to move warnings up a level
>   - test_validate_upload_tag() - In codedocs the type of expected_result is variable since it could be either a bool or an exception class.  (Might be clearer to have expected_exception as a separate parameter for clarity, rather than overloading expected_result?)

I see what you're saying. I'm on the fence. I'm not sure it'd make
things clearer, since we'd have to provide a None (or equivalent) for
when an exception is raised, which will then make it look like a return
value is expected. I think this is an either-way thing so I'd like to
leave it as it is.

> 9ae83e0d9a6e80e2160531b6023b48fb82c04fc6
>   - I wonder if validate_rich_history()'s code description would benefit from a bit of elaboration now that it's got an extra level of indirection

I think the spec explains the requirement? I'm reluctant to have it
explained in two places - I'd rather that one refers to the other.

>   - "to that upload tag" -> "to the upload tag"

Will fix - thanks!

> ae13bed802f8c4626c2c41ce8eb910c50253b03d
>   - LGTM
> 
> db9c906fb0cc4cbb4ea9706999dfff304d5f7970
>   - If you plan to add pydocs for the parse_* routines, it could handy to have an example of the line(s) the routine will parse as valid.  Or even just having that present in a comment would be useful.

Agreed. I'm expecting to add parameterised tests for known good and bad
cases.

>   - Presumably this relies on code changes to dput and/or launchpad, so links to appropriate references (bugs or commits) in the changelog message or comments might be useful for us to check on status of those external development tasks later.

As it happens, we don't need to change Launchpad or dput at all. We just
need to provide a mechanism for users to easily insert the correct
headers into their changes file, which would mean extra arguments to
dpkg-genchanges (or dpkg-buildpackage or debuild, which will pass those
arguments through to dpkg-genchanges). I don't know exactly what form
this will take yet, but I can file a bug against usd-importer to track
that we need a UI.

>   - Is the output of get_changes_file_url() reliable/dependable enough that we don't need to validate url?  (I'm imagining validating would be checking it's a valid url and that the host is launchpad.net, etc.)
>   - Should urllib.request.urlopen try and catch 404's, et al.?  Assuming the url is always a launchpad url, most cases of fetch errors will likely be launchpad service outages or similar.

I'll answer the above two questions together. In general, the importer
is assuming that everything from Launchpad is valid, and crashes
otherwise. There's currently no distinction between soft and hard fails,
which is perhaps something that might be worth adding across the
codebase one day. Apart from that I think trusting Launchpad to avoid
crashing is OK. This approach has been working well in that we detect
import failures following crashes and address them. For soft failures we
have simply been retrying the failures manually first, which could do
with some automation but seems to be OK in principle.

The exception for input validation for the changes file is because we
don't want user input that is valid to Launchpad to be able to subvert
our behaviour or our unrelated-to-that-upload output. But I think it's
OK if Launchpad itself can subvert us, since it can do that anyway.

There are many things that fetch one input from Launchpad that refers to
another, and if the first is bad, then we'll crash. I think this is
another one of those cases, and so doesn't need special handling.
Another case, for example, is in where we get the URLs to download the
sources for a particular publication from.

So if Launchpad gives us a URL that it then can't serve, then we'll
crash and fail the import, and we'll have a Launchpad bug to
investigate.

>   - Alternate way to check for keys:
> -    has_required_fields = all(
> -        field in changes
> -        for field in ['Vcs-Git', 'Vcs-Git-Commit', 'Vcs-Git-Refs']
> -    )
> -    if not has_required_fields:
> -        return None
> +    if not changes.keys() >= ['Vcs-Git', 'Vcs-Git-Commit', 'Vcs-Git-Refs']:
> +        return None

Nice! I'll adopt that - thanks. Maybe the second operand should be
explicitly a set:

    if not changes.keys() >= {'Vcs-Git', 'Vcs-Git-Commit', 'Vcs-Git-Refs'}:

>   - Test cases are a bit hard to grok

Any suggestions please? The most I managed to do was factor out
populate_rich_history() and temporary_changes_file_url().