On Tue, Jan 26, 2021 at 12:25:16AM -0000, Bryce Harrington wrote: > I suppose the main thing you're looking for feedback on is about using git cli vs pygit2 for retrieving arbitrary urls. I see what you mean that the api as-is requires creating a local repository object first; all the examples online I see for this are taking that approach. I suppose you could define a namespace to import the temporary repositories into, however I also don't see too much problem just calling the git cli directly in this case. Yes, and thanks. But also, are you happy with my approach to input validation, given Colin's comments? A couple of pitfalls, for example, are ensuring that Vcs-Git-Refs: does not contain ':'s, since these could be used to overwrite arbitrary local refs, and that Vcs-Git-Url: doesn't use a scheme like ext::, which could be used to execute arbitrary code. Though these would need Launchpad to give me a bad/malicious changes file, and Launchpad is already limiting what it accepts to changes files that have been signed by uploaders. Still, it seemed appropriate to check these for ourselves, rather than relying on uploaders, since the importer shouldn't be able to be subverted by arbitrary uploaders either. > Taking a step back, git-ubuntu seems to end up needing to use this type of workaround often with pygit2, which makes me wonder if it would be worth the effort to introduce a wrapper class with a convenience API that hides the details of whether pygit2 or the git cli is used? Seems like it would be cleaner, although not sure it'd be worth the effort. I vaguely recall we discussed something like this before... Anyway just a thought. This does exist - the GitUbuntuRepository class. However I'm reluctant to add more to that class when there's just one case where a particular feature is needed. That class provides the git_run method, which we are using - so it's not any harder to audit "exceptional" git calls this way. > Beyond that, a few small notes fwiw: > > 84e0a21bc1ad00497893a9e89b58484d48daef6a > - Good refactoring to move warnings up a level > - test_validate_upload_tag() - In codedocs the type of expected_result is variable since it could be either a bool or an exception class. (Might be clearer to have expected_exception as a separate parameter for clarity, rather than overloading expected_result?) I see what you're saying. I'm on the fence. I'm not sure it'd make things clearer, since we'd have to provide a None (or equivalent) for when an exception is raised, which will then make it look like a return value is expected. I think this is an either-way thing so I'd like to leave it as it is. > 9ae83e0d9a6e80e2160531b6023b48fb82c04fc6 > - I wonder if validate_rich_history()'s code description would benefit from a bit of elaboration now that it's got an extra level of indirection I think the spec explains the requirement? I'm reluctant to have it explained in two places - I'd rather that one refers to the other. > - "to that upload tag" -> "to the upload tag" Will fix - thanks! > ae13bed802f8c4626c2c41ce8eb910c50253b03d > - LGTM > > db9c906fb0cc4cbb4ea9706999dfff304d5f7970 > - If you plan to add pydocs for the parse_* routines, it could handy to have an example of the line(s) the routine will parse as valid. Or even just having that present in a comment would be useful. Agreed. I'm expecting to add parameterised tests for known good and bad cases. > - Presumably this relies on code changes to dput and/or launchpad, so links to appropriate references (bugs or commits) in the changelog message or comments might be useful for us to check on status of those external development tasks later. As it happens, we don't need to change Launchpad or dput at all. We just need to provide a mechanism for users to easily insert the correct headers into their changes file, which would mean extra arguments to dpkg-genchanges (or dpkg-buildpackage or debuild, which will pass those arguments through to dpkg-genchanges). I don't know exactly what form this will take yet, but I can file a bug against usd-importer to track that we need a UI. > - Is the output of get_changes_file_url() reliable/dependable enough that we don't need to validate url? (I'm imagining validating would be checking it's a valid url and that the host is launchpad.net, etc.) > - Should urllib.request.urlopen try and catch 404's, et al.? Assuming the url is always a launchpad url, most cases of fetch errors will likely be launchpad service outages or similar. I'll answer the above two questions together. In general, the importer is assuming that everything from Launchpad is valid, and crashes otherwise. There's currently no distinction between soft and hard fails, which is perhaps something that might be worth adding across the codebase one day. Apart from that I think trusting Launchpad to avoid crashing is OK. This approach has been working well in that we detect import failures following crashes and address them. For soft failures we have simply been retrying the failures manually first, which could do with some automation but seems to be OK in principle. The exception for input validation for the changes file is because we don't want user input that is valid to Launchpad to be able to subvert our behaviour or our unrelated-to-that-upload output. But I think it's OK if Launchpad itself can subvert us, since it can do that anyway. There are many things that fetch one input from Launchpad that refers to another, and if the first is bad, then we'll crash. I think this is another one of those cases, and so doesn't need special handling. Another case, for example, is in where we get the URLs to download the sources for a particular publication from. So if Launchpad gives us a URL that it then can't serve, then we'll crash and fail the import, and we'll have a Launchpad bug to investigate. > - Alternate way to check for keys: > - has_required_fields = all( > - field in changes > - for field in ['Vcs-Git', 'Vcs-Git-Commit', 'Vcs-Git-Refs'] > - ) > - if not has_required_fields: > - return None > + if not changes.keys() >= ['Vcs-Git', 'Vcs-Git-Commit', 'Vcs-Git-Refs']: > + return None Nice! I'll adopt that - thanks. Maybe the second operand should be explicitly a set: if not changes.keys() >= {'Vcs-Git', 'Vcs-Git-Commit', 'Vcs-Git-Refs'}: > - Test cases are a bit hard to grok Any suggestions please? The most I managed to do was factor out populate_rich_history() and temporary_changes_file_url().