Merge ~rbalint/autopkgtest-cloud:more-retries into autopkgtest-cloud:master

Proposed by Balint Reczey
Status: Merged
Merge reported by: Balint Reczey
Merged at revision: bc4d6865bf81ffd8f5958b5dabc503fef9289f94
Proposed branch: ~rbalint/autopkgtest-cloud:more-retries
Merge into: autopkgtest-cloud:master
Diff against target: 156 lines (+65/-53)
1 file modified
worker/worker (+65/-53)
Reviewer Review Type Date Requested Status
Iain Lane (community) Approve
Review via email: mp+384609@code.staging.launchpad.net
To post a comment you must log in.
Revision history for this message
Iain Lane (laney) wrote :

Thanks, but sorry, this isn't going to work as-is.

To understand why, find the code where FAIL_STRINGS_REGEX is used. It's exclusively where a run has *temporarily* failed (exitcode == 16). What we had here was a permanent failure (code 4). We need to check that the exitcode is in (2, 4, 6, 8) and then grep for a different set of strings - obviously factoring out the existing logic in the most sensible way to apply to both cases.

Also the 'Temporary failure' should not be restricted to a package - we can get this happening in runs of any test. Those other ones are restricted to systemd* and linux* because they are packages which genuinely can - and do - break booting of the instances completely when they have a bug. autopkgtest isn't great at catching that so we do it outside.

OOI, do you have any ideas why the clock is moving backwards? That feels like something we should fix properly rather than retrying on, if we can understand why it happens and why provisioning hasn't given us the right time (or if our idea of time got out of step with reality...).

review: Needs Fixing
Revision history for this message
Balint Reczey (rbalint) wrote :

@laney, thanks, I've updated the name resolution related retry and dropped the clock skew one because it is not very frequent.

My suspect is systemd-timesyncd, I've left a comment in LP: #1880839.

Revision history for this message
Iain Lane (laney) wrote :

Still needs fixing to run this code in the case of exit code 2, 4, 6, 8 - second paragraph above.

See

https://git.launchpad.net/autopkgtest-cloud/tree/worker/worker#n534

this is the block that needs updating.

Revision history for this message
Balint Reczey (rbalint) wrote :

Thanks, indeed.
I think retry should not run for 2 and 8, and I'm not sure about 12, 14 and 20, so I've added 4 and 6 as safe bets.

Revision history for this message
Iain Lane (laney) wrote :

Sorry Balint, this still needs fixing. :(

I'm not being clear, let me try some more:

There are two types of failure.

1. Permanent failures, which cause a test to be marked as a fail in the database and frontends (e.g. proposed-migration or the autopkgtest.ubuntu.com website) display this as a failure.
2. Temporary failures, where autopkgtest thinks that *it* caused the failure, or it's otherwise transient. These are reported by exit code 16 and autopkgtest-cloud queues these to be re-run.
  2.5. Sometimes permanent failures are misdetected as temporary ones, so we have this code to convert the two. In that case we override the code 16 to a code 4 ("at least one test failed").

What you want to introduce is a 1.5 that's the kind of opposite of 2.5. We want to convert some kinds of permanent failure into temporary ones, so that they get retried. We do *not* want to generally start retrying all permanent failures, which is what the MP currently would do.

I think that *above* the "if code == 16 ..." line, you should add a check for "if code in (2, 4, 6, 8):", which:

  - Greps the log for one of the *new* (different variable) strings that we want to retry on
  - If found, return as if it were a temporary failure, so that we retry. Do this up to three times.

Of course that code will be *common* with 2.5, so you can probably move some of that into functions and call it in both places.

review: Needs Fixing
Revision history for this message
Balint Reczey (rbalint) wrote :

@laney Could you please take a look again?

Revision history for this message
Iain Lane (laney) wrote :

Alright, I think this looks good, let's try it. Thanks for sticking with it.

review: Approve
Revision history for this message
Balint Reczey (rbalint) wrote :

@laney Could you please merge it? I can't.

Revision history for this message
Balint Reczey (rbalint) wrote :

It is actually merged, sorry.

There was an error fetching revisions from git servers. Please try again in a few minutes. If the problem persists, contact Launchpad support.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
The diff is not available at this time. You can reload the page or download it.

Subscribers

People subscribed via source and target branches