Merge ~till-kamppeter/network-manager:master into network-manager:master

Proposed by Till Kamppeter
Status: Merged
Merged at revision: d4ca8bd9d020f82b63afffa6a905a6901d43a7f7
Proposed branch: ~till-kamppeter/network-manager:master
Merge into: network-manager:master
Diff against target: 143 lines (+69/-12)
1 file modified
debian/tests/nm.py (+69/-12)
Reviewer: Iain Lane (community)
Review status: Approve
Review via email: mp+369586@code.staging.launchpad.net

Commit message

nm.py autopkgtest: Added timers to make the main loops time out if the asynchronous processes do not finish.

Description of the change

This change prevents the script from getting stuck in the main loop when the asynchronous operations fail, which previously led the autopkgtest application to kill it after hours with no clue about what went wrong. If the asynchronous operations do not finish within 5 minutes, the main loops that wait for them are stopped by a timer.
This does not influence the probability of a test passing or failing due to tasks completing too slowly. It only improves debuggability.

Iain Lane (laney) wrote :

Thanks for working on this.

I've seen this kind of timeout go wrong when a machine is slow but the test is legitimately proceeding. Then you get a lot of "bump timeout"-style changes and the timeout gets in the way.

Do you know that autopkgtest has various "--timeout" options (notably "--timeout-test")? If you're having problems when iterating to debug the tests failing for another reason, have you considered setting that for yourself instead?

Do you know which operations are hanging and why? Would it be possible to stop that happening?

I'll give you some specific review comments inline.

review: Needs Fixing
Till Kamppeter (till-kamppeter) wrote :

Laney, I have fixed all the issues you mentioned inline now.

The advantage of the main loop timeout in the script is that if the timeout fires and the test fails because of it, a Python traceback is shown and one sees where the failure happened. If I call the autopkgtest utility with a shortened timeout, the hanging script simply gets killed earlier and I do not get any output about where the hang happened. So the timeout in the script is better for debuggability.

Iain Lane (laney) wrote :

I'm worried that adding random timeouts is *worse* for reliability though, if they get set wrong and start causing failures themselves. You also only time out at a couple of specific points, but presumably other parts could start behaving badly and you wouldn't get any improved debuggability then.

So:

Can we get to the bottom of why `add_and_activate_connection` is hanging?

Is it possible to add much more debugging output into the tests and run them verbosely so that we don't need to time out ourselves, but can see from the output where we've got up to?

Till Kamppeter (till-kamppeter) wrote :

The reason I added these timeouts is that these are the only points in the whole script where it can actually hang indefinitely. add_and_activate_connection() is the only asynchronous method used in the script, and it requires running a GLib main loop to wait for the background process to finish and to carry on exactly when it has finished. The disadvantage of the main loop is that if the background process never finishes, the main loop never ends and the script finally gets killed by autopkgtest after ~2 hours.

Everywhere else, when the script has to wait for an action to finish, it uses the assertEventually() method, which checks a state repeatedly, exits successfully once the state has been reached, and exits with failure if the state has not been reached after a given timeout.
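
The polling pattern described here can be sketched as follows. This is a stand-alone illustration using only the standard library, not the assertEventually() implementation from nm.py; the function name `assert_eventually` and its parameters are assumptions chosen for the example.

```python
import time


def assert_eventually(condition, timeout=5.0, interval=0.1,
                      message="condition not met"):
    """Poll condition() until it returns a true value or the timeout expires.

    Returns normally as soon as the condition holds and raises
    AssertionError once `timeout` seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise AssertionError(f"{message} (waited {timeout} s)")
        # Sleep briefly so the loop does not spin on the CPU.
        time.sleep(interval)
```

Because the wait is bounded, a state that is never reached produces a failure with a traceback instead of an indefinite hang.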

With my change, every action that has to be waited for is waited for only for a given time, not indefinitely.

The failure that actually showed me that the main loop could hang was the callback function not having been adapted to the API change. Once the callback function was corrected, the hang no longer occurred.

Iain Lane (laney) wrote :

What about my last question? If you are dead set on introducing this then I'll shut up, but I'm telling you that I've seen timeouts like this go wrong more times than I'd like and so I'm trying to push you to avoid it.

I gave you more inline comments too in case you didn't see those.

Till Kamppeter (till-kamppeter) wrote :

I have also already tried to make the tests more verbose, but adding print(...) lines does not make the output appear in the logs. It gets filtered somewhere in the complex magic of pitti's test framework.

Till Kamppeter (till-kamppeter) wrote :

I see that introducing a timeout in the main loops makes things overly complicated. If there is a way to make the tests verbose and to get the verbose output into the logs even when autopkgtest kills the script because it hangs, one could simply leave the main loops alone.

Iain Lane (laney) wrote :

OK, well you've got all my input now, so it's up to you which path you decide to go down.

Till Kamppeter (till-kamppeter) wrote :

I have now fixed the issues remarked inline. Note that when the connection succeeds, the timer is removed by the "GLib.source_remove(self.timeout_tag)" call in the add_activate_cb() callback functions. I also did not find any constants that give names to the error numbers (as C libraries define in their .h files), so the exception handlers check for error 19.

During further tests I found that in the Wi-Fi case most failures occur because the kernel-based AP emulator has not come up when hostapd is started, so hostapd fails the test before Network Manager is even started.

In the Ethernet case, failures actually do happen (on some starts) in the connection main loop, but I do not know whether this is a bug in Network Manager or whether something has not started correctly before this step.

Revision history for this message
Till Kamppeter (till-kamppeter) wrote :

The last commit (65f104c) should solve bug 1836209.

Iain Lane (laney) wrote :

OK Till, thanks, I'm going to upload this now because I've run it a bunch of times here and it hasn't failed. I have an uneasy feeling about removing code without understanding why it's going wrong, so I think it would be good if we could understand that - but having a passing NM is also valuable to us.

review: Approve
