Merge lp://staging/~wallyworld/python-jujuclient/retry-on-upgrade into lp://staging/python-jujuclient

Proposed by Ian Booth
Status: Merged
Merged at revision: 59
Proposed branch: lp://staging/~wallyworld/python-jujuclient/retry-on-upgrade
Merge into: lp://staging/python-jujuclient
Diff against target: 101 lines (+47/-8)
2 files modified
jujuclient.py (+27/-7)
test_jujuclient.py (+20/-1)
To merge this branch: bzr merge lp://staging/~wallyworld/python-jujuclient/retry-on-upgrade
Reviewer Review Type Date Requested Status
Adam Collard (community) Needs Fixing
Kapil Thangavelu Approve
John A Meinel (community) Approve
Review via email: mp+260658@code.staging.launchpad.net

Commit message

A small moment of time after a freshly bootstrapped Juju environment accepts connections, it may be in an upgrading state.

This branch changes the RPC calls so that if an "upgrade in progress" error is reported, then then RPC call is retried. The retries occur every 1 second, up to a minute.

This change allows the deployer to be more robust to Juju upgrades, since it will now not simply error out.

Description of the change

Fixes bug bug 146017

A small moment of time after a freshly bootstrapped Juju environment accepts connections, it may be in an upgrading state.

This branch changes the RPC calls so that if an "upgrade in progress" error is reported, then then RPC call is retried. The retries occur every 1 second, up to a minute.

This change allows the deployer to be more robust to Juju upgrades, since it will now not simply error out.

To post a comment you must log in.
60. By Ian Booth

Fix test

Revision history for this message
John A Meinel (jameinel) wrote :

Overall I think LGTM. I do have a couple comments.

Revision history for this message
John A Meinel (jameinel) :
review: Approve
61. By Ian Booth

Fix formatting

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

i'd suggest UPPER CASING CONSTANTS, the block their being inserted into is full of instance variables where as these are not.

How long is the upgrade expected to last? On new systems 5-10s depending on machine, and extant envs o(n) size of env?

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

lgtm

review: Approve
Revision history for this message
Adam Collard (adam-collard) :
review: Needs Fixing
Revision history for this message
Adam Collard (adam-collard) :
Revision history for this message
Adam Collard (adam-collard) :
62. By Ian Booth

Remove stupid ++

63. By Ian Booth

Tweak while loop for rpc retries

Revision history for this message
Adam Collard (adam-collard) wrote :

I don't think the proposed change will fix the bug.

The error in the linked bug occurred when deployer tried to look at the status (called .get_stat()), which made a new watcher and (tried to) started it. That's when the error that Juju was upgrading occurred.

This branch would make that RPC call to start a new watcher get retried, whilst Juju is upgrading. However, once the upgrade is complete, IIUC one would need to re-authenticate to Juju (and probably renegotiate facade versioning - we've just got a new version of Juju!).

In other words, the retry is at the wrong level.

Revision history for this message
Ian Booth (wallyworld) wrote :

> I don't think the proposed change will fix the bug.
>
> The error in the linked bug occurred when deployer tried to look at the status
> (called .get_stat()), which made a new watcher and (tried to) started it.
> That's when the error that Juju was upgrading occurred.
>
> This branch would make that RPC call to start a new watcher get retried,
> whilst Juju is upgrading. However, once the upgrade is complete, IIUC one
> would need to re-authenticate to Juju (and probably renegotiate facade
> versioning - we've just got a new version of Juju!).
>
> In other words, the retry is at the wrong level.

There's appears to be a little misunderstanding of what this fix does. It does not handle the case where Juju disconnects clients when the agents reboot after an agent upgrade.

What the fix does is this. When the Juju agent starts up, the API is in a limited "upgrading" mode while Juju a) performs any version upgrade steps, and b) checks if any agent upgrades are required. Until both of these complete, any attempts to do much more than get status will result in an "upgrade in progress" error. There are no agent restarts during any of this. So the deployer will not become disconnected. But it will get confused in the short window where an "upgrade in progress" error is reported, so this MP teaches it how to deal with that. It simply retries until the full API becomes available, which is what happens 99.9999% of the time when no agent upgrades are necessary.

Revision history for this message
Ian Booth (wallyworld) wrote :

TL;DR; this MP fixes bug 1460171 not bug 1455260

Revision history for this message
Ian Booth (wallyworld) wrote :

> i'd suggest UPPER CASING CONSTANTS, the block their being inserted into is
> full of instance variables where as these are not.
>
> How long is the upgrade expected to last? On new systems 5-10s depending on
> machine, and extant envs o(n) size of env?

I didn't treat those values as constants, just member variables with defaults. I left it open for them to be changed by set_reconnect_params() if required later.

I have no exact figures on how long upgrade steps (not agent upgrades) will take to run. Typically seconds from what has been observed so far. So 1 minute seems like more than enough. Note this is upgrade steps which run once on the state server, not agent upgrades which requires each agent to restart.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
The diff is not available at this time. You can reload the page or download it.

Subscribers

People subscribed via source and target branches

to all changes: