python-jujuclient

Merge lp://staging/~wallyworld/python-jujuclient/retry-on-upgrade into lp://staging/python-jujuclient

retry-on-upgrade
Merge into trunk

Proposed by Ian Booth on 2015-05-31

Status:

Merged

Merged at revision:

59

Proposed branch:

lp://staging/~wallyworld/python-jujuclient/retry-on-upgrade

Merge into:

lp://staging/python-jujuclient

Diff against target:

101 lines (+47/-8)

2 files modified

jujuclient.py (+27/-7)
test_jujuclient.py (+20/-1)

To merge this branch:

bzr merge lp://staging/~wallyworld/python-jujuclient/retry-on-upgrade

Related bugs:

Bug #1460171: Deployer fails because juju thinks it is upgrading

High

Fix Released

Remove

Link a bug report

Reviewer	Review Type	Date Requested	Status
Adam Collard (community)			Needs Fixing on 2015-06-01
Kapil Thangavelu		2015-05-31	Approve on 2015-06-01
John A Meinel (community)			Approve on 2015-06-01
Review via email: mp+260658@code.staging.launchpad.net

Commit message

A small moment of time after a freshly bootstrapped Juju environment accepts connections, it may be in an upgrading state.

This branch changes the RPC calls so that if an "upgrade in progress" error is reported, then then RPC call is retried. The retries occur every 1 second, up to a minute.

This change allows the deployer to be more robust to Juju upgrades, since it will now not simply error out.

Description of the change

Fixes bug bug 146017

A small moment of time after a freshly bootstrapped Juju environment accepts connections, it may be in an upgrading state.

This branch changes the RPC calls so that if an "upgrade in progress" error is reported, then then RPC call is retried. The retries occur every 1 second, up to a minute.

This change allows the deployer to be more robust to Juju upgrades, since it will now not simply error out.

lp://staging/~wallyworld/python-jujuclient/retry-on-upgrade updated on 2015-06-01

60. By Ian Booth on 2015-06-01: Fix test

Revision history for this message

John A Meinel (jameinel) wrote on 2015-06-01:

#

Overall I think LGTM. I do have a couple comments.

Revision history for this message

John A Meinel (jameinel) on 2015-06-01:

#

review: Approve

lp://staging/~wallyworld/python-jujuclient/retry-on-upgrade updated on 2015-06-01

61. By Ian Booth on 2015-06-01: Fix formatting

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2015-06-01:

#

i'd suggest UPPER CASING CONSTANTS, the block their being inserted into is full of instance variables where as these are not.

How long is the upgrade expected to last? On new systems 5-10s depending on machine, and extant envs o(n) size of env?

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2015-06-01:

#

lgtm

review: Approve

Revision history for this message

Adam Collard (adam-collard) on 2015-06-01:

#

review: Needs Fixing

Revision history for this message

Adam Collard (adam-collard) on 2015-06-01:

#

Revision history for this message

Adam Collard (adam-collard) on 2015-06-01:

#

lp://staging/~wallyworld/python-jujuclient/retry-on-upgrade updated on 2015-06-01

62. By Ian Booth on 2015-06-01: Remove stupid ++
63. By Ian Booth on 2015-06-01: Tweak while loop for rpc retries

Revision history for this message

Adam Collard (adam-collard) wrote on 2015-06-01:

#

I don't think the proposed change will fix the bug.

The error in the linked bug occurred when deployer tried to look at the status (called .get_stat()), which made a new watcher and (tried to) started it. That's when the error that Juju was upgrading occurred.

This branch would make that RPC call to start a new watcher get retried, whilst Juju is upgrading. However, once the upgrade is complete, IIUC one would need to re-authenticate to Juju (and probably renegotiate facade versioning - we've just got a new version of Juju!).

In other words, the retry is at the wrong level.

Revision history for this message

Ian Booth (wallyworld) wrote on 2015-06-01:

#

> I don't think the proposed change will fix the bug.
>
> The error in the linked bug occurred when deployer tried to look at the status
> (called .get_stat()), which made a new watcher and (tried to) started it.
> That's when the error that Juju was upgrading occurred.
>
> This branch would make that RPC call to start a new watcher get retried,
> whilst Juju is upgrading. However, once the upgrade is complete, IIUC one
> would need to re-authenticate to Juju (and probably renegotiate facade
> versioning - we've just got a new version of Juju!).
>
> In other words, the retry is at the wrong level.

There's appears to be a little misunderstanding of what this fix does. It does not handle the case where Juju disconnects clients when the agents reboot after an agent upgrade.

What the fix does is this. When the Juju agent starts up, the API is in a limited "upgrading" mode while Juju a) performs any version upgrade steps, and b) checks if any agent upgrades are required. Until both of these complete, any attempts to do much more than get status will result in an "upgrade in progress" error. There are no agent restarts during any of this. So the deployer will not become disconnected. But it will get confused in the short window where an "upgrade in progress" error is reported, so this MP teaches it how to deal with that. It simply retries until the full API becomes available, which is what happens 99.9999% of the time when no agent upgrades are necessary.

Revision history for this message

Ian Booth (wallyworld) wrote on 2015-06-01:

#

TL;DR; this MP fixes bug 1460171 not bug 1455260

Revision history for this message

Ian Booth (wallyworld) wrote on 2015-06-01:

#

> i'd suggest UPPER CASING CONSTANTS, the block their being inserted into is
> full of instance variables where as these are not.
>
> How long is the upgrade expected to last? On new systems 5-10s depending on
> machine, and extant envs o(n) size of env?

I didn't treat those values as constants, just member variables with defaults. I left it open for them to be changed by set_reconnect_params() if required later.

I have no exact figures on how long upgrade steps (not agent upgrades) will take to run. Typically seconds from what has been observed so far. So 1 minute seems like more than enough. Note this is upgrade steps which run once on the state server, not agent upgrades which requires each agent to restart.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk

The diff is not available at this time. You can reload the page or download it.

Subscribers

People subscribed via source and target branches

to all changes:

Ian Booth

juju-deployers