Merge lp://staging/~fwereade/pyjuju/fix-charm-upgrade into lp://staging/pyjuju

Proposed by William Reade
Status: Merged
Approved by: Kapil Thangavelu
Approved revision: 458
Merged at revision: 458
Proposed branch: lp://staging/~fwereade/pyjuju/fix-charm-upgrade
Merge into: lp://staging/pyjuju
Prerequisite: lp://staging/~fwereade/pyjuju/dont-stop-workflows
Diff against target: 3535 lines (+1424/-808)
15 files modified
docs/source/internals/unit-agent-persistence.rst (+138/-0)
juju/agents/tests/test_unit.py (+79/-141)
juju/agents/unit.py (+59/-136)
juju/control/tests/test_resolved.py (+16/-8)
juju/control/tests/test_status.py (+11/-37)
juju/control/tests/test_upgrade_charm.py (+4/-2)
juju/errors.py (+11/-1)
juju/lib/statemachine.py (+105/-28)
juju/lib/tests/test_statemachine.py (+214/-76)
juju/state/service.py (+10/-10)
juju/tests/test_errors.py (+31/-9)
juju/unit/lifecycle.py (+122/-24)
juju/unit/tests/test_lifecycle.py (+121/-50)
juju/unit/tests/test_workflow.py (+435/-227)
juju/unit/workflow.py (+68/-59)
To merge this branch: bzr merge lp://staging/~fwereade/pyjuju/fix-charm-upgrade
Reviewer Review Type Date Requested Status
Kapil Thangavelu (community) Approve
Jim Baker (community) Approve
Review via email: mp+85271@code.staging.launchpad.net

Description of the change

I'm pretty sure that charm upgrades will now:

* fail silently, as before, on early errors (when no changes have been made to any state apart from the upgrade flag)
* error out on early errors if recovering from an earlier failed operation (the *whole* *thing* needs to complete successfully...)
* induce charm_upgrade_error workflow states, when the workflow comes up in a "started" state but midway through a charm upgrade
* do the Right Thing re fire_hooks, which is to *ignore* fire_hooks in any invocation in which the charm is actually written to disk; in this case, it is vital to *always* fire the upgraded-charm hook, because we guarantee it's the *first* one fired after the operation completes.
* actually overwrite old charms, instead of just unpacking into the same directory (and thus leaving droppings).

To post a comment you must log in.
Revision history for this message
Jim Baker (jimbaker) wrote :

+1, looks good to me. The only thing:

$ pep8 juju/unit
juju/unit/tests/test_lifecycle.py:830:1: E303 too many blank lines (3)
juju/unit/tests/test_workflow.py:619:80: E501 line too long (80 characters)

review: Approve
Revision history for this message
Kapil Thangavelu (hazmat) wrote :

[0]

+ def catch_up(self):
+ # There's nothing we can wait on to determine when all the consequences
+ # of the cb_watch_upgrade_flag have come to pass; this seems to be a
+ # "reliable workaround"...
+ for _ in range(10):
+ yield self.poke_zk()

eek, sleep and pray is not kosher ;-). tests need a definitive observation point, either using the existing wait_on_state helper or create a wait on log message helper, to track the done ness of cb watch upgrade.

[1]

charm directories can contain charm state, ie. their not read only. So they can't be overwritten wholesale. they'd need a manifest on install to delta compute the upgrade for files to delete, i'd go ahead and punt on this one for now, it can be handled separately.

review: Needs Fixing
Revision history for this message
William Reade (fwereade) wrote :

[0]

Replaced with a wait_for_log method.

[1]

Fixed.

446. By William Reade

merge parent

447. By William Reade

merge parent

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

[2]

+ returnValue(state_dict.get("state"))

this seems suspect, we have some state data, but its invalid, and we transparently assume its a new/starting workflow .. what if this was a previously initialized system, we'd be transparently resetting it without warning.

[3]

    @property
    def _upgrade_flag(self):
        return "/units/%s/upgrade" % self._internal_id

for clarity, should be _path suffixed.

[4]

There don't appear to be any tests for charm upgrade prepare failing in the upgrade charm lifecycle method, but irrelevant given the following.

[5]

as discussed in person, let's remove the first_attempt logic, any error in the upgrade transition should result in an error state.

448. By William Reade

added upgrade_charm_ready and charm_replaced states, along with suitable transitions and error states

Revision history for this message
William Reade (fwereade) wrote :

[2]

Hmm, very sensible, not sure why I did that :-/.

[3]

Sounds good, done.

[4/5]

OK, following discussion, I've broken the upgrade process into 3 basic transitions (plus associated error/retry) ones: "begin_charm_upgrade", "replace_charm", "finish_charm_upgrade", with 2 new (non-error) states: "charm_upgrade_ready" and "charm_replaced".

There's no more first_attempt malarkey, BUT the "begin_charm_upgrade" transition will raise an error if there is no new charm available, hence aborting the upgrade operation (and keeping the workflow in state "started"); and thereby ensuring that any upgrade operation that makes it to the "charm_upgrade_ready" state represents a real upgrade, and must therefore pass through "replace_charm" and "finish_charm_upgrade" before it returns to the "started" state.

449. By William Reade

address review point

450. By William Reade

address final review point

451. By William Reade

merge parent

452. By William Reade

merge parent

Revision history for this message
William Reade (fwereade) wrote :

Forces in play/further justifications for chosen approach:

1) The use of a state variable for the critical section was criticised as being a bit surprising/magical, so it seemed better to store states as, well, states.
2) first_attempt was also criticised as too complex, so I dropped the silent-early-failure mode (ie once we know an upgrade is a sensible thing to do, we're locked into it and can't escape).
3) I was reluctant to move too much logic into the top-level callback for the upgrade flag watch. It is true that putting the check of service-charm-id/unit-charm-id in there would indeed eliminate the first state transition, but it seemed cleanest to just try the transition and allow the state machinery to handle backing out in response to the lifecycle's complaints.
4) Splitting upgrade-charm and run-hook into two distinct transitions is definitely a win, because (i) the possible errors in the two states are very different, and it's good to distinguish them; (ii) breaking the process into a pipeline like this eliminated further complexity in the original UL.upgrade_charm method (we handle all state tracking with explicit state machine states
5) The repeated check for service/unit charm ids in UL.replace_charm is not *necessary*, but it seemed helpful to avoid redownloading and reinstalling exactly the same bits. We could drop it without doing anything any actual *harm*.

Summary: we agreed on having 2 transitions to handle the process of upgrading; farming the initial can-we-actually-upgrade check out to a third new transition, which fails back out to "started", keeps responsibility for state-tracking in the workflow where, IMO, it should be anyway. I don't consider the additional workflow states to represent serious complexity: the only other state it interacts with is "started".

Aside: rereading IRC, I may have missed something re: reentrancy concerns. Each of the transition methods on UL is, AFAICT, safe for restart midway through; and the workflow is responsible for ensuring we only run them at sensible times.

Does any of this help at all?

Revision history for this message
Kapil Thangavelu (hazmat) wrote :
Download full text (4.2 KiB)

Excerpts from William Reade's message of Tue Jan 31 09:11:25 UTC 2012:
> Forces in play/further justifications for chosen approach:
>
> 1) The use of a state variable for the critical section was criticised as being a bit surprising/magical, so it seemed better to store states as, well, states.

Indeed, it does, but as gustavo pointed out in irc, its really for lack of a WAL
transition (either disk or zk) that its needed at all, and a better fit to model
it as such than as additional state/transitions. Its main purpose is to
signal the transition that's currently in play to allow for recovery and
re-execution of that transition instead of merely entering the last recorded
state.

> 2) first_attempt was also criticised as too complex, so I dropped the silent-early-failure mode (ie once we know an upgrade is a sensible thing to do, we're locked into it and can't escape).

yeah.. the distinction between first_attempt was ever checked by the tests.

> 3) I was reluctant to move too much logic into the top-level callback for the upgrade flag watch. It is true that putting the check of service-charm-id/unit-charm-id in there would indeed eliminate the first state transition, but it seemed cleanest to just try the transition and allow the state machinery to handle backing out in response to the lifecycle's complaints.

i think the refactoring thats been done to subsume some of the callback
into the transition is good. Capturing the entire operation into the
workflow machinery has some benefits wrt to status reporting/recording failures.

> 4) Splitting upgrade-charm and run-hook into two distinct transitions is definitely a win, because (i) the possible errors in the two states are very different, and it's good to distinguish them; (ii) breaking the process into a pipeline like this eliminated further complexity in the original UL.upgrade_charm method (we handle all state tracking with explicit state machine states

The question is the recovery harmed by rexecuting the whole transition? Ie.
what's the value from an error recovery point of view of the separate states.

> 5) The repeated check for service/unit charm ids in UL.replace_charm is not *necessary*, but it seemed helpful to avoid redownloading and reinstalling exactly the same bits. We could drop it without doing anything any actual *harm*.
>

I don't think feel this is really a big deal, given its simplicity. There's an
open bug out for creating a formula cache that could alleviate the redownload.

> Summary: we agreed on having 2 transitions to handle the process of upgrading; farming the initial can-we-actually-upgrade check out to a third new transition, which fails back out to "started", keeps responsibility for state-tracking in the workflow where, IMO, it should be anyway. I don't consider the additional workflow states to represent serious complexity: the only other state it interacts with is "started".
>

Yeah.. i've back-tracked on this looking over the implementation and given
given gustavo's WAL insight. It feels like the additional states and
transitions are just adding complexity esp while juggling the executor. Its also
doesn't seems like a benefit to a user h...

Read more...

453. By William Reade

manually back out r448 without screwing up future merges, I hope

454. By William Reade

remove UL.upgrade_charm's first_attempt kwarg

Revision history for this message
William Reade (fwereade) wrote :

Still prefer the old way, but done now :).

455. By William Reade

merge parent

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

[6]

As we discussed over g+, this is looking good The main thing is that the callback/deferred passing should be removed in favor of having lib/statemachine.py do saving/clearing of transitions in progress against state variables. This will make generic recovery of any inflight transition and perhaps special casing in synchronize.

It should also obviate the need for _notify_upgrading, and upgrade_charm's cb_upgrading.

    @inlineCallbacks
    def _notify_upgrading(self, is_upgrading):

[7] just a comment, looking at upgrade_charm's implementation.

The distinction between upgrade.prepare, upgrade.run, and upgrade.ready looks a little superficial. The upgrade.prepare should work or raise an error. On a retry of an upgrade-charm error state we should always ..
This also has the interesting side effect of firing hooks on juju resolved if we're just now downloading the charm successfully, but that seems correct.

456. By William Reade

WorkflowState now (1) has a synchronize method that re-runs inflight

transitions after restart, and (2) requires external locking on state-
changing methods.

(1) is a required change due to the verdict on fix-charm-upgrade (that is:
we don't want additional states for the upgrade process, and we should use
write-ahead logging to maintain state across process death); (2) is
consequently required because of the interaction between the upgrade flag
watch and the resolved watch, both of which take `state == "started"` to
mean "it's safe to fire a transition". This was a pre-existing bug, but was
IMO exacerbated severely by the fact that we now never leave the "started"
state while upgrading a charm -- enough so that I'm not comfortable
proposing a patch that fails to address this issue.

This diff is *big*, but the vast majority of the changes are as a result of
the WorkflowState locking; in particular, many many tests set workflow state
in one way or another, and the necessary changes add a *lot* of noise. In
hindsight, the locking changes could have been made independently, but I
don't think the resulting pair of branches would actually be significantly
easier to deal with.

Various explanatory notes follow:

* External WorkflowState locking (rather than automagic internal locking) was
  chosen for simplicity's sake on the part of the client as well: when one
  to fire a transition conditional on some state being active, it makes
  sense to grab the lock, check, and fire the transition, in the certain
  knowledge that nobody can have changed state underneath you.

* The obvious problem with DeferredLock (that you can tell it's locked, but
  not who owns the lock) has minimal impact, thanks to the unit tests.
  Consider the following scenarios:

  * Code tries to lock, test not holding lock: we're fine.
  * Code tries to lock, test is holding lock: bad test; deadlocks.
  * Code fails to lock, test not holding lock: bad code: asserts.
  * Code fails to lock, test is holding lock: bad code AND bad test.

  The final scenario is the only dangerous one, but it's just a special
  case of the fact that you can *always* write a bad test that passes bad
  code; I think attempting to solve this is out of scope for this universe.

* WorkflowState.fire_transition sets the inflight transition once it knows
  the requested transition is valid, and only clears it explicitly if the
  transition fails without an error transition to follow up (when a
  transition succeeds, set_state implicitly clears inflight).

* WorkflowState.synchronize (1) detects and re-runs inflight transitions and
  (2) has taken over responsibility for the automatic transitions (eg
  None->installed->started) which had previously been handled in UWS/RWS.
  The overlapping concept of success_transition_id has been dropped;
  transitions can now be marked as "automatic".

* WorkflowState.set_inflight really wants to be private, but is exposed for
  testing's sake; get_inflight needs to be exposed so that UWS.synchronize
  doesn't inappropriately start the executor when recovering from mid-
  upgrade process death.

* Er...

* That's it.

R=
CC=
https://codereview.appspot.com/5647064

457. By William Reade

break up synchronize test to avoid occasional timeout during full test runs

Revision history for this message
William Reade (fwereade) wrote :

OK, state-machine-sync has been merged in; are we good to go?

458. By William Reade

merge parent

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

woohoo! make the magic happen :-)

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
The diff is not available at this time. You can reload the page or download it.

Subscribers

People subscribed via source and target branches

to status/vote changes: