Merge into trunk : fix-charm-upgrade : Code : pyjuju

Reviewer	Date Requested	Status
Kapil Thangavelu (community)		Approve on 2012-02-21
Jim Baker (community)	2011-12-12	Approve on 2012-01-09
Review via email: mp+85271@code.staging.launchpad.net

Revision history for this message

Jim Baker (jimbaker) wrote on 2012-01-09:

#

+1, looks good to me. The only thing:

$ pep8 juju/unit
juju/unit/tests/test_lifecycle.py:830:1: E303 too many blank lines (3)
juju/unit/tests/test_workflow.py:619:80: E501 line too long (80 characters)

review: Approve

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-01-11:

#

[0]

+ def catch_up(self):
+ # There's nothing we can wait on to determine when all the consequences
+ # of the cb_watch_upgrade_flag have come to pass; this seems to be a
+ # "reliable workaround"...
+ for _ in range(10):
+ yield self.poke_zk()

eek, sleep and pray is not kosher ;-). tests need a definitive observation point, either using the existing wait_on_state helper or create a wait on log message helper, to track the done ness of cb watch upgrade.

[1]

charm directories can contain charm state, ie. their not read only. So they can't be overwritten wholesale. they'd need a manifest on install to delta compute the upgrade for files to delete, i'd go ahead and punt on this one for now, it can be handled separately.

review: Needs Fixing

Revision history for this message

William Reade (fwereade) wrote on 2012-01-11:

#

[0]

Replaced with a wait_for_log method.

[1]

Fixed.

lp://staging/~fwereade/pyjuju/fix-charm-upgrade updated on 2012-01-12

446. By William Reade on 2012-01-11: merge parent
447. By William Reade on 2012-01-12: merge parent

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-01-13:

#

[2]

+ returnValue(state_dict.get("state"))

this seems suspect, we have some state data, but its invalid, and we transparently assume its a new/starting workflow .. what if this was a previously initialized system, we'd be transparently resetting it without warning.

[3]

    @property
    def _upgrade_flag(self):
        return "/units/%s/upgrade" % self._internal_id

for clarity, should be _path suffixed.

[4]

There don't appear to be any tests for charm upgrade prepare failing in the upgrade charm lifecycle method, but irrelevant given the following.

[5]

as discussed in person, let's remove the first_attempt logic, any error in the upgrade transition should result in an error state.

lp://staging/~fwereade/pyjuju/fix-charm-upgrade updated on 2012-01-16

448. By William Reade on 2012-01-16: added upgrade_charm_ready and charm_replaced states, along with suitable transitions and error states

Revision history for this message

William Reade (fwereade) wrote on 2012-01-16:

#

[2]

Hmm, very sensible, not sure why I did that :-/.

[3]

Sounds good, done.

[4/5]

OK, following discussion, I've broken the upgrade process into 3 basic transitions (plus associated error/retry) ones: "begin_charm_upgrade", "replace_charm", "finish_charm_upgrade", with 2 new (non-error) states: "charm_upgrade_ready" and "charm_replaced".

There's no more first_attempt malarkey, BUT the "begin_charm_upgrade" transition will raise an error if there is no new charm available, hence aborting the upgrade operation (and keeping the workflow in state "started"); and thereby ensuring that any upgrade operation that makes it to the "charm_upgrade_ready" state represents a real upgrade, and must therefore pass through "replace_charm" and "finish_charm_upgrade" before it returns to the "started" state.

lp://staging/~fwereade/pyjuju/fix-charm-upgrade updated on 2012-01-30

449. By William Reade on 2012-01-16: address review point
450. By William Reade on 2012-01-16: address final review point
451. By William Reade on 2012-01-23: merge parent
452. By William Reade on 2012-01-30: merge parent

Revision history for this message

William Reade (fwereade) wrote on 2012-01-31:

#

Forces in play/further justifications for chosen approach:

1) The use of a state variable for the critical section was criticised as being a bit surprising/magical, so it seemed better to store states as, well, states.
2) first_attempt was also criticised as too complex, so I dropped the silent-early-failure mode (ie once we know an upgrade is a sensible thing to do, we're locked into it and can't escape).
3) I was reluctant to move too much logic into the top-level callback for the upgrade flag watch. It is true that putting the check of service-charm-id/unit-charm-id in there would indeed eliminate the first state transition, but it seemed cleanest to just try the transition and allow the state machinery to handle backing out in response to the lifecycle's complaints.
4) Splitting upgrade-charm and run-hook into two distinct transitions is definitely a win, because (i) the possible errors in the two states are very different, and it's good to distinguish them; (ii) breaking the process into a pipeline like this eliminated further complexity in the original UL.upgrade_charm method (we handle all state tracking with explicit state machine states
5) The repeated check for service/unit charm ids in UL.replace_charm is not *necessary*, but it seemed helpful to avoid redownloading and reinstalling exactly the same bits. We could drop it without doing anything any actual *harm*.

Summary: we agreed on having 2 transitions to handle the process of upgrading; farming the initial can-we-actually-upgrade check out to a third new transition, which fails back out to "started", keeps responsibility for state-tracking in the workflow where, IMO, it should be anyway. I don't consider the additional workflow states to represent serious complexity: the only other state it interacts with is "started".

Aside: rereading IRC, I may have missed something re: reentrancy concerns. Each of the transition methods on UL is, AFAICT, safe for restart midway through; and the workflow is responsible for ensuring we only run them at sensible times.

Does any of this help at all?

Forces in play/further justifications for chosen approach:

1) The use of a state variable for the critical section was criticised as being a bit surprising/magical, so it seemed better to store states as, well, states.
2) first_attempt was also criticised as too complex, so I dropped the silent-early-failure mode (ie once we know an upgrade is a sensible thing to do, we're locked into it and can't escape).
3) I was reluctant to move too much logic into the top-level callback for the upgrade flag watch. It is true that putting the check of service-charm-id/unit-charm-id in there would indeed eliminate the first state transition, but it seemed cleanest to just try the transition and allow the state machinery to handle backing out in response to the lifecycle's complaints.
4) Splitting upgrade-charm and run-hook into two distinct transitions is definitely a win, because (i) the possible errors in the two states are very different, and it's good to distinguish them; (ii) breaking the process into a pipeline like this eliminated further complexity in the original UL.upgrade_charm method (we handle all state tracking with explicit state machine states
5) The repeated check for service/unit charm ids in UL.replace_charm is not *necessary*, but it seemed helpful to avoid redownloading and reinstalling exactly the same bits. We could drop it without doing anything any actual *harm*.

Summary: we agreed on having 2 transitions to handle the process of upgrading; farming the initial can-we-actually-upgrade check out to a third new transition, which fails back out to "started", keeps responsibility for state-tracking in the workflow where, IMO, it should be anyway. I don't consider the additional workflow states to represent serious complexity: the only other state it interacts with is "started".

Aside: rereading IRC, I may have missed something re: reentrancy concerns. Each of the transition methods on UL is, AFAICT, safe for restart midway through; and the workflow is responsible for ensuring we only run them at sensible times.

Does any of this help at all?

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-01-31:

#

Download full text (4.2 KiB)

Excerpts from William Reade's message of Tue Jan 31 09:11:25 UTC 2012:
> Forces in play/further justifications for chosen approach:
>
> 1) The use of a state variable for the critical section was criticised as being a bit surprising/magical, so it seemed better to store states as, well, states.

Indeed, it does, but as gustavo pointed out in irc, its really for lack of a WAL
transition (either disk or zk) that its needed at all, and a better fit to model
it as such than as additional state/transitions. Its main purpose is to
signal the transition that's currently in play to allow for recovery and
re-execution of that transition instead of merely entering the last recorded
state.

> 2) first_attempt was also criticised as too complex, so I dropped the silent-early-failure mode (ie once we know an upgrade is a sensible thing to do, we're locked into it and can't escape).

yeah.. the distinction between first_attempt was ever checked by the tests.

> 3) I was reluctant to move too much logic into the top-level callback for the upgrade flag watch. It is true that putting the check of service-charm-id/unit-charm-id in there would indeed eliminate the first state transition, but it seemed cleanest to just try the transition and allow the state machinery to handle backing out in response to the lifecycle's complaints.

i think the refactoring thats been done to subsume some of the callback
into the transition is good. Capturing the entire operation into the
workflow machinery has some benefits wrt to status reporting/recording failures.

> 4) Splitting upgrade-charm and run-hook into two distinct transitions is definitely a win, because (i) the possible errors in the two states are very different, and it's good to distinguish them; (ii) breaking the process into a pipeline like this eliminated further complexity in the original UL.upgrade_charm method (we handle all state tracking with explicit state machine states

The question is the recovery harmed by rexecuting the whole transition? Ie.
what's the value from an error recovery point of view of the separate states.

> 5) The repeated check for service/unit charm ids in UL.replace_charm is not *necessary*, but it seemed helpful to avoid redownloading and reinstalling exactly the same bits. We could drop it without doing anything any actual *harm*.
>

I don't think feel this is really a big deal, given its simplicity. There's an
open bug out for creating a formula cache that could alleviate the redownload.

> Summary: we agreed on having 2 transitions to handle the process of upgrading; farming the initial can-we-actually-upgrade check out to a third new transition, which fails back out to "started", keeps responsibility for state-tracking in the workflow where, IMO, it should be anyway. I don't consider the additional workflow states to represent serious complexity: the only other state it interacts with is "started".
>

Yeah.. i've back-tracked on this looking over the implementation and given
given gustavo's WAL insight. It feels like the additional states and
transitions are just adding complexity esp while juggling the executor. Its also
doesn't seems like a benefit to a user h...

Excerpts from William Reade's message of Tue Jan 31 09:11:25 UTC 2012:
> Forces in play/further justifications for chosen approach:
> 
> 1) The use of a state variable for the critical section was criticised as being a bit surprising/magical, so it seemed better to store states as, well, states.

Indeed, it does, but as gustavo pointed out in irc, its really for lack of a WAL 
transition (either disk or zk) that its needed at all, and a better fit to model 
it as such than as additional state/transitions. Its main purpose is to 
signal the transition that's currently in play to allow for recovery and 
re-execution of that transition instead of merely entering the last recorded 
state.

> 2) first_attempt was also criticised as too complex, so I dropped the silent-early-failure mode (ie once we know an upgrade is a sensible thing to do, we're locked into it and can't escape).

yeah.. the distinction between first_attempt was ever checked by the tests.

> 3) I was reluctant to move too much logic into the top-level callback for the upgrade flag watch. It is true that putting the check of service-charm-id/unit-charm-id in there would indeed eliminate the first state transition, but it seemed cleanest to just try the transition and allow the state machinery to handle backing out in response to the lifecycle's complaints.

i think the refactoring thats been done to subsume some of the callback 
into the transition is good. Capturing the entire operation into the 
workflow machinery has some benefits wrt to status reporting/recording failures.

> 4) Splitting upgrade-charm and run-hook into two distinct transitions is definitely a win, because (i) the possible errors in the two states are very different, and it's good to distinguish them; (ii) breaking the process into a pipeline like this eliminated further complexity in the original UL.upgrade_charm method (we handle all state tracking with explicit state machine states

The question is the recovery harmed by rexecuting the whole transition? Ie. 
what's the value from an error recovery point of view of the separate states.

> 5) The repeated check for service/unit charm ids in UL.replace_charm is not *necessary*, but it seemed helpful to avoid redownloading and reinstalling exactly the same bits. We could drop it without doing anything any actual *harm*.
>

I don't think feel this is really a big deal, given its simplicity. There's an 
open bug out for creating a formula cache that could alleviate the redownload.

> Summary: we agreed on having 2 transitions to handle the process of upgrading; farming the initial can-we-actually-upgrade check out to a third new transition, which fails back out to "started", keeps responsibility for state-tracking in the workflow where, IMO, it should be anyway. I don't consider the additional workflow states to represent serious complexity: the only other state it interacts with is "started".
>

Yeah.. i've back-tracked on this looking over the implementation and given
given gustavo's WAL insight. It feels like the additional states and 
transitions are just adding complexity esp while juggling the executor. Its also 
doesn't seems like a  benefit to a user having to understand the semantic 
differences between the various upgrade states. The previous version of the 
branch would work well afaics given a removal of the first_attempt logic.

The removal of the started state variable could follow along with a generic 
transition WAL.

> Aside: rereading IRC, I may have missed something re: reentrancy concerns. Each of the transition methods on UL is, AFAICT, safe for restart midway through; and the workflow is responsible for ensuring we only run them at sensible times.
> 
> Does any of this help at all?

there are three different lifecycle methods all coordinating around the 
executor global variable, and those lifecycle methods may be called by 
any of seven different upgrade transitions (across the four states). Those 
states would all need to appropriately reset the executor for recovery/startup.

Given that combining those lifecycle methods into a single method/entry point 
doesn't deter from their functionality at all, It seems like a clear win to me 
for simplicity too just have the one lifecycle method, two transitions, and one 
state.

lp://staging/~fwereade/pyjuju/fix-charm-upgrade updated on 2012-02-02

453. By William Reade on 2012-02-02: manually back out r448 without screwing up future merges, I hope
454. By William Reade on 2012-02-02: remove UL.upgrade_charm's first_attempt kwarg

Revision history for this message

William Reade (fwereade) wrote on 2012-02-02:

#

Still prefer the old way, but done now :).

lp://staging/~fwereade/pyjuju/fix-charm-upgrade updated on 2012-02-08

455. By William Reade on 2012-02-08: merge parent

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-02-08:

#

[6]

As we discussed over g+, this is looking good The main thing is that the callback/deferred passing should be removed in favor of having lib/statemachine.py do saving/clearing of transitions in progress against state variables. This will make generic recovery of any inflight transition and perhaps special casing in synchronize.

It should also obviate the need for _notify_upgrading, and upgrade_charm's cb_upgrading.

@inlineCallbacks
def _notify_upgrading(self, is_upgrading):

[7] just a comment, looking at upgrade_charm's implementation.

The distinction between upgrade.prepare, upgrade.run, and upgrade.ready looks a little superficial. The upgrade.prepare should work or raise an error. On a retry of an upgrade-charm error state we should always ..
This also has the interesting side effect of firing hooks on juju resolved if we're just now downloading the charm successfully, but that seems correct.

lp://staging/~fwereade/pyjuju/fix-charm-upgrade updated on 2012-02-21

456. By William Reade on 2012-02-21

WorkflowState now (1) has a synchronize method that re-runs inflight

transitions after restart, and (2) requires external locking on state-
changing methods.

(1) is a required change due to the verdict on fix-charm-upgrade (that is:
we don't want additional states for the upgrade process, and we should use
write-ahead logging to maintain state across process death); (2) is
consequently required because of the interaction between the upgrade flag
watch and the resolved watch, both of which take `state == "started"` to
mean "it's safe to fire a transition". This was a pre-existing bug, but was
IMO exacerbated severely by the fact that we now never leave the "started"
state while upgrading a charm -- enough so that I'm not comfortable
proposing a patch that fails to address this issue.

This diff is *big*, but the vast majority of the changes are as a result of
the WorkflowState locking; in particular, many many tests set workflow state
in one way or another, and the necessary changes add a *lot* of noise. In
hindsight, the locking changes could have been made independently, but I
don't think the resulting pair of branches would actually be significantly
easier to deal with.

Various explanatory notes follow:

* External WorkflowState locking (rather than automagic internal locking) was
  chosen for simplicity's sake on the part of the client as well: when one
  to fire a transition conditional on some state being active, it makes
  sense to grab the lock, check, and fire the transition, in the certain
  knowledge that nobody can have changed state underneath you.

* The obvious problem with DeferredLock (that you can tell it's locked, but
not who owns the lock) has minimal impact, thanks to the unit tests.
Consider the following scenarios:

  * Code tries to lock, test not holding lock: we're fine.
  * Code tries to lock, test is holding lock: bad test; deadlocks.
  * Code fails to lock, test not holding lock: bad code: asserts.
  * Code fails to lock, test is holding lock: bad code AND bad test.

  The final scenario is the only dangerous one, but it's just a special
  case of the fact that you can *always* write a bad test that passes bad
  code; I think attempting to solve this is out of scope for this universe.

* WorkflowState.fire_transition sets the inflight transition once it knows
  the requested transition is valid, and only clears it explicitly if the
  transition fails without an error transition to follow up (when a
  transition succeeds, set_state implicitly clears inflight).

* WorkflowState.synchronize (1) detects and re-runs inflight transitions and
  (2) has taken over responsibility for the automatic transitions (eg
  None->installed->started) which had previously been handled in UWS/RWS.
  The overlapping concept of success_transition_id has been dropped;
  transitions can now be marked as "automatic".

* WorkflowState.set_inflight really wants to be private, but is exposed for
  testing's sake; get_inflight needs to be exposed so that UWS.synchronize
  doesn't inappropriately start the executor when recovering from mid-
  upgrade process death.

* Er...

* That's it.

R=
CC=
https://codereview.appspot.com/5647064

457. By William Reade on 2012-02-21

break up synchronize test to avoid occasional timeout during full test runs

Revision history for this message

William Reade (fwereade) wrote on 2012-02-21:

#

OK, state-machine-sync has been merged in; are we good to go?

lp://staging/~fwereade/pyjuju/fix-charm-upgrade updated on 2012-02-21

458. By William Reade on 2012-02-21: merge parent

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-02-21:

#

woohoo! make the magic happen :-)

review: Approve

pyjuju

Merge lp://staging/~fwereade/pyjuju/fix-charm-upgrade into lp://staging/pyjuju

Commit message

Description of the change

Preview Diff

Subscribers