Excerpts from William Reade's message of Tue Jan 31 09:11:25 UTC 2012: > Forces in play/further justifications for chosen approach: > > 1) The use of a state variable for the critical section was criticised as being a bit surprising/magical, so it seemed better to store states as, well, states. Indeed, it does, but as gustavo pointed out in irc, its really for lack of a WAL transition (either disk or zk) that its needed at all, and a better fit to model it as such than as additional state/transitions. Its main purpose is to signal the transition that's currently in play to allow for recovery and re-execution of that transition instead of merely entering the last recorded state. > 2) first_attempt was also criticised as too complex, so I dropped the silent-early-failure mode (ie once we know an upgrade is a sensible thing to do, we're locked into it and can't escape). yeah.. the distinction between first_attempt was ever checked by the tests. > 3) I was reluctant to move too much logic into the top-level callback for the upgrade flag watch. It is true that putting the check of service-charm-id/unit-charm-id in there would indeed eliminate the first state transition, but it seemed cleanest to just try the transition and allow the state machinery to handle backing out in response to the lifecycle's complaints. i think the refactoring thats been done to subsume some of the callback into the transition is good. Capturing the entire operation into the workflow machinery has some benefits wrt to status reporting/recording failures. > 4) Splitting upgrade-charm and run-hook into two distinct transitions is definitely a win, because (i) the possible errors in the two states are very different, and it's good to distinguish them; (ii) breaking the process into a pipeline like this eliminated further complexity in the original UL.upgrade_charm method (we handle all state tracking with explicit state machine states The question is the recovery harmed by rexecuting the whole transition? Ie. what's the value from an error recovery point of view of the separate states. > 5) The repeated check for service/unit charm ids in UL.replace_charm is not *necessary*, but it seemed helpful to avoid redownloading and reinstalling exactly the same bits. We could drop it without doing anything any actual *harm*. > I don't think feel this is really a big deal, given its simplicity. There's an open bug out for creating a formula cache that could alleviate the redownload. > Summary: we agreed on having 2 transitions to handle the process of upgrading; farming the initial can-we-actually-upgrade check out to a third new transition, which fails back out to "started", keeps responsibility for state-tracking in the workflow where, IMO, it should be anyway. I don't consider the additional workflow states to represent serious complexity: the only other state it interacts with is "started". > Yeah.. i've back-tracked on this looking over the implementation and given given gustavo's WAL insight. It feels like the additional states and transitions are just adding complexity esp while juggling the executor. Its also doesn't seems like a benefit to a user having to understand the semantic differences between the various upgrade states. The previous version of the branch would work well afaics given a removal of the first_attempt logic. The removal of the started state variable could follow along with a generic transition WAL. > Aside: rereading IRC, I may have missed something re: reentrancy concerns. Each of the transition methods on UL is, AFAICT, safe for restart midway through; and the workflow is responsible for ensuring we only run them at sensible times. > > Does any of this help at all? there are three different lifecycle methods all coordinating around the executor global variable, and those lifecycle methods may be called by any of seven different upgrade transitions (across the four states). Those states would all need to appropriately reset the executor for recovery/startup. Given that combining those lifecycle methods into a single method/entry point doesn't deter from their functionality at all, It seems like a clear win to me for simplicity too just have the one lifecycle method, two transitions, and one state.