couple of typos and some vague rambling thoughts; I'll try to write them up in some more detail for the list. https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst File source/drafts/stopping-units.rst (right): https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode41 source/drafts/stopping-units.rst:41: its killed to properly terminate. This example can be extended to s/its/it's/ https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode59 source/drafts/stopping-units.rst:59: against all the relevant entitites. [See Topology Internals] Please define "relevant entities". When stopping a service, for example, do you mean we should mark all the following: - all units of that service - all unit relations of that service - all machines running only that service - the service itself ? I think it'd be easier to think of things cascading a bit more -- so individual entities detected their own impending death by watching for changes to their immediate parents' statuses... but I may have to think it through a little more. https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode70 source/drafts/stopping-units.rst:70: - unit agent disables its upstart file Not sure about this; can't we just always write agents with a "normal exit 0" stanza and depend on that? Otherwise, if we get a "weird" failure at this precise point, we won't get an opportunity to start up again and tidy ourselves up nicely; and besides it seems more correct to me to let the machine agent *always* tidy up the unit agent upstart files it originally created (which it will have to do anyway if it times out...). https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode72 source/drafts/stopping-units.rst:72: - unit agent process exists s/exists/exits with a code indicating to upstart that it should not be restarted/ (?) https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode74 source/drafts/stopping-units.rst:74: - machine agent watch fires on 'stop' node deletion, shutsdown the units container. s/shutsdown/shuts down/ But regardless, I think we need to consider this in a little more detail -- we're not guaranteed to have a container, after all, and we probably shouldn't be killing whole containers if we're just shutting down a subordinate charm, for example. https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode78 source/drafts/stopping-units.rst:78: node contents updated to reflect this. If the PA kills a machine, it will then be responsible for cleaning up the MA's nodes *and* any unit nodes? This doesn't seem quite right... sounds like a job for the GC really (because -- I *think* -- it's going to have to deal with unit cleanup as a consequence of forcible machine termination. (If the MA doesn't handle a stop request properly, probably it didn't stop its units either.). https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode89 source/drafts/stopping-units.rst:89: and garbage collects their topology footprint and zk state. I think this needs a bit more detail; see below. https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode100 source/drafts/stopping-units.rst:100: internal data structures. Sounds sensible. I'm wondering if it would be sensible to extend the general idea to unit relations, replacing the existing mechanism; in combination with a gc node, for example, it could be useful for units to hold a don't-delete-yet lock on other participants in their relations (until the point at which the unit itself has processed their removal). (I think we do need something like this: a relation hook can theoretically execute with an arbitrarily old members list, and the only reason this isn't currently a problem is because we don't delete the old members' settings nodes, so the relation hook commands can continue to cheerfully extract data from dead unit relations. (I think, anyway: the details have grown hazy. Please correct me if I need it ;))) But then... if a UA holding a "lock" on another unit relation's settings gets forcibly killed, we need a way to detect that *that* lock is no longer relevant... it's not insoluble, sure, but it could get fiddly, and whatever we do I think we'll need a semi-smart GC that can figure out these sorts of issues, and I think it's worth speccing out the details of the GC in a bit more depth. https://codereview.appspot.com/5847053/