Nice to see this moving forward. Some comments:


https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode4
source/drafts/stopping-units.rst:4: -------------
Sorry for nitpicking, but can we please stick to a consistent convention
for headers? I believe that's what we have today:

Top level: ===
Inner level: ---

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode30
source/drafts/stopping-units.rst:30: while still allowing for
termination of errant children.
This is a very dense paragraph. Can it be simplified to something like:

"""
Juju records the desired state in the topology, and preserves that state
while the actions to accomplish it are still taking place.
"""

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode32
source/drafts/stopping-units.rst:32: How it works, what follows is a a
worked example going through a unit
"How it works, what follows"?

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode58
source/drafts/stopping-units.rst:58: actions are recorded in the
topology as a key 'action' value 'destroy'
The "action: destroy" terminology feels a bit out of place here. This is
pretty much a procedure call, but the topology holds state instead.
Something like "destroyed: true" would be more appropriate. It remains
true even after the action has taken place.
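
To make the distinction concrete, here is a minimal sketch. It assumes the topology can be modelled as a plain dict; the helper names are hypothetical and not juju's actual topology API. The point is that a state flag is idempotent and stays valid after the work is done, whereas an "action" reads like a call that should be consumed.

```python
# Hypothetical model: the topology is a plain dict here, not juju's
# real topology class. Names below are illustrative only.

def mark_unit_destroyed(topology, unit_name):
    """Record the desired state of the unit. Safe to call repeatedly."""
    topology["units"][unit_name]["destroyed"] = True
    return topology


def is_unit_destroyed(topology, unit_name):
    """Report the recorded state; units without the flag are not destroyed."""
    return topology["units"][unit_name].get("destroyed", False)


topology = {"units": {"unit-x": {"machine": "machine-0"}}}

# Setting the state twice has the same effect as setting it once, and
# the flag remains true after the unit has actually been torn down.
mark_unit_destroyed(topology, "unit-x")
mark_unit_destroyed(topology, "unit-x")
```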

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode65
source/drafts/stopping-units.rst:65: - Machine writes to
/units/unit-x/stop / sets watch on /units/unit-x/stop and timer.
See below regarding the watch side of it.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode69
source/drafts/stopping-units.rst:69: - executing stop hook.
This should be done after departing from relations.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode72
source/drafts/stopping-units.rst:72: - unit agent updates
/units/unit-x/stop to reflect the action has been performed.
That step doesn't seem necessary. We can reuse the existing liveness
mechanism that is already supported by the agent. When the agent dies,
the machine will know it.

It also seems like a nice idea to have the stop ZooKeeper node existing
for as long as that's the intended state of the unit. We need that node
on startup for watching it anyway.
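
A rough sketch of the decision the machine agent would make under this scheme. It assumes ZooKeeper state is reduced to a set of existing node paths, and that the unit agent's liveness shows up as an ephemeral node; the "/agent" suffix is a made-up path for illustration, not the real layout.

```python
# Assumption: existing_nodes is a set of ZooKeeper paths that currently
# exist. "/agent" stands in for the unit agent's ephemeral liveness
# node; the real path used by juju may differ.

def unit_can_be_removed(existing_nodes, unit_path):
    """True when a stop was requested and the unit agent has gone away."""
    stop_requested = (unit_path + "/stop") in existing_nodes
    agent_alive = (unit_path + "/agent") in existing_nodes
    return stop_requested and not agent_alive


nodes = {"/units/unit-x", "/units/unit-x/stop", "/units/unit-x/agent"}
still_running = unit_can_be_removed(nodes, "/units/unit-x")

# The agent exits, so its ephemeral node disappears. The stop node is
# left in place because it still describes the intended state.
nodes.discard("/units/unit-x/agent")
after_agent_exit = unit_can_be_removed(nodes, "/units/unit-x")
```

No extra write to the stop node is needed: the machine agent only combines the persistent stop node with the liveness signal it already has.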

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode75
source/drafts/stopping-units.rst:75: - machine agent watch fires on
'stop' node deletion, shuts down the unit, removing
This also touches on the point above.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode94
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
What about garbage collecting on the next action after the decision that
it's fine to GC it? I don't think we need a specific process that has
the task of GCing the topology.

In both cases, though, we have an issue: when is it ok to GC it? This
proposal is not providing any means for deciding on that, which means
the data becomes eternal. That hole indicates postponing that side of
the problem may not be a good idea. We can do something simple, but it'd
be good to implement cleaning up at the same time.
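
As a sketch of what "GC on the next action" could look like, assuming the topology is a plain dict and using the "destroyed" flag suggested earlier. The is_collectable predicate is exactly the open question above, so it is passed in as a parameter; the "stopped" field in the usage is invented purely for the example.

```python
# Hypothetical sketch: every topology change also sweeps units that are
# marked destroyed and that the caller-supplied rule says are safe to
# drop. No dedicated GC process is involved.

def apply_change(topology, change, is_collectable):
    """Apply a change, then collect destroyed units that pass the rule."""
    change(topology)
    collectable = [
        name for name, unit in topology["units"].items()
        if unit.get("destroyed") and is_collectable(name, unit)
    ]
    for name in collectable:
        del topology["units"][name]
    return topology


topology = {"units": {
    "unit-x": {"destroyed": True, "stopped": True},
    "unit-y": {"destroyed": True, "stopped": False},
    "unit-z": {},
}}


def add_unit(topo):
    # Stand-in for whatever the "next action" happens to be.
    topo["units"]["unit-w"] = {}


# "stopped" is an invented field; the real criterion is still undecided.
apply_change(topology, add_unit, lambda name, unit: unit.get("stopped", False))
```

Here unit-x is collected as a side effect of adding unit-w, while unit-y stays until the rule allows it.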

https://codereview.appspot.com/5847053/