Code review comment for lp://staging/~hazmat/pyjuju/unit-stop

Revision history for this message
William Reade (fwereade) wrote :

couple of typos and some vague rambling thoughts; I'll try to write them
up in some more detail for the list.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode41
source/drafts/stopping-units.rst:41: its killed to properly terminate.
This example can be extended to
s/its/it's/

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode59
source/drafts/stopping-units.rst:59: against all the relevant entitites.
[See Topology Internals]
Please define "relevant entities". When stopping a service, for example,
do you mean we should mark all the following:
  - all units of that service
  - all unit relations of that service
  - all machines running only that service
  - the service itself
?

I think it'd be easier to think of things cascading a bit more -- so
individual entities detected their own impending death by watching for
changes to their immediate parents' statuses... but I may have to think
it through a little more.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode70
source/drafts/stopping-units.rst:70: - unit agent disables its upstart
file
Not sure about this; can't we just always write agents with a "normal
exit 0" stanza and depend on that?

Otherwise, if we get a "weird" failure at this precise point, we won't
get an opportunity to start up again and tidy ourselves up nicely; and
besides it seems more correct to me to let the machine agent *always*
tidy up the unit agent upstart files it originally created (which it
will have to do anyway if it times out...).

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode72
source/drafts/stopping-units.rst:72: - unit agent process exists
s/exists/exits with a code indicating to upstart that it should not be
restarted/

(?)

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode74
source/drafts/stopping-units.rst:74: - machine agent watch fires on
'stop' node deletion, shutsdown the units container.
s/shutsdown/shuts down/

But regardless, I think we need to consider this in a little more detail
-- we're not guaranteed to have a container, after all, and we probably
shouldn't be killing whole containers if we're just shutting down a
subordinate charm, for example.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode78
source/drafts/stopping-units.rst:78: node contents updated to reflect
this.
If the PA kills a machine, it will then be responsible for cleaning up
the MA's nodes *and* any unit nodes? This doesn't seem quite right...
sounds like a job for the GC really (because -- I *think* -- it's going
to have to deal with unit cleanup as a consequence of forcible machine
termination. (If the MA doesn't handle a stop request properly, probably
it didn't stop its units either.).

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode89
source/drafts/stopping-units.rst:89: and garbage collects their topology
footprint and zk state.
I think this needs a bit more detail; see below.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode100
source/drafts/stopping-units.rst:100: internal data structures.
Sounds sensible. I'm wondering if it would be sensible to extend the
general idea to unit relations, replacing the existing mechanism; in
combination with a gc node, for example, it could be useful for units to
hold a don't-delete-yet lock on other participants in their relations
(until the point at which the unit itself has processed their removal).

(I think we do need something like this: a relation hook can
theoretically execute with an arbitrarily old members list, and the only
reason this isn't currently a problem is because we don't delete the old
members' settings nodes, so the relation hook commands can continue to
cheerfully extract data from dead unit relations. (I think, anyway: the
details have grown hazy. Please correct me if I need it ;)))

But then... if a UA holding a "lock" on another unit relation's settings
gets forcibly killed, we need a way to detect that *that* lock is no
longer relevant... it's not insoluble, sure, but it could get fiddly,
and whatever we do I think we'll need a semi-smart GC that can figure
out these sorts of issues, and I think it's worth speccing out the
details of the GC in a bit more depth.

https://codereview.appspot.com/5847053/

« Back to merge proposal