Couple of typos and some vague rambling thoughts; I'll try to write them
up in some more detail for the list.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode41
source/drafts/stopping-units.rst:41: its killed to properly terminate.
This example can be extended to
s/its/it's/

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode59
source/drafts/stopping-units.rst:59: against all the relevant entitites.
[See Topology Internals]
Please define "relevant entities". When stopping a service, for example,
do you mean we should mark all the following:
- all units of that service
- all unit relations of that service
- all machines running only that service
- the service itself
?
I think it'd be easier to think of things cascading a bit more -- so
individual entities would detect their own impending death by watching
for changes to their immediate parents' statuses... but I may have to
think it through a little more.
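
To make the cascading idea a bit more concrete, here's roughly the shape
of thing I'm picturing for a unit agent -- purely a hypothetical sketch
(kazoo used for brevity rather than our actual txzookeeper code, and the
service path and "dying" flag are invented for illustration):

    # Hypothetical sketch: a unit agent notices its own impending death
    # by watching its parent service node, instead of being marked
    # individually by whoever initiated the stop.
    import json
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="localhost:2181")
    zk.start()

    SERVICE_NODE = "/services/service-0000000001"  # invented path

    def begin_unit_teardown():
        # run stop hooks, depart relations, then let the MA clean up
        pass

    @zk.DataWatch(SERVICE_NODE)
    def watch_service(data, stat):
        # Fires on every change to the parent service node; if the
        # service has been marked as dying, start our own teardown.
        if not data:
            return False  # node gone entirely; stop watching
        if json.loads(data).get("dying"):
            begin_unit_teardown()
            return False  # no further events needed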

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode70
source/drafts/stopping-units.rst:70: - unit agent disables its upstart
file
Not sure about this; can't we just always write agents with a "normal
exit 0" stanza and depend on that?
Otherwise, if we get a "weird" failure at this precise point, we won't
get an opportunity to start up again and tidy ourselves up nicely; and
besides it seems more correct to me to let the machine agent *always*
tidy up the unit agent upstart files it originally created (which it
will have to do anyway if it times out...).

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode72
source/drafts/stopping-units.rst:72: - unit agent process exists
s/exists/exits with a code indicating to upstart that it should not be
restarted/
(?)
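
For concreteness, the kind of stanza I mean is roughly the following --
a hand-rolled example, not what the machine agent actually writes today,
and the exec line is just a placeholder:

    # /etc/init/juju-wordpress-0.conf -- illustrative only
    description "juju unit agent for wordpress/0"
    start on runlevel [2345]
    stop on runlevel [!2345]
    respawn
    # With respawn, "normal exit 0" tells upstart that a clean exit
    # status of 0 is normal termination, so the agent is *not*
    # respawned once it has finished its stop sequence.
    normal exit 0
    exec /usr/lib/juju/run-unit-agent wordpress/0  # placeholder command

Any non-zero exit (or a crash) would still be respawned, which is the
behaviour we want for "weird" failures.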

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode74
source/drafts/stopping-units.rst:74: - machine agent watch fires on
'stop' node deletion, shutsdown the units container.
s/shutsdown/shuts down/
But regardless, I think we need to consider this in a little more detail
-- we're not guaranteed to have a container, after all, and we probably
shouldn't be killing whole containers if we're just shutting down a
subordinate charm, for example.
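
In other words, something along these lines in the MA's stop handling
(all names here are hypothetical, just to show the distinction I mean):

    # Hypothetical sketch only: the machine agent reacting to a stopped
    # unit should distinguish "remove this one unit" from "tear down
    # the whole container", rather than always killing the container.
    def on_unit_stopped(machine_agent, unit):
        if unit.is_subordinate() or unit.container is None:
            # Subordinate charm, or a unit deployed straight onto the
            # machine: there's no container of its own to shut down,
            # so just clean up the unit agent.
            machine_agent.remove_unit_agent(unit)
        else:
            # Principal unit with a dedicated container: safe to shut
            # the container down along with the agent.
            machine_agent.shutdown_container(unit.container)
            machine_agent.remove_unit_agent(unit)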

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode78
source/drafts/stopping-units.rst:78: node contents updated to reflect
this.
If the PA kills a machine, it will then be responsible for cleaning up
the MA's nodes *and* any unit nodes? This doesn't seem quite right...
sounds like a job for the GC really, because -- I *think* -- it's going
to have to deal with unit cleanup as a consequence of forcible machine
termination. (If the MA doesn't handle a stop request properly, it
probably didn't stop its units either.)

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode89
source/drafts/stopping-units.rst:89: and garbage collects their topology
footprint and zk state.
I think this needs a bit more detail; see below.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode100
source/drafts/stopping-units.rst:100: internal data structures.
Sounds sensible. I'm wondering whether it would be worth extending the
general idea to unit relations, replacing the existing mechanism; in
combination with a GC node, for example, it could be useful for units to
hold a don't-delete-yet lock on other participants in their relations
(until the point at which the unit itself has processed their removal).
(I think we do need something like this: a relation hook can
theoretically execute with an arbitrarily old members list, and the only
reason this isn't currently a problem is because we don't delete the old
members' settings nodes, so the relation hook commands can continue to
cheerfully extract data from dead unit relations. (I think, anyway: the
details have grown hazy. Please correct me if I need it ;)))
But then... if a UA holding a "lock" on another unit relation's settings
gets forcibly killed, we need a way to detect that *that* lock is no
longer relevant... it's not insoluble, sure, but it could get fiddly,
and whatever we do I think we'll need a semi-smart GC that can figure
out these sorts of issues, and I think it's worth speccing out the
details of the GC in a bit more depth.
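
To sketch what I mean by the lock-plus-GC combination (the node layout
and names below are invented, just to show the shape of it, not a
proposal for the actual structure):

    # Invented layout, for illustration only:
    #   /relations/<rel>/settings/<unit>    -- settings node to collect
    #   /gc/locks/<rel>/<unit>/<holder>     -- "don't delete yet" locks
    #   /units/<holder>/presence            -- ephemeral liveness node
    def settings_collectable(zk, rel, unit):
        """Can the GC delete this dead unit relation's settings node?"""
        lock_dir = "/gc/locks/%s/%s" % (rel, unit)
        holders = zk.get_children(lock_dir) if zk.exists(lock_dir) else []
        for holder in holders:
            if zk.exists("/units/%s/presence" % holder):
                # A live unit agent still wants to read these settings
                # (it hasn't finished processing the departure yet).
                return False
            # The holder's presence node is gone: that UA was killed,
            # so its lock is stale and shouldn't block collection.
            zk.delete("%s/%s" % (lock_dir, holder))
        return True

The fiddly part is exactly the stale-lock case: the GC has to be able to
tell "lock held by a live UA" from "lock left behind by a dead one",
which is why I think its behaviour deserves a spec of its own.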

https://codereview.appspot.com/5847053/