Thanks for the review!

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode59
source/drafts/stopping-units.rst:59: against all the relevant entitites. [See Topology Internals]
On 2012/03/17 12:20:30, fwereade wrote:
> Please define "relevant entities". When stopping a service, for example, do you
> mean we should mark all the following:
> - all units of that service
> - all unit relations of that service
> - all machines running only that service
> - the service itself
> ?
> I think it'd be easier to think of things cascading a bit more -- so individual
> entities detected their own impending death by watching for changes to their
> immediate parents' statuses... but I may have to think it through a little more.

Good point; for a service it would set the appropriate value for all of its units. I like the notion of children watching parents, but it's not clear it's necessary: ultimately it's the assigned machine's responsibility to shut down the unit, and marking the units directly seems to work just as well. (There's a rough sketch of the marking step after the next comment below.)

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode70
source/drafts/stopping-units.rst:70: - unit agent disables its upstart file
On 2012/03/17 12:20:30, fwereade wrote:
> Not sure about this; can't we just always write agents with a "normal exit 0"
> stanza and depend on that?
> Otherwise, if we get a "weird" failure at this precise point, we won't get an
> opportunity to start up again and tidy ourselves up nicely; and besides it seems
> more correct to me to let the machine agent *always* tidy up the unit agent
> upstart files it originally created (which it will have to do anyway if it times
> out...).

That sounds good, although a weird failure might just as well exit 0 ;-) But yeah, the parent should perform final cleanup, and I like the upstart stanza approach; the second sketch below shows what I picture the generated job looking like.
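Roughly what I have in mind for the marking step. This is a sketch only: I'm assuming the topology stays a YAML mapping under /topology, inventing a per-entity "stop" flag, and using a synchronous kazoo client for brevity rather than the Twisted-based client in the tree; the "services"/"units" keys are illustrative, not the real schema.

    import yaml
    from kazoo.exceptions import BadVersionError

    def mark_service_stopping(zk, service_id):
        """Flag a service and all of its units for stopping in /topology.

        `zk` is a connected kazoo client (illustrative stand-in).
        """
        while True:
            data, stat = zk.get("/topology")
            topology = yaml.safe_load(data)
            topology["services"][service_id]["stop"] = True
            for unit_id in topology["services"][service_id].get("units", ()):
                # Mark every unit directly; the assigned machine agent
                # watches for the flag and shuts the unit down.
                topology["units"][unit_id]["stop"] = True
            try:
                zk.set("/topology", yaml.safe_dump(topology).encode(),
                       version=stat.version)
                return
            except BadVersionError:
                # Another writer changed the topology first; re-read and retry.
                continue

The read-modify-set loop keyed on the node version is just the usual guard against concurrent topology writers.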
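And for the upstart job itself, something in this direction. Again illustrative only: the interesting part is the respawn / "normal exit 0" pair; the job name, paths, and exec line are invented here, not taken from the machine agent.

    # Template the machine agent might render when deploying a unit.
    UPSTART_TEMPLATE = """\
    description "juju unit agent for %(unit_name)s"
    start on runlevel [2345]
    stop on runlevel [!2345]
    # Respawn on unexpected death, but treat a deliberate exit 0 as final,
    # so the agent can stop itself cleanly; removing this file remains the
    # machine agent's job.
    respawn
    normal exit 0
    exec /usr/bin/python -m juju.agents.unit --unit-name %(unit_name)s
    """

    def write_unit_agent_job(unit_name):
        path = "/etc/init/juju-%s.conf" % unit_name.replace("/", "-")
        with open(path, "w") as f:
            f.write(UPSTART_TEMPLATE % {"unit_name": unit_name})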
http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode74
source/drafts/stopping-units.rst:74: - machine agent watch fires on 'stop' node deletion, shutsdown the units container.
On 2012/03/17 12:20:30, fwereade wrote:
> s/shutsdown/shuts down/
> But regardless, I think we need to consider this in a little more detail --
> we're not guaranteed to have a container, after all, and we probably shouldn't
> be killing whole containers if we're just shutting down a subordinate charm, for
> example.

A subordinate is handled slightly differently: same protocol, different parent. A subordinate is cleaned up by its parent/primary unit. Most of the mechanics of container vs. not are already abstracted out behind the unit deploy factories.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode78
source/drafts/stopping-units.rst:78: node contents updated to reflect this.
On 2012/03/17 12:20:30, fwereade wrote:
> If the PA kills a machine, it will then be responsible for cleaning up the MA's
> nodes *and* any unit nodes? This doesn't seem quite right... sounds like a job
> for the GC really (because -- I *think* -- it's going to have to deal with unit
> cleanup as a consequence of forcible machine termination. (If the MA doesn't
> handle a stop request properly, probably it didn't stop its units either.).

Machines can't currently be terminated if they have units assigned, but that does bring up the question of exactly when the units become unassigned from the machine. In that particular case it's not identity that we're removing, and the parent supervisor, the machine agent, is already going to be dealing with shutting it down.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode100
source/drafts/stopping-units.rst:100: internal data structures.
On 2012/03/17 12:20:30, fwereade wrote:
> Sounds sensible. I'm wondering if it would be sensible to extend the general
> idea to unit relations, replacing the existing mechanism; in combination with a
> gc node, for example, it could be useful for units to hold a don't-delete-yet
> lock on other participants in their relations (until the point at which the unit
> itself has processed their removal).
> (I think we do need something like this: a relation hook can theoretically
> execute with an arbitrarily old members list, and the only reason this isn't
> currently a problem is because we don't delete the old members' settings nodes,
> so the relation hook commands can continue to cheerfully extract data from dead
> unit relations. (I think, anyway: the details have grown hazy. Please correct me
> if I need it ;)))
> But then... if a UA holding a "lock" on another unit relation's settings gets
> forcibly killed, we need a way to detect that *that* lock is no longer
> relevant... it's not insoluble, sure, but it could get fiddly, and whatever we
> do I think we'll need a semi-smart GC that can figure out these sorts of issues,
> and I think it's worth speccing out the details of the GC in a bit more depth.

Re the extension, I think we'll have more interesting mechanisms for multi-node workflows. It's possible to do here, but it feels orthogonal to the stop problem. I will note that Clint was interested in this sort of feature a few months ago. Per our discussion on IRC, the GC can be fairly conservative with relations, only cleaning one out after all of the endpoint service units that participated in the relation are gone (see the P.S. below for a sketch).

http://codereview.appspot.com/5847053/
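P.S. A rough sketch of that conservative collection rule; the accessors here are hypothetical stand-ins, not real state APIs:

    def collect_dead_relations(relations, unit_exists, delete_relation_state):
        """Collect a relation only when every past participant is gone.

        `relations` maps relation id -> the set of unit ids that ever
        participated; `unit_exists` says whether a unit is still in the
        topology; `delete_relation_state` removes the relation's nodes.
        """
        for relation_id, participants in relations.items():
            if any(unit_exists(unit_id) for unit_id in participants):
                # A surviving participant may still run hooks against an
                # old members list, so leave the settings nodes alone.
                continue
            delete_relation_state(relation_id)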