Merge into docs : unit-stop : Code : pyjuju

Reviewer	Review Type	Date Requested	Status
Mark Mims		2012-03-17	Pending
Review via email: mp+98035@code.staging.launchpad.net

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-03-17:

#

Reviewers: mp+98035_code.launchpad.net,

Message:
Please take a look.

Description:
A spec for how to cleanly stop agents.

Defines an additional state protocol for stopping services along with
topology
support for recording user actions.

https://code.launchpad.net/~hazmat/juju/unit-stop/+merge/98035

(do not edit description out of merge proposal)

Please review this at https://codereview.appspot.com/5847053/

Affected files:
A [revision details]
A source/drafts/stopping-units.rst

Reply

Revision history for this message

William Reade (fwereade) wrote on 2012-03-17:

#

Download full text (4.5 KiB)

couple of typos and some vague rambling thoughts; I'll try to write them
up in some more detail for the list.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode41
source/drafts/stopping-units.rst:41: its killed to properly terminate.
This example can be extended to
s/its/it's/

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode59
source/drafts/stopping-units.rst:59: against all the relevant entitites.
[See Topology Internals]
Please define "relevant entities". When stopping a service, for example,
do you mean we should mark all the following:
  - all units of that service
  - all unit relations of that service
  - all machines running only that service
  - the service itself
?

I think it'd be easier to think of things cascading a bit more -- so
individual entities detected their own impending death by watching for
changes to their immediate parents' statuses... but I may have to think
it through a little more.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode70
source/drafts/stopping-units.rst:70: - unit agent disables its upstart
file
Not sure about this; can't we just always write agents with a "normal
exit 0" stanza and depend on that?

Otherwise, if we get a "weird" failure at this precise point, we won't
get an opportunity to start up again and tidy ourselves up nicely; and
besides it seems more correct to me to let the machine agent *always*
tidy up the unit agent upstart files it originally created (which it
will have to do anyway if it times out...).

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode72
source/drafts/stopping-units.rst:72: - unit agent process exists
s/exists/exits with a code indicating to upstart that it should not be
restarted/

(?)

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode74
source/drafts/stopping-units.rst:74: - machine agent watch fires on
'stop' node deletion, shutsdown the units container.
s/shutsdown/shuts down/

But regardless, I think we need to consider this in a little more detail
-- we're not guaranteed to have a container, after all, and we probably
shouldn't be killing whole containers if we're just shutting down a
subordinate charm, for example.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode78
source/drafts/stopping-units.rst:78: node contents updated to reflect
this.
If the PA kills a machine, it will then be responsible for cleaning up
the MA's nodes *and* any unit nodes? This doesn't seem quite right...
sounds like a job for the GC really (because -- I *think* -- it's going
to have to deal with unit cleanup as a consequence of forcible machine
termination. (If the MA doesn't handle a stop request properly, probably
it didn't stop its units either.).

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode89
source/drafts/stopping-units.rst:89: and garbage collects their topology
footprint and zk state.
I think t...

couple of typos and some vague rambling thoughts; I'll try to write them
up in some more detail for the list.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode41
source/drafts/stopping-units.rst:41: its killed to properly terminate.
This example can be extended to
s/its/it's/

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode59
source/drafts/stopping-units.rst:59: against all the relevant entitites.
[See Topology Internals]
Please define "relevant entities". When stopping a service, for example,
do you mean we should mark all the following:
  - all units of that service
  - all unit relations of that service
  - all machines running only that service
  - the service itself
?

I think it'd be easier to think of things cascading a bit more -- so
individual entities detected their own impending death by watching for
changes to their immediate parents' statuses... but I may have to think
it through a little more.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode70
source/drafts/stopping-units.rst:70: - unit agent disables its upstart
file
Not sure about this; can't we just always write agents with a "normal
exit 0" stanza and depend on that?

Otherwise, if we get a "weird" failure at this precise point, we won't
get an opportunity to start up again and tidy ourselves up nicely; and
besides it seems more correct to me to let the machine agent *always*
tidy up the unit agent upstart files it originally created (which it
will have to do anyway if it times out...).

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode72
source/drafts/stopping-units.rst:72: - unit agent process exists
s/exists/exits with a code indicating to upstart that it should not be
restarted/

(?)

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode74
source/drafts/stopping-units.rst:74: - machine agent watch fires on
'stop' node deletion, shutsdown the units container.
s/shutsdown/shuts down/

But regardless, I think we need to consider this in a little more detail
-- we're not guaranteed to have a container, after all, and we probably
shouldn't be killing whole containers if we're just shutting down a
subordinate charm, for example.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode78
source/drafts/stopping-units.rst:78: node contents updated to reflect
this.
If the PA kills a machine, it will then be responsible for cleaning up
the MA's nodes *and* any unit nodes? This doesn't seem quite right...
sounds like a job for the GC really (because -- I *think* -- it's going
to have to deal with unit cleanup as a consequence of forcible machine
termination. (If the MA doesn't handle a stop request properly, probably
it didn't stop its units either.).

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode89
source/drafts/stopping-units.rst:89: and garbage collects their topology
footprint and zk state.
I think this needs a bit more detail; see below.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode100
source/drafts/stopping-units.rst:100: internal data structures.
Sounds sensible. I'm wondering if it would be sensible to extend the
general idea to unit relations, replacing the existing mechanism; in
combination with a gc node, for example, it could be useful for units to
hold a don't-delete-yet lock on other participants in their relations
(until the point at which the unit itself has processed their removal).

(I think we do need something like this: a relation hook can
theoretically execute with an arbitrarily old members list, and the only
reason this isn't currently a problem is because we don't delete the old
members' settings nodes, so the relation hook commands can continue to
cheerfully extract data from dead unit relations. (I think, anyway: the
details have grown hazy. Please correct me if I need it ;)))

But then... if a UA holding a "lock" on another unit relation's settings
gets forcibly killed, we need a way to detect that *that* lock is no
longer relevant... it's not insoluble, sure, but it could get fiddly,
and whatever we do I think we'll need a semi-smart GC that can figure
out these sorts of issues, and I think it's worth speccing out the
details of the GC in a bit more depth.

https://codereview.appspot.com/5847053/

Reply

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-03-21:

#

Download full text (5.4 KiB)

Thanks for the review!

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode59
source/drafts/stopping-units.rst:59: against all the relevant entitites.
[See Topology Internals]
On 2012/03/17 12:20:30, fwereade wrote:
> Please define "relevant entities". When stopping a service, for
example, do you
> mean we should mark all the following:
> - all units of that service
> - all unit relations of that service
> - all machines running only that service
> - the service itself
> ?

> I think it'd be easier to think of things cascading a bit more -- so
individual
> entities detected their own impending death by watching for changes to
their
> immediate parents' statuses... but I may have to think it through a
little more.

Good point, for a service it would set the appropriate value for all the
units. I like the notion of a children watching parents, but its not
clear its nesc., but ultimately its the assigned machine's
responsibility to shutdown the unit, and it seems just as well.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode70
source/drafts/stopping-units.rst:70: - unit agent disables its upstart
file
On 2012/03/17 12:20:30, fwereade wrote:
> Not sure about this; can't we just always write agents with a "normal
exit 0"
> stanza and depend on that?

> Otherwise, if we get a "weird" failure at this precise point, we won't
get an
> opportunity to start up again and tidy ourselves up nicely; and
besides it seems
> more correct to me to let the machine agent *always* tidy up the unit
agent
> upstart files it originally created (which it will have to do anyway
if it times
> out...).

That sounds good, although a wierd failure might just as well exit 0 ;-)
But yeah.. the parent should perform final cleanup, and i like the
upstart stanza approach.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode74
source/drafts/stopping-units.rst:74: - machine agent watch fires on
'stop' node deletion, shutsdown the units container.
On 2012/03/17 12:20:30, fwereade wrote:
> s/shutsdown/shuts down/

> But regardless, I think we need to consider this in a little more
detail --
> we're not guaranteed to have a container, after all, and we probably
shouldn't
> be killing whole containers if we're just shutting down a subordinate
charm, for
> example.

a subordinate is handled slightly differently, same protocol different
parent. a subordinate is cleaned up by its parent/primary unit. most of
the mechanics of container vs. not are already abstracted out behind
unit deploy factories.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode78
source/drafts/stopping-units.rst:78: node contents updated to reflect
this.
On 2012/03/17 12:20:30, fwereade wrote:
> If the PA kills a machine, it will then be responsible for cleaning up
the MA's
> nodes *and* any unit nodes? This doesn't seem quite right... sounds
like a job
> for the GC really (because -- I *think* -- it's going to have to deal
with unit
> cleanup...

Thanks for the review!

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode59
source/drafts/stopping-units.rst:59: against all the relevant entitites.
[See Topology Internals]
On 2012/03/17 12:20:30, fwereade wrote:
> Please define "relevant entities". When stopping a service, for
example, do you
> mean we should mark all the following:
>   - all units of that service
>   - all unit relations of that service
>   - all machines running only that service
>   - the service itself
> ?

> I think it'd be easier to think of things cascading a bit more -- so
individual
> entities detected their own impending death by watching for changes to
their
> immediate parents' statuses... but I may have to think it through a
little more.

Good point, for a service it would set the appropriate value for all the
units. I like the notion of a children watching parents, but its not
clear its nesc., but ultimately its the assigned machine's
responsibility to shutdown the unit, and it seems just as well.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode70
source/drafts/stopping-units.rst:70: - unit agent disables its upstart
file
On 2012/03/17 12:20:30, fwereade wrote:
> Not sure about this; can't we just always write agents with a "normal
exit 0"
> stanza and depend on that?

> Otherwise, if we get a "weird" failure at this precise point, we won't
get an
> opportunity to start up again and tidy ourselves up nicely; and
besides it seems
> more correct to me to let the machine agent *always* tidy up the unit
agent
> upstart files it originally created (which it will have to do anyway
if it times
> out...).

That sounds good, although a wierd failure might just as well exit 0 ;-)
But yeah.. the parent should perform final cleanup, and i like the
upstart stanza approach.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode74
source/drafts/stopping-units.rst:74: - machine agent watch fires on
'stop' node deletion, shutsdown the units container.
On 2012/03/17 12:20:30, fwereade wrote:
> s/shutsdown/shuts down/

> But regardless, I think we need to consider this in a little more
detail --
> we're not guaranteed to have a container, after all, and we probably
shouldn't
> be killing whole containers if we're just shutting down a subordinate
charm, for
> example.

a subordinate is handled slightly differently, same protocol different
parent. a subordinate is cleaned up by its parent/primary unit. most of
the mechanics of container vs. not are already abstracted out behind
unit deploy factories.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode78
source/drafts/stopping-units.rst:78: node contents updated to reflect
this.
On 2012/03/17 12:20:30, fwereade wrote:
> If the PA kills a machine, it will then be responsible for cleaning up
the MA's
> nodes *and* any unit nodes? This doesn't seem quite right... sounds
like a job
> for the GC really (because -- I *think* -- it's going to have to deal
with unit
> cleanup as a consequence of forcible machine termination. (If the MA
doesn't
> handle a stop request properly, probably it didn't stop its units
either.).

machines can't be terminated currently if they have units assigned. but
that does bring up the question of exactly when do the units become
unassigned from the machine. in that particularly case its not identity
that we're removing, and the parent supervisor is the machine agent is
already going to be dealing with shutting it down.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode100
source/drafts/stopping-units.rst:100: internal data structures.
On 2012/03/17 12:20:30, fwereade wrote:
> Sounds sensible. I'm wondering if it would be sensible to extend the
general
> idea to unit relations, replacing the existing mechanism; in
combination with a
> gc node, for example, it could be useful for units to hold a
don't-delete-yet
> lock on other participants in their relations (until the point at
which the unit
> itself has processed their removal).

> (I think we do need something like this: a relation hook can
theoretically
> execute with an arbitrarily old members list, and the only reason this
isn't
> currently a problem is because we don't delete the old members'
settings nodes,
> so the relation hook commands can continue to cheerfully extract data
from dead
> unit relations. (I think, anyway: the details have grown hazy. Please
correct me
> if I need it ;)))

> But then... if a UA holding a "lock" on another unit relation's
settings gets
> forcibly killed, we need a way to detect that *that* lock is no longer
> relevant... it's not insoluble, sure, but it could get fiddly, and
whatever we
> do I think we'll need a semi-smart GC that can figure out these sorts
of issues,
> and I think it's worth speccing out the details of the GC in a bit
more depth.

re extension, I think we'll have more interesting mechanisms for
multi-node workflows. Its possible to do here, but it feels orthogonal
to stop problem. I will note that clint was interested in this sort of
feature a few months ago.

per our discussion on irc, gc can be fairly conservative with relations,
only cleaning it out after all endpoints service units that participated
in the relation are gone.

http://codereview.appspot.com/5847053/

Reply

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-03-21:

#

Please take a look.

https://codereview.appspot.com/5847053/

Reply

Revision history for this message

Gustavo Niemeyer (niemeyer) wrote on 2012-03-21:

#

Download full text (3.5 KiB)

Nice to see this moving forward. Some comments:

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode4
source/drafts/stopping-units.rst:4: -------------
Sorry for nitpicking, but can we please stick to a consistent convention
for headers? I believe that's what we have today:

Top level: ===
Inner level: ---

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode30
source/drafts/stopping-units.rst:30: while still allowing for
termination of errant children.
This is a very dense paragraph. Can it be simplified to something like:

"""
Juju records the desired state in the topology, and preserves that state
while the actions to accomplish it are still taking place.
"""

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode32
source/drafts/stopping-units.rst:32: How it works, what follows is a a
worked example going through a unit
"How it works, what follows"?

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode58
source/drafts/stopping-units.rst:58: actions are recorded in the
topology as a key 'action' value 'destroy'
The "action: destroy" terminology feels a bit out of place here. This is
pretty much a procedure call, but the topology holds state instead.
Something like "destroyed: true" would be more appropriate. It remains
true even after the action has taken place.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode65
source/drafts/stopping-units.rst:65: - Machine writes to
/units/unit-x/stop / sets watch on /units/unit-x/stop and timer.
See below regarding the watch side of it.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode69
source/drafts/stopping-units.rst:69: - executing stop hook.
This should be done after departing from relations.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode72
source/drafts/stopping-units.rst:72: - unit agent updates
/units/unit-x/stop to reflect the action has been performed.
That step doesn't seem necessary. We can reuse the existing liveness
mechanism that is already supported by the agent. When the agent dies,
the machine will know it.

It also seems like a nice idea to have the stop ZooKeeper node existing
for as long as that's the intended state of the unit. We have to need
that node on startup for watching it anyway.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode75
source/drafts/stopping-units.rst:75: - machine agent watch fires on
'stop' node deletion, shuts down the unit, removing
Also touches the above point.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode94
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
What about garbage collecting on the next action after the decision that
it's fine to GC it? I don't think we need a specific process that has
the task of GCing the topology.
...

Nice to see this moving forward. Some comments:

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode4
source/drafts/stopping-units.rst:4: -------------
Sorry for nitpicking, but can we please stick to a consistent convention
for headers? I believe that's what we have today:

Top level: ===
Inner level: ---

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode30
source/drafts/stopping-units.rst:30: while still allowing for
termination of errant children.
This is a very dense paragraph. Can it be simplified to something like:

"""
Juju records the desired state in the topology, and preserves that state
while the actions to accomplish it are still taking place.
"""

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode32
source/drafts/stopping-units.rst:32: How it works, what follows is a a
worked example going through a unit
"How it works, what follows"?

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode58
source/drafts/stopping-units.rst:58: actions are recorded in the
topology as a key 'action' value 'destroy'
The "action: destroy" terminology feels a bit out of place here. This is
pretty much a procedure call, but the topology holds state instead.
Something like "destroyed: true" would be more appropriate. It remains
true even after the action has taken place.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode65
source/drafts/stopping-units.rst:65: - Machine writes to
/units/unit-x/stop / sets watch on /units/unit-x/stop and timer.
See below regarding the watch side of it.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode69
source/drafts/stopping-units.rst:69: - executing stop hook.
This should be done after departing from relations.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode72
source/drafts/stopping-units.rst:72: - unit agent updates
/units/unit-x/stop to reflect the action has been performed.
That step doesn't seem necessary. We can reuse the existing liveness
mechanism that is already supported by the agent. When the agent dies,
the machine will know it.

It also seems like a nice idea to have the stop ZooKeeper node existing
for as long as that's the intended state of the unit. We have to need
that node on startup for watching it anyway.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode75
source/drafts/stopping-units.rst:75: - machine agent watch fires on
'stop' node deletion, shuts down the unit, removing
Also touches the above point.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode94
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
What about garbage collecting on the next action after the decision that
it's fine to GC it? I don't think we need a specific process that has
the task of GCing the topology.

In both cases, though, we have an issue: when is it ok to GC it? This
proposal is not providing any means for deciding on that, which means
the data becomes eternal. That hole indicates postponing that side of
the problem may not be a good idea. We can do something simple, but it'd
be good implement cleaning up at the same time.

https://codereview.appspot.com/5847053/

Reply

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-03-26:

#

Download full text (4.6 KiB)

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode4
source/drafts/stopping-units.rst:4: -------------
On 2012/03/21 20:07:53, niemeyer wrote:
> Sorry for nitpicking, but can we please stick to a consistent
convention for
> headers? I believe that's what we have today:

> Top level: ===
> Inner level: ---

done.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode30
source/drafts/stopping-units.rst:30: while still allowing for
termination of errant children.
On 2012/03/21 20:07:53, niemeyer wrote:
> This is a very dense paragraph. Can it be simplified to something
like:

> """
> Juju records the desired state in the topology, and preserves that
state while
> the actions to accomplish it are still taking place.
> """

sounds good.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode32
source/drafts/stopping-units.rst:32: How it works, what follows is a a
worked example going through a unit
On 2012/03/21 20:07:53, niemeyer wrote:
> "How it works, what follows"?

removed

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode58
source/drafts/stopping-units.rst:58: actions are recorded in the
topology as a key 'action' value 'destroy'
On 2012/03/21 20:07:53, niemeyer wrote:
> The "action: destroy" terminology feels a bit out of place here. This
is pretty
> much a procedure call, but the topology holds state instead. Something
like
> "destroyed: true" would be more appropriate. It remains true even
after the
> action has taken place.

action here was a really a name for a coordinated state transition, ie.
an implied final state. recording the state instead of the activity
sounds good. i have a minor preference for a generic key over per state
keys, as i can't think of any multi-valued concurrent activities/states
here, but not guessing via constraints in the face of ambiguity does
seem to be the wiser choice.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode69
source/drafts/stopping-units.rst:69: - executing stop hook.
On 2012/03/21 20:07:53, niemeyer wrote:
> This should be done after departing from relations.
definitely

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode72
source/drafts/stopping-units.rst:72: - unit agent updates
/units/unit-x/stop to reflect the action has been performed.
On 2012/03/21 20:07:53, niemeyer wrote:
> That step doesn't seem necessary. We can reuse the existing liveness
mechanism
> that is already supported by the agent. When the agent dies, the
machine will
> know it.

> It also seems like a nice idea to have the stop ZooKeeper node
existing for as
> long as that's the intended state of the unit. We have to need that
node on
> startup for watching it anyway.

i wanted to avoid presence node usage, due its known problems with
session expiration. ie if an agent has a restart in the middle of the
stop process, the machine agent watch may erroneously think the agent...

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode4
source/drafts/stopping-units.rst:4: -------------
On 2012/03/21 20:07:53, niemeyer wrote:
> Sorry for nitpicking, but can we please stick to a consistent
convention for
> headers? I believe that's what we have today:

> Top level: ===
> Inner level: ---

done.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode30
source/drafts/stopping-units.rst:30: while still allowing for
termination of errant children.
On 2012/03/21 20:07:53, niemeyer wrote:
> This is a very dense paragraph. Can it be simplified to something
like:

> """
> Juju records the desired state in the topology, and preserves that
state while
> the actions to accomplish it are still taking place.
> """

sounds good.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode32
source/drafts/stopping-units.rst:32: How it works, what follows is a a
worked example going through a unit
On 2012/03/21 20:07:53, niemeyer wrote:
> "How it works, what follows"?

removed

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode58
source/drafts/stopping-units.rst:58: actions are recorded in the
topology as a key 'action' value 'destroy'
On 2012/03/21 20:07:53, niemeyer wrote:
> The "action: destroy" terminology feels a bit out of place here. This
is pretty
> much a procedure call, but the topology holds state instead. Something
like
> "destroyed: true" would be more appropriate. It remains true even
after the
> action has taken place.

action here was a really a name for a coordinated state transition, ie.
an implied final state. recording the state instead of the activity
sounds good. i have a minor preference for a generic key over per state
keys, as i can't think of any multi-valued concurrent activities/states
here, but not guessing via constraints in the face of ambiguity does
seem to be the wiser choice.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode69
source/drafts/stopping-units.rst:69: - executing stop hook.
On 2012/03/21 20:07:53, niemeyer wrote:
> This should be done after departing from relations.
definitely

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode72
source/drafts/stopping-units.rst:72: - unit agent updates
/units/unit-x/stop to reflect the action has been performed.
On 2012/03/21 20:07:53, niemeyer wrote:
> That step doesn't seem necessary. We can reuse the existing liveness
mechanism
> that is already supported by the agent. When the agent dies, the
machine will
> know it.

> It also seems like a nice idea to have the stop ZooKeeper node
existing for as
> long as that's the intended state of the unit. We have to need that
node on
> startup for watching it anyway.

i wanted to avoid presence node usage, due its known problems with
session expiration. ie if an agent has a restart in the middle of the
stop process, the machine agent watch may erroneously think the agent is
up. The stop node explicitness feels more robust in this context, and as
you've noted its nesc for the coordination aspect anyways. Its also nice
to clean up the local coordination structure when the coordination is
done.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode94
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
On 2012/03/21 20:07:53, niemeyer wrote:
> What about garbage collecting on the next action after the decision
that it's
> fine to GC it? I don't think we need a specific process that has the
task of
> GCing the topology.

> In both cases, though, we have an issue: when is it ok to GC it? This
proposal
> is not providing any means for deciding on that, which means the data
becomes
> eternal. That hole indicates postponing that side of the problem may
not be a
> good idea. We can do something simple, but it'd be good implement
cleaning up at
> the same time.

There is some cost associated to doing gc every user action, its n
roundtrips for the client to check. I agree it is simpler for user-gc,
but there are other forms of garbage that could be opportunistically
collected with an agent (ie. old /relations, non topology referenced
service nodes) Its not clear that garbage will be a problem in practice
though outside of the topology, so i'm fine with the user initiated gc
confined to topology intents.

https://codereview.appspot.com/5847053/

Reply

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-03-26:

#

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode94
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
On 2012/03/26 00:42:07, hazmat wrote:
> On 2012/03/21 20:07:53, niemeyer wrote:
> > What about garbage collecting on the next action after the decision
that it's
> > fine to GC it? I don't think we need a specific process that has the
task of
> > GCing the topology.
> >
> > In both cases, though, we have an issue: when is it ok to GC it?
This proposal
> > is not providing any means for deciding on that, which means the
data becomes
> > eternal. That hole indicates postponing that side of the problem may
not be a
> > good idea. We can do something simple, but it'd be good implement
cleaning up
> at
> > the same time.

> There is some cost associated to doing gc every user action, its n
roundtrips
> for the client to check. I agree it is simpler for user-gc, but there
are other
> forms of garbage that could be opportunistically collected with an
agent (ie.
> old /relations, non topology referenced service nodes) Its not clear
that
> garbage will be a problem in practice though outside of the topology,
so i'm
> fine with the user initiated gc confined to topology intents.

i'm still a bit concerned about the lag here. Take a service with 10
units, if we destroy the service, we end up recording destroy against
the 10 units for their assigned machine coordination and the service.

I suppose that goes away if we just record the action once against the
service, and have machine agents watching all their assigned service
unit's service in the topology.

In either case, for the next user interaction, we have several
roundtrips queued for the next user interaction one per unit to verify
their completion/destruction. It feels like instilling an unpredictable
lag, esp. as the activity is an ongoing one that may take several
minutes (executing the stop hook) and thus require gc checking for the
next several user commands. And in a multi-user usage, the user who
finally deleted the object from the topology would be random, which also
feels questionable from an audit perspective.

Given that, i'd prefer the dedicated gc agent.

https://codereview.appspot.com/5847053/

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode94
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
On 2012/03/26 00:42:07, hazmat wrote:
> On 2012/03/21 20:07:53, niemeyer wrote:
> > What about garbage collecting on the next action after the decision
that it's
> > fine to GC it? I don't think we need a specific process that has the
task of
> > GCing the topology.
> >
> > In both cases, though, we have an issue: when is it ok to GC it?
This proposal
> > is not providing any means for deciding on that, which means the
data becomes
> > eternal. That hole indicates postponing that side of the problem may
not be a
> > good idea. We can do something simple, but it'd be good implement
cleaning up
> at
> > the same time.

> There is some cost associated to doing gc every user action, its n
roundtrips
> for the client to check. I agree it is simpler for user-gc, but there
are other
> forms of garbage that could be opportunistically collected with an
agent (ie.
> old /relations, non topology referenced service nodes) Its not clear
that
> garbage will be a problem in practice though outside of the topology,
so i'm
> fine with the user initiated gc confined to topology intents.

i'm still a bit concerned about the lag here. Take a service with 10
units, if we destroy the service, we end up recording destroy against
the 10 units for their assigned machine coordination and the service.

I suppose that goes away if we just record the action once against the
service, and have machine agents watching all their assigned service
unit's service in the topology.

In either case, for the next user interaction, we have several
roundtrips queued for the next user interaction one per unit to verify
their completion/destruction. It feels like instilling an unpredictable
lag, esp. as the activity is an ongoing one that may take several
minutes (executing the stop hook) and thus require gc checking for the
next several user commands. And in a multi-user usage, the user who
finally deleted the object from the topology would be random, which also
feels questionable from an audit perspective.

Given that, i'd prefer the dedicated gc agent.

https://codereview.appspot.com/5847053/

Reply

Revision history for this message

Kapil Thangavelu (hazmat) wrote on 2012-03-26:

#

Please take a look.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode94
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
On 2012/03/21 20:07:53, niemeyer wrote:
> What about garbage collecting on the next action after the decision
that it's
> fine to GC it? I don't think we need a specific process that has the
task of
> GCing the topology.

> In both cases, though, we have an issue: when is it ok to GC it? This
proposal
> is not providing any means for deciding on that, which means the data
becomes
> eternal. That hole indicates postponing that side of the problem may
not be a
> good idea. We can do something simple, but it'd be good implement
cleaning up at
> the same time.

wrt to when is it OK to GC it, i've added an additional section to the
document regarding when its okay.

https://codereview.appspot.com/5847053/

Reply

lp://staging/~hazmat/pyjuju/unit-stop updated on 2012-03-26

6. By Kapil Thangavelu on 2012-03-26: restore stop node creation to supervisory agent, too many races otherwise, the topo is the source of truth

Revision history for this message

Gustavo Niemeyer (niemeyer) wrote on 2012-03-26:

#

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode75
source/drafts/stopping-units.rst:75: - machine agent watch fires on
'stop' node deletion, shuts down the unit, removing
On 2012/03/21 20:07:53, niemeyer wrote:
> Also touches the above point.

This one is still pending.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode94
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
On 2012/03/26 00:59:22, hazmat wrote:
> i'm still a bit concerned about the lag here.
(...)
> Given that, i'd prefer the dedicated gc agent.

Sounds good, but let's please not swipe the problem under the carpet. If
we'll have a dedicated GC, let's implement it upfront at the same time
we implement the garbage itself. It may be an additional command on
juju-admin, for example, that is called periodically on the zookeeper
machine.

While implementing GC, we may figure details about the implementation
that might be best done differently, so we shouldn't be tying ourselves
to an implementation and postponing the decision on how to GC to a
latter point.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst#newcode50
source/drafts/stopping-units.rst:50: are recorded in the topology as a
'destroy' key with a True value
Thanks. Please just s/destroy/destroyed/, otherwise we would still have
an action to be carried rather than a state.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst#newcode83
source/drafts/stopping-units.rst:83: Principal actions via the client
will trigger gc activities against
New and unfinished sentence.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst#newcode113
source/drafts/stopping-units.rst:113: ``condition``: no presence and no
stop node.
Should be "no presence node and existent stop node."

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst#newcode128
source/drafts/stopping-units.rst:128: ``condition``: no presence and no
stop node.
Ditto.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst#newcode136
source/drafts/stopping-units.rst:136: agent/domain state ( machine,
unit, service), its best to treat them
s/its/it's/

https://codereview.appspot.com/5847053/

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode75
source/drafts/stopping-units.rst:75: - machine agent watch fires on
'stop' node deletion, shuts down the unit, removing
On 2012/03/21 20:07:53, niemeyer wrote:
> Also touches the above point.

This one is still pending.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode94
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
On 2012/03/26 00:59:22, hazmat wrote:
> i'm still a bit concerned about the lag here.
(...)
> Given that, i'd prefer the dedicated gc agent.

Sounds good, but let's please not swipe the problem under the carpet. If
we'll have a dedicated GC, let's implement it upfront at the same time
we implement the garbage itself. It may be an additional command on
juju-admin, for example, that is called periodically on the zookeeper
machine.

While implementing GC, we may figure details about the implementation
that might be best done differently, so we shouldn't be tying ourselves
to an implementation and postponing the decision on how to GC to a
latter point.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst#newcode50
source/drafts/stopping-units.rst:50: are recorded in the topology as a
'destroy' key with a True value
Thanks. Please just s/destroy/destroyed/, otherwise we would still have
an action to be carried rather than a state.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst#newcode83
source/drafts/stopping-units.rst:83: Principal actions via the client
will trigger gc activities against
New and unfinished sentence.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst#newcode113
source/drafts/stopping-units.rst:113: ``condition``: no presence and no
stop node.
Should be "no presence node and existent stop node."

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst#newcode128
source/drafts/stopping-units.rst:128: ``condition``: no presence and no
stop node.
Ditto.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units.rst#newcode136
source/drafts/stopping-units.rst:136: agent/domain state ( machine,
unit, service), its best to treat them
s/its/it's/

https://codereview.appspot.com/5847053/

Reply

pyjuju

Merge lp://staging/~hazmat/pyjuju/unit-stop into lp://staging/~juju/pyjuju/docs

Commit message

Description of the change

Unmerged revisions

Preview Diff

Subscribers