Merge lp://staging/~themue/juju-core/055-apiserver-heartbeat-analyzing into lp://staging/~go-bot/juju-core/trunk
Status: Merged
Approved by: Frank Mueller
Approved revision: no longer in the source branch
Merged at revision: 2085
Proposed branch: lp://staging/~themue/juju-core/055-apiserver-heartbeat-analyzing
Merge into: lp://staging/~go-bot/juju-core/trunk
Diff against target: 232 lines (+131/-14), 4 files modified
  state/apiserver/admin.go (+1/-1)
  state/apiserver/export_test.go (+2/-0)
  state/apiserver/root.go (+102/-13)
  state/apiserver/root_test.go (+26/-0)
To merge this branch: bzr merge lp://staging/~themue/juju-core/055-apiserver-heartbeat-analyzing
Related bugs: none listed
Reviewer: Juju Engineering (review requested, status: Pending)
Review via email: mp+194862@code.staging.launchpad.net
Commit message
apiserver: analyzing ping timeouts
This is a first fix for bug #1205451, handling externally
killed juju nodes. Currently the pings from clients to
the API server are not analyzed. With this change each
ping resets a timer that waits for 3 minutes; if the
timer is not reset (by a ping), its signal tells the
pinger to kill the connection.
The overall problem with the presence pinger, especially
with regard to HA environments, is not covered by this
change.
Description of the change
apiserver: analyzing ping timeouts
This is a first fix for bug #1205451, handling externally
killed juju nodes. Currently the pings from clients to
the API server are not analyzed. With this change each
ping resets a timer that waits for 3 minutes; if the
timer is not reset (by a ping), its signal tells the
pinger to kill the connection.
The overall problem with the presence pinger, especially
with regard to HA environments, is not covered by this
change.
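As a bare-bones illustration of the behaviour described above (all names here are illustrative and not the code in this branch), the server-side watcher amounts to a timer that is restarted on every ping and kills the connection when it fires:

    package sketch

    import "time"

    // watchPings calls kill unless a ping arrives at least once per timeout.
    func watchPings(pings <-chan struct{}, kill func(), timeout time.Duration) {
        timer := time.NewTimer(timeout)
        defer timer.Stop()
        for {
            select {
            case _, ok := <-pings:
                if !ok {
                    return // connection went away normally
                }
                if !timer.Stop() {
                    <-timer.C // drain a concurrent expiry before resetting
                }
                timer.Reset(timeout) // each ping restarts the countdown
            case <-timer.C:
                kill() // no ping within the timeout: drop the connection
                return
            }
        }
    }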
Frank Mueller (themue) wrote:
Roger Peppe (rogpeppe) wrote:
This is going in a reasonable direction, but I have a few comments to
make. It's a bit of a pity that this adds yet another goroutine per
connection, but I think that can be sorted out at some other time (we
could probably have one goroutine that deals with all the pinging). The
other thing that this CL does not do is actually close the client
connection. That can probably be left for a followup though.
https:/
File state/apiserver
https:/
state/apiserver
3*time.Minute)
This seems weird to me. For a start, it's racy - technically a client is
allowed to call Ping twice concurrently, and if it does that, we'll have
two srvPingers active. Secondly, it seems odd that if a client never
bothers to ping, they will never get one of these allocated, so we'll
never get the timeout behaviour. Thirdly, just deleting all the
resources without killing the connection seems unlikely to be worth it.
I think you need to set this up inside apiRootForEntity.
Additionally, I don't think you should return the srvPinger directly
from this method - that exposes its Stop method to the API, which you
probably don't want to do. I make a suggestion below.
One other thing: I think the actual value of the duration should perhaps
be tied to the value used by the client (perhaps the client should use
the value divided by 10 or something)
https:/
state/apiserver
s/interface{
https:/
state/apiserver
running killer.
This type isn't really a pinger - it doesn't do any pinging.
How about something like this?
type resourceTimeout struct {
    tomb     tomb.Tomb
    timer    *time.Timer
    reset    chan struct{}
    resource killer
}

// newResourceTimeout will call resource.Kill if
// Refresh is not repeatedly called on the returned resourceTimeout
// within the given timeout.
func newResourceTimeout(...) *resourceTimeout
https:/
state/apiserver
This needs to make sure that the pinger isn't being stopped
by selecting on the tomb's dying chan too.
https:/
state/apiserver
Why is this commented out?
https:/
state/apiserver
Why close the reset channel?
https:/
state/apiserver
I think you should use NewTimer here (and Stop it after the select) to
avoid creating a bit of garbage each time a value is received on reset.
Even better, just use a single Timer instance inside ...
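For what it is worth, the body behind that suggestion could look roughly like the sketch below. It assumes the resourceTimeout struct quoted above plus the tomb package juju used at the time; the loop and Refresh are my own filling-in of the idea (single reusable Timer, tomb-based shutdown), not code from the branch.

    // loop owns the single Timer instance. It is started from
    // newResourceTimeout with:
    //
    //     go func() {
    //         defer rt.tomb.Done()
    //         rt.tomb.Kill(rt.loop(timeout))
    //     }()
    func (rt *resourceTimeout) loop(timeout time.Duration) error {
        rt.timer = time.NewTimer(timeout)
        defer rt.timer.Stop()
        for {
            select {
            case <-rt.tomb.Dying():
                // Stop was requested: exit without killing the resource.
                return nil
            case <-rt.reset:
                // Refresh arrived in time: restart the countdown.
                if !rt.timer.Stop() {
                    <-rt.timer.C // drain a concurrent expiry
                }
                rt.timer.Reset(timeout)
            case <-rt.timer.C:
                // No Refresh within the timeout: kill the guarded resource.
                rt.resource.Kill()
                return nil
            }
        }
    }

    // Refresh restarts the countdown; once the loop has finished it
    // returns via the dying channel, so callers do not block forever.
    func (rt *resourceTimeout) Refresh() {
        select {
        case rt.reset <- struct{}{}:
        case <-rt.tomb.Dying():
        }
    }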
William Reade (fwereade) wrote:
Thanks rog -- just emphasising a couple of your points.
https:/
File state/apiserver
https:/
state/apiserver
3*time.Minute)
On 2013/11/12 16:01:21, rog wrote:
> This seems weird to me. For a start, it's racy - technically a client
> is allowed to call Ping twice concurrently, and if it does that, we'll
> have two srvPingers active. Secondly, it seems odd that if a client
> never bothers to ping, they will never get one of these allocated, so
> we'll never get the timeout behaviour. Thirdly, just deleting all the
> resources without killing the connection seems unlikely to be worth it.
> I think you need to set this up inside apiRootForEntity.
> Additionally, I don't think you should return the srvPinger directly
> from this method - that exposes its Stop method to the API, which you
> probably don't want to do. I make a suggestion below.
> One other thing: I think the actual value of the duration should
> perhaps be tied to the value used by the client (perhaps the client
> should use the value divided by 10 or something)
Strong agreement with all of this. Create one per connection, use the
Ping method purely for resetting the connection-killing timeout. And
please make a timeout actually kill the connection -- that's the point
of this branch, and I don't think it should be left to a later one.
https:/
state/apiserver
running killer.
On 2013/11/12 16:01:21, rog wrote:
> This type isn't really a pinger - it doesn't do any pinging.
> How about something like this?
>
> type resourceTimeout struct {
>     tomb     tomb.Tomb
>     timer    *time.Timer
>     reset    chan struct{}
>     resource killer
> }
>
> // newResourceTimeout will call resource.Kill if
> // Refresh is not repeatedly called on the returned resourceTimeout
> // within the given timeout.
> func newResourceTimeout(...) *resourceTimeout
FWIW, this is not (should not be) about killing resources -- it should
be about closing the connection. The logic for cleaning up resources
already exists, and is triggered by closing the connection.
Frank Mueller (themue) wrote:
Please take a look.
https:/
File state/apiserver
https:/
state/apiserver
3*time.Minute)
On 2013/11/12 16:01:21, rog wrote:
> This seems weird to me. For a start, it's racy - technically a client
> is allowed to call Ping twice concurrently, and if it does that, we'll
> have two srvPingers active. Secondly, it seems odd that if a client
> never bothers to ping, they will never get one of these allocated, so
> we'll never get the timeout behaviour. Thirdly, just deleting all the
> resources without killing the connection seems unlikely to be worth it.
> I think you need to set this up inside apiRootForEntity.
> Additionally, I don't think you should return the srvPinger directly
> from this method - that exposes its Stop method to the API, which you
> probably don't want to do. I make a suggestion below.
> One other thing: I think the actual value of the duration should
> perhaps be tied to the value used by the client (perhaps the client
> should use the value divided by 10 or something)
I found a different approach. Only the part about the duration still
needs to be discussed and may become part of the next round; I'm not
sure about it yet.
https:/
state/apiserver
On 2013/11/12 16:01:21, rog wrote:
> s/interface{
Done.
https:/
state/apiserver
running killer.
On 2013/11/12 16:01:21, rog wrote:
> This type isn't really a pinger - it doesn't do any pinging.
> How about something like this?
>
> type resourceTimeout struct {
>     tomb     tomb.Tomb
>     timer    *time.Timer
>     reset    chan struct{}
>     resource killer
> }
>
> // newResourceTimeout will call resource.Kill if
> // Refresh is not repeatedly called on the returned resourceTimeout
> // within the given timeout.
> func newResourceTimeout(...) *resourceTimeout
Yeah, looks better. Done.
https:/
state/apiserver
On 2013/11/12 16:01:21, rog wrote:
> This needs to make sure that the pinger isn't being stopped
> by selecting on the tomb's dying chan too.
Done.
https:/
state/apiserver
On 2013/11/12 16:01:21, rog wrote:
> Why is this commented out?
Argh, forgot after a test run. :/
https:/
state/apiserver
On 2013/11/12 16:01:21, rog wrote:
> Why close the reset channel?
Done.
https:/
state/apiserver
On 2013/11/12 16:01:21, rog wrote:
> I think you should use NewTimer here (and Stop it after th...
Roger Peppe (rogpeppe) wrote:
Not quite there yet. Some thoughts below.
https:/
File state/apiserver
https:/
state/apiserver
Rather than having initialRoot inside srvRoot, I'd prefer
to explicitly include the members we need (srv and rpcConn).
https:/
state/apiserver
newPingTimeout(
There's nothing that cleans this up when the connection is killed.
Also, it would be good to have the timeout in a named constant,
and connected somehow to the client ping logic.
https:/
state/apiserver
the client pings and
// Pinger returns an object that can be pinged
// by calling its Ping method. If this method
// is not called frequently enough, the connection
// will be dropped.
?
https:/
state/apiserver
Do we need the tag here?
https:/
state/apiserver
instance
// newPingTimeout returns a new pingTimeout instance
// that invokes the given action if there is more
// than the given timeout interval between calls
// to its Ping method.
?
https:/
state/apiserver
I *think* that if you do:
defer action()
here that you'll be able to register the pingTimeout
as a resource and have the action close the rpc connection.
alternatively, call
go pt.action()
below instead of calling it synchronously.
On balance I think I prefer the latter option.
I'd make action() not return an error, so the caller
can decide what they might want to do with any error.
https:/
state/apiserver
I don't think this is ever called.
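Pulling those points together, the pingTimeout under review could end up shaped roughly like this. It mirrors the resourceTimeout sketch further up, with the action fired asynchronously from the timer arm and taking no error return; every name here is an illustrative stand-in, not the exact code in the branch (which also registers the watcher so normal connection teardown stops it).

    import (
        "time"

        "launchpad.net/tomb"
    )

    // pingTimeoutDuration is the named constant asked for above; 3 minutes
    // matches the proposal's description.
    const pingTimeoutDuration = 3 * time.Minute

    // pingTimeout invokes action if Ping is not called at least once per
    // timeout interval.
    type pingTimeout struct {
        tomb    tomb.Tomb
        action  func()
        timeout time.Duration
        reset   chan struct{}
    }

    func newPingTimeout(action func(), timeout time.Duration) *pingTimeout {
        pt := &pingTimeout{
            action:  action,
            timeout: timeout,
            reset:   make(chan struct{}),
        }
        go func() {
            defer pt.tomb.Done()
            pt.tomb.Kill(pt.loop())
        }()
        return pt
    }

    // Ping restarts the countdown; this is all that is exposed to clients.
    func (pt *pingTimeout) Ping() {
        select {
        case pt.reset <- struct{}{}:
        case <-pt.tomb.Dying():
        }
    }

    // Stop ends the watcher without triggering the action; it stays
    // internal to the server rather than being exported via the API.
    func (pt *pingTimeout) Stop() error {
        pt.tomb.Kill(nil)
        return pt.tomb.Wait()
    }

    func (pt *pingTimeout) loop() error {
        timer := time.NewTimer(pt.timeout)
        defer timer.Stop()
        for {
            select {
            case <-pt.tomb.Dying():
                return nil
            case <-pt.reset:
                if !timer.Stop() {
                    <-timer.C // drain a concurrent expiry before resetting
                }
                timer.Reset(pt.timeout)
            case <-timer.C:
                // Fire asynchronously so this goroutine can exit at once;
                // the action (e.g. closing the rpc connection) deals with
                // any error itself.
                go pt.action()
                return nil
            }
        }
    }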
Frank Mueller (themue) wrote:
Please take a look.
https:/
File state/apiserver
https:/
state/apiserver
On 2013/11/18 18:19:48, rog wrote:
> Rather than having initialRoot inside srvRoot, I'd prefer
> to explicitly include the members we need (srv and rpcConn).
Done.
https:/
state/apiserver
newPingTimeout(
On 2013/11/18 18:19:48, rog wrote:
> There's nothing that cleans this up when the connection is killed.
> Also, it would be good to have the timeout in a named constant,
> and connected somehow to the client ping logic.
Yep, cleanup had been prepared but led to a deadlock in the first
approach. Will be changed.
Please describe "... connected somehow to the client ping logic" more.
Should different agents influence their ping times? Should this be
configurable or fixed by agent type? What is the benefit of this idea,
and is it worth the increased complexity?
https:/
state/apiserver
the client pings and
On 2013/11/18 18:19:48, rog wrote:
> // Pinger returns an object that can be pinged
> // by calling its Ping method. If this method
> // is not called frequently enough, the connection
> // will be dropped.
> ?
Done.
https:/
state/apiserver
On 2013/11/18 18:19:48, rog wrote:
> Do we need the tag here?
Removed, I only needed it during testing.
https:/
state/apiserver
instance
On 2013/11/18 18:19:48, rog wrote:
> // newPingTimeout returns a new pingTimeout instance
> // that invokes the given action if there is more
> // than the given timeout interval between calls
> // to its Ping method.
> ?
Done.
https:/
state/apiserver
On 2013/11/18 18:19:48, rog wrote:
> I *think* that if you do:
> defer action()
> here that you'll be able to register the pingTimeout
> as a resource and have the action close the rpc connection.
> alternatively, call
> go pt.action()
> below instead of calling it synchronously.
> On balance I think I prefer the latter option.
> I'd make action() not return an error, so the caller
> can decide what they might want to do with any error.
Done.
https:/
state/apiserver
On 2013/11/18 18:19:48, rog wrote:
> I don't think this is ever called.
Now it's done; it had already been intended for the timeout cleanup.
Roger Peppe (rogpeppe) wrote:
> Please describe "... connected somehow to the client ping logic" more.
> Should different agents influence their ping times? Should this be
> configurable or fixed by agent type? What is the benefit of this idea,
> and is it worth the increased complexity?
I suggest a specific solution in the comment on the timeout const.
This is really close now, thanks. Just a few small things left.
https:/
File state/apiserver
https:/
state/apiserver
s/timeout/
?
(more obvious what the timeout is)
I think I'd define it inside state/api. Then the api client
code can use it to calculate an appropriate ping interval.
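The relationship being suggested is roughly the following; the constant and variable names are illustrative, not the ones in the branch:

    // maxClientPingInterval: how long the API server waits for a ping
    // before it drops the connection.
    const maxClientPingInterval = 3 * time.Minute

    // The client then pings comfortably more often than required,
    // e.g. a tenth of the allowed interval.
    var clientPingPeriod = maxClientPingInterval / 10 // 18 seconds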
https:/
state/apiserver
is more
s/action/action asynchronously/
https:/
state/apiserver
I don't think we want to call this if stop has been called.
when I said "below" I meant in the <-timer.C arm of the select
statement.
https:/
File state/apiserver
https:/
state/apiserver
s/10/2/
i don't think there's really any particular reason to make this test run
any longer than necessary.
https:/
state/apiserver
select {
case closed = <-closed:
case <-testing.LongWait:
    c.Fatalf("action never executed")
}
otherwise the test will hang forever if the action never
gets called.
https:/
state/apiserver
closed.
or:
closeDiff := closed.Sub(broken) / time.Millisecond
https:/
state/apiserver
<= 60, gc.Equals, true)
50 <= closeDiff && closeDiff <= 60
is a nice convention for range checks.
I'd also make the allowable range much larger to cater
for dubious scheduling in the cloud, so that this doesn't
break too often.
perhaps 50 <= closeDiff && closeDiff <= 100
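Taken together, the suggested test could take roughly the shape below. The suite name and channel plumbing are illustrative (gc is gocheck, coretesting stands for juju's testing package); the 50-100ms window and the LongWait guard follow the comments above, and newPingTimeout is the watcher sketched earlier.

    func (s *pingTimeoutSuite) TestTimeoutClosesConnection(c *gc.C) {
        closedCh := make(chan time.Time, 1)
        // The action just records when it fires; in the real code it
        // closes the API connection instead.
        pt := newPingTimeout(func() { closedCh <- time.Now() }, 50*time.Millisecond)
        defer pt.Stop()

        // Ping once to show the countdown restarts, then record the time
        // of the final ping and wait for the action.
        pt.Ping()
        time.Sleep(10 * time.Millisecond)
        broken := time.Now()
        pt.Ping()

        var closed time.Time
        select {
        case closed = <-closedCh:
        case <-time.After(coretesting.LongWait):
            c.Fatalf("action never executed")
        }
        closeDiff := closed.Sub(broken) / time.Millisecond
        c.Assert(50 <= closeDiff && closeDiff <= 100, gc.Equals, true)
    }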
Frank Mueller (themue) wrote:
Please take a look.
https:/
File state/apiserver
https:/
state/apiserver
On 2013/11/19 14:44:08, rog wrote:
> s/timeout/
> ?
> (more obvious what the timeout is)
> I think I'd define it inside state/api. Then the api client
> code can use it to calculate an appropriate ping interval.
Renamed it and added a comment. I would rather not move it before the
corresponding API changes, which could be done in another CL.
https:/
state/apiserver
is more
On 2013/11/19 14:44:08, rog wrote:
> s/action/action asynchronously/
Done.
https:/
state/apiserver
On 2013/11/19 14:44:08, rog wrote:
> I don't think we want to call this if stop has been called.
> when I said "below" I meant in the <-timer.C arm of the select
statement.
Ah, I was already wondering.
https:/
File state/apiserver
https:/
state/apiserver
On 2013/11/19 14:44:08, rog wrote:
> s/10/2/
> i don't think there's really any particular reason to make this test
run any
> longer than necessary.
Done.
https:/
state/apiserver
On 2013/11/19 14:44:08, rog wrote:
> select {
> case closed = <-closed:
> case <-testing.LongWait:
> c.Fatalf("action never executed")
> }
> otherwise the test will hang forever if the action never
> gets called.
I've been lucky so far, but that's better.
https:/
state/apiserver
closed.
On 2013/11/19 14:44:08, rog wrote:
> or:
> closeDiff := closed.Sub(broken) / time.Millisecond
Gnah! Sure. :)
https:/
state/apiserver
<= 60, gc.Equals, true)
On 2013/11/19 14:44:08, rog wrote:
> 50 <= closeDiff && closeDiff <= 60
> is a nice convention for range checks.
> I'd also make the allowable range much larger to cater
> for dubious scheduling in the cloud, so that this doesn't
> break too often.
> perhaps 50 <= closeDiff && closeDiff <= 100
Done.
Roger Peppe (rogpeppe) wrote:
On 2013/11/19 15:52:37, mue wrote:
> Please take a look.
>
> https:/
> File state/apiserver
>
> https:/
> state/apiserver
> On 2013/11/19 14:44:08, rog wrote:
> > s/timeout/
> > ?
> > (more obvious what the timeout is)
> >
> > I think I'd define it inside state/api. Then the api client
> > code can use it to calculate an appropriate ping interval.
> Renamed it and added a comment. I would rather not move it before the
> corresponding API changes, which could be done in another CL.
>
> https:/
> state/apiserver
> is more
> On 2013/11/19 14:44:08, rog wrote:
> > s/action/action asynchronously/
> Done.
>
> https:/
> state/apiserver
> On 2013/11/19 14:44:08, rog wrote:
> > I don't think we want to call this if stop has been called.
> > when I said "below" I meant in the <-timer.C arm of the select
> > statement.
> Ah, I was already wondering.
>
> https:/
> File state/apiserver
>
> https:/
> state/apiserver
> On 2013/11/19 14:44:08, rog wrote:
> > s/10/2/
> >
> > i don't think there's really any particular reason to make this test
> > run any longer than necessary.
> Done.
>
> https:/
> state/apiserver
> On 2013/11/19 14:44:08, rog wrote:
> > select {
> > case closed = <-closed:
> > case <-testing.LongWait:
> >     c.Fatalf("action never executed")
> > }
> >
> > otherwise the test will hang forever if the action never
> > gets called.
> I've been lucky so far, but that's better.
>
> https:/
> state/apiserver
> closed.
> 1000000
> On 2013/11/19 14:44:08, rog wrote:
> > or:
> > closeDiff := closed.Sub(broken) / time.Millisecond
> Gnah! Sure. :)
>
> https:/
> state/apiserver
> <= 60, gc.Equals, true)
> On 2013/11/19 14:44:08, rog wrote:
> > 50 <= closeDiff && closeDiff <= 60
> >
> > is a nice convention for range checks.
> >
> > I'd also make the allowable range much larger to cater
> > for dubious scheduling in the cloud, so that this doesn't
> > break too often.
> >
> > perhaps 50 <= closeDiff && closeDiff <= 100
> Done.
LGTM, thanks.
Reviewers: mp+194862_code.launchpad.net,
Message:
Please take a look.
Description:
apiserver: analyzing ping timeouts
This is a first fix for bug #1205451, handling externally
killed juju nodes. Currently the pings from clients to
the API server are not analyzed. With this change each
ping resets a timer that waits for 3 minutes; if the
timer is not reset (by a ping), its signal tells the
pinger to kill the connection.
The overall problem with the presence pinger, especially
with regard to HA environments, is not covered by this
change.
https://code.launchpad.net/~themue/juju-core/055-apiserver-heartbeat-analyzing/+merge/194862
(do not edit description out of merge proposal)
Please review this at https://codereview.appspot.com/24040044/
Affected files (+114, -8 lines):
  A [revision details]
  M state/apiserver/export_test.go
  M state/apiserver/root.go
  M state/apiserver/root_test.go