Merge lp://staging/~themue/juju-core/055-apiserver-heartbeat-analyzing into lp://staging/~go-bot/juju-core/trunk
Status: Merged
Approved by: Frank Mueller
Approved revision: no longer in the source branch
Merged at revision: 2085
Proposed branch: lp://staging/~themue/juju-core/055-apiserver-heartbeat-analyzing
Merge into: lp://staging/~go-bot/juju-core/trunk
Diff against target: 232 lines (+131/-14), 4 files modified
  state/apiserver/admin.go (+1/-1)
  state/apiserver/export_test.go (+2/-0)
  state/apiserver/root.go (+102/-13)
  state/apiserver/root_test.go (+26/-0)
To merge this branch: bzr merge lp://staging/~themue/juju-core/055-apiserver-heartbeat-analyzing
Related bugs: none listed
Reviewer: Juju Engineering (review requested, status: Pending)
Review via email: mp+194862@code.staging.launchpad.net
Commit message
apiserver: analyzing ping timeouts
This is a first fix for bug #1205451, handling externally
killed juju nodes. Currently the pings from clients to
the API server are not analyzed. With this change each
ping resets a timer that waits for 3 minutes; if the
timer is not reset (by a ping), its signal tells the
pinger to kill the connection.
The overall problem with the presence pinger, especially
with regard to HA environments, is not covered by this
change.
Description of the change
apiserver: analyzing ping timeouts
This is a first fix for bug #1205451, handling externally
killed juju nodes. Currently the pings from clients to
the API server are not analyzed. With this change each
ping resets a timer that waits for 3 minutes; if the
timer is not reset (by a ping), its signal tells the
pinger to kill the connection.
The overall problem with the presence pinger, especially
with regard to HA environments, is not covered by this
change.
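As a bare-bones illustration of the behaviour described above (all names here are illustrative and not the code in this branch), the server-side watcher amounts to a timer that is restarted on every ping and kills the connection when it fires:

    package sketch

    import "time"

    // watchPings calls kill unless a ping arrives at least once per timeout.
    func watchPings(pings <-chan struct{}, kill func(), timeout time.Duration) {
        timer := time.NewTimer(timeout)
        defer timer.Stop()
        for {
            select {
            case _, ok := <-pings:
                if !ok {
                    return // connection went away normally
                }
                if !timer.Stop() {
                    <-timer.C // drain a concurrent expiry before resetting
                }
                timer.Reset(timeout) // each ping restarts the countdown
            case <-timer.C:
                kill() // no ping within the timeout: drop the connection
                return
            }
        }
    }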
Frank Mueller (themue) wrote:
Roger Peppe (rogpeppe) wrote:
This is going in a reasonable direction, but I have a few comments to
make. It's a bit of a pity that this adds yet another goroutine per
connection, but I think that can be sorted out at some other time (we
could probably have one goroutine that deals with all the pinging). The
other thing that this CL does not do is actually close the client
connection. That can probably be left for a followup though.
https:/
File state/apiserver
https:/
state/apiserver
3*time.Minute)
This seems weird to me. For a start, it's racy - technically a client is
allowed to call Ping twice concurrently, and if it does that, we'll have
two srvPingers active. Secondly, it seems odd that if a client never
bothers to ping, they will never get one of these allocated, so we'll
never get the timeout behaviour. Thirdly, just deleting all the
resources without killing the connection seems unlikely to be worth it.
I think you need to set this up inside apiRootForEntity.
Additionally, I don't think you should return the srvPinger directly
from this method - that exposes its Stop method to the API, which you
probably don't want to do. I make a suggestion below.
One other thing: I think the actual value of the duration should perhaps
be tied to the value used by the client (perhaps the client should use
the value divided by 10 or something)
https:/
state/apiserver
s/interface{
https:/
state/apiserver
running killer.
This type isn't really a pinger - it doesn't do any pinging.
How about something like this?
type resourceTimeout struct {
    tomb     tomb.Tomb
    timer    *time.Timer
    reset    chan struct{}
    resource killer
}

// newResourceTimeout will call resource.Kill if
// Refresh is not repeatedly called on the returned resourceTimeout
// within the given timeout.
func newResourceTimeout(...) *resourceTimeout
https:/
state/apiserver
This needs to make sure that the pinger isn't being stopped
by selecting on the tomb's dying chan too.
https:/
state/apiserver
Why is this commented out?
https:/
state/apiserver
Why close the reset channel?
https:/
state/apiserver
I think you should use NewTimer here (and Stop it after the select) to
avoid creating a bit of garbage each time a value is received on reset.
Even better, just use a single Timer instance inside ...
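For what it is worth, the body behind that suggestion could look roughly like the sketch below. It assumes the resourceTimeout struct quoted above plus the tomb package juju used at the time; the loop and Refresh are my own filling-in of the idea (single reusable Timer, tomb-based shutdown), not code from the branch.

    // loop owns the single Timer instance. It is started from
    // newResourceTimeout with:
    //
    //     go func() {
    //         defer rt.tomb.Done()
    //         rt.tomb.Kill(rt.loop(timeout))
    //     }()
    func (rt *resourceTimeout) loop(timeout time.Duration) error {
        rt.timer = time.NewTimer(timeout)
        defer rt.timer.Stop()
        for {
            select {
            case <-rt.tomb.Dying():
                // Stop was requested: exit without killing the resource.
                return nil
            case <-rt.reset:
                // Refresh arrived in time: restart the countdown.
                if !rt.timer.Stop() {
                    <-rt.timer.C // drain a concurrent expiry
                }
                rt.timer.Reset(timeout)
            case <-rt.timer.C:
                // No Refresh within the timeout: kill the guarded resource.
                rt.resource.Kill()
                return nil
            }
        }
    }

    // Refresh restarts the countdown; once the loop has finished it
    // returns via the dying channel, so callers do not block forever.
    func (rt *resourceTimeout) Refresh() {
        select {
        case rt.reset <- struct{}{}:
        case <-rt.tomb.Dying():
        }
    }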
William Reade (fwereade) wrote:
Thanks rog -- just emphasising a couple of your points.
https:/
File state/apiserver
https:/
state/apiserver
3*time.Minute)
On 2013/11/12 16:01:21, rog wrote:
> This seems weird to me. For a start, it's racy - technically a client
> is allowed to call Ping twice concurrently, and if it does that, we'll
> have two srvPingers active. Secondly, it seems odd that if a client
> never bothers to ping, they will never get one of these allocated, so
> we'll never get the timeout behaviour. Thirdly, just deleting all the
> resources without killing the connection seems unlikely to be worth it.
> I think you need to set this up inside apiRootForEntity.
> Additionally, I don't think you should return the srvPinger directly
> from this method - that exposes its Stop method to the API, which you
> probably don't want to do. I make a suggestion below.
> One other thing: I think the actual value of the duration should
> perhaps be tied to the value used by the client (perhaps the client
> should use the value divided by 10 or something)
Strong agreement with all of this. Create one per connection, use the
Ping method purely for resetting the connection-killing timeout. And
please make a timeout actually kill the connection -- that's the point
of this branch, and I don't think it should be left to a later one.
https:/
state/apiserver
running killer.
On 2013/11/12 16:01:21, rog wrote:
> This type isn't really a pinger - it doesn't do any pinging.
> How about something like this?
>
> type resourceTimeout struct {
>     tomb     tomb.Tomb
>     timer    *time.Timer
>     reset    chan struct{}
>     resource killer
> }
>
> // newResourceTimeout will call resource.Kill if
> // Refresh is not repeatedly called on the returned resourceTimeout
> // within the given timeout.
> func newResourceTimeout(...) *resourceTimeout
FWIW, this is not (should not be) about killing resources -- it should
be about closing the connection. The logic for cleaning up resources
already exists, and is triggered by closing the connection.
Frank Mueller (themue) wrote:
Please take a look.
https:/
File state/apiserver
https:/
state/apiserver
3*time.Minute)
On 2013/11/12 16:01:21, rog wrote:
> This seems weird to me. For a start, it's racy - technically a client
> is allowed to call Ping twice concurrently, and if it does that, we'll
> have two srvPingers active. Secondly, it seems odd that if a client
> never bothers to ping, they will never get one of these allocated, so
> we'll never get the timeout behaviour. Thirdly, just deleting all the
> resources without killing the connection seems unlikely to be worth it.
> I think you need to set this up inside apiRootForEntity.
> Additionally, I don't think you should return the srvPinger directly
> from this method - that exposes its Stop method to the API, which you
> probably don't want to do. I make a suggestion below.
> One other thing: I think the actual value of the duration should
> perhaps be tied to the value used by the client (perhaps the client
> should use the value divided by 10 or something)
I found a different approach. Only the part about the duration still
needs to be discussed and may become part of the next round; I'm not
sure about it yet.
https:/
state/apiserver
On 2013/11/12 16:01:21, rog wrote:
> s/interface{
Done.
https:/
state/apiserver
running killer.
On 2013/11/12 16:01:21, rog wrote:
> This type isn't really a pinger - it doesn't do any pinging.
> How about something like this?
>
> type resourceTimeout struct {
>     tomb     tomb.Tomb
>     timer    *time.Timer
>     reset    chan struct{}
>     resource killer
> }
>
> // newResourceTimeout will call resource.Kill if
> // Refresh is not repeatedly called on the returned resourceTimeout
> // within the given timeout.
> func newResourceTimeout(...) *resourceTimeout
Yeah, looks better. Done.
https:/
state/apiserver
On 2013/11/12 16:01:21, rog wrote:
> This needs to make sure that the pinger isn't being stopped
> by selecting on the tomb's dying chan too.
Done.
https:/
state/apiserver
On 2013/11/12 16:01:21, rog wrote:
> Why is this commented out?
Argh, forgot after a test run. :/
https:/
state/apiserver
On 2013/11/12 16:01:21, rog wrote:
> Why close the reset channel?
Done.
https:/
state/apiserver
On 2013/11/12 16:01:21, rog wrote:
> I think you should use NewTimer here (and Stop it after th...
Roger Peppe (rogpeppe) wrote:
Not quite there yet. Some thoughts below.
https:/
File state/apiserver
https:/
state/apiserver
Rather than having initialRoot inside srvRoot, I'd prefer
to explicitly include the members we need (srv and rpcConn).
https:/
state/apiserver
newPingTimeout(
There's nothing that cleans this up when the connection is killed.
Also, it would be good to have the timeout in a named constant,
and connected somehow to the client ping logic.
https:/
state/apiserver
the client pings and
// Pinger returns an object that can be pinged
// by calling its Ping method. If this method
// is not called frequently enough, the connection
// will be dropped.
?
https:/
state/apiserver
Do we need the tag here?
https:/
state/apiserver
instance
// newPingTimeout returns a new pingTimeout instance
// that invokes the given action if there is more
// than the given timeout interval between calls
// to its Ping method.
?
https:/
state/apiserver
I *think* that if you do:
defer action()
here that you'll be able to register the pingTimeout
as a resource and have the action close the rpc connection.
alternatively, call
go pt.action()
below instead of calling it synchronously.
On balance I think I prefer the latter option.
I'd make action() not return an error, so the caller
can decide what they might want to do with any error.
https:/
state/apiserver
I don't think this is ever called.
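Pulling those points together, the pingTimeout under review could end up shaped roughly like this. It mirrors the resourceTimeout sketch further up, with the action fired asynchronously from the timer arm and taking no error return; every name here is an illustrative stand-in, not the exact code in the branch (which also registers the watcher so normal connection teardown stops it).

    import (
        "time"

        "launchpad.net/tomb"
    )

    // pingTimeoutDuration is the named constant asked for above; 3 minutes
    // matches the proposal's description.
    const pingTimeoutDuration = 3 * time.Minute

    // pingTimeout invokes action if Ping is not called at least once per
    // timeout interval.
    type pingTimeout struct {
        tomb    tomb.Tomb
        action  func()
        timeout time.Duration
        reset   chan struct{}
    }

    func newPingTimeout(action func(), timeout time.Duration) *pingTimeout {
        pt := &pingTimeout{
            action:  action,
            timeout: timeout,
            reset:   make(chan struct{}),
        }
        go func() {
            defer pt.tomb.Done()
            pt.tomb.Kill(pt.loop())
        }()
        return pt
    }

    // Ping restarts the countdown; this is all that is exposed to clients.
    func (pt *pingTimeout) Ping() {
        select {
        case pt.reset <- struct{}{}:
        case <-pt.tomb.Dying():
        }
    }

    // Stop ends the watcher without triggering the action; it stays
    // internal to the server rather than being exported via the API.
    func (pt *pingTimeout) Stop() error {
        pt.tomb.Kill(nil)
        return pt.tomb.Wait()
    }

    func (pt *pingTimeout) loop() error {
        timer := time.NewTimer(pt.timeout)
        defer timer.Stop()
        for {
            select {
            case <-pt.tomb.Dying():
                return nil
            case <-pt.reset:
                if !timer.Stop() {
                    <-timer.C // drain a concurrent expiry before resetting
                }
                timer.Reset(pt.timeout)
            case <-timer.C:
                // Fire asynchronously so this goroutine can exit at once;
                // the action (e.g. closing the rpc connection) deals with
                // any error itself.
                go pt.action()
                return nil
            }
        }
    }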
Frank Mueller (themue) wrote:
Please take a look.
https:/
File state/apiserver
https:/
state/apiserver
On 2013/11/18 18:19:48, rog wrote:
> Rather than having initialRoot inside srvRoot, I'd prefer
> to explicitly include the members we need (srv and rpcConn).
Done.
https:/
state/apiserver
newPingTimeout(
On 2013/11/18 18:19:48, rog wrote:
> There's nothing that cleans this up when the connection is killed.
> Also, it would be good to have the timeout in a named constant,
> and connected somehow to the client ping logic.
Yep, cleanup had been prepared but led to a deadlock in the first
approach. Will be changed.
Please describe "... connected somehow to the client ping logic" more.
Should different agents influence their ping times? Should this be
configurable or fixed by agent type? What is the benefit of this idea,
and is it worth the increased complexity?
https:/
state/apiserver
the client pings and
On 2013/11/18 18:19:48, rog wrote:
> // Pinger returns an object that can be pinged
> // by calling its Ping method. If this method
> // is not called frequently enough, the connection
> // will be dropped.
> ?
Done.
https:/
state/apiserver
On 2013/11/18 18:19:48, rog wrote:
> Do we need the tag here?
Removed, I only needed it during testing.
https:/
state/apiserver
instance
On 2013/11/18 18:19:48, rog wrote:
> // newPingTimeout returns a new pingTimeout instance
> // that invokes the given action if there is more
> // than the given timeout interval between calls
> // to its Ping method.
> ?
Done.
https:/
state/apiserver
On 2013/11/18 18:19:48, rog wrote:
> I *think* that if you do:
> defer action()
> here that you'll be able to register the pingTimeout
> as a resource and have the action close the rpc connection.
> alternatively, call
> go pt.action()
> below instead of calling it synchronously.
> On balance I think I prefer the latter option.
> I'd make action() not return an error, so the caller
> can decide what they might want to do with any error.
Done.
https:/
state/apiserver
On 2013/11/18 18:19:48, rog wrote:
> I don't think this is ever called.
Now it's done; it had already been intended for the timeout cleanup.
Roger Peppe (rogpeppe) wrote:
> Please describe "... connected somehow to the client ping logic" more.
> Should different agents influence their ping times? Should this be
> configurable or fixed by agent type? What is the benefit of this idea,
> and is it worth the increased complexity?
I suggest a specific solution in the comment on the timeout const.
This is really close now, thanks. Just a few small things left.
https:/
File state/apiserver
https:/
state/apiserver
s/timeout/
?
(more obvious what the timeout is)
I think I'd define it inside state/api. Then the api client
code can use it to calculate an appropriate ping interval.
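The relationship being suggested is roughly the following; the constant and variable names are illustrative, not the ones in the branch:

    // maxClientPingInterval: how long the API server waits for a ping
    // before it drops the connection.
    const maxClientPingInterval = 3 * time.Minute

    // The client then pings comfortably more often than required,
    // e.g. a tenth of the allowed interval.
    var clientPingPeriod = maxClientPingInterval / 10 // 18 seconds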
https:/
state/apiserver
is more
s/action/action asynchronously/
https:/
state/apiserver
I don't think we want to call this if stop has been called.
when I said "below" I meant in the <-timer.C arm of the select
statement.
https:/
File state/apiserver
https:/
state/apiserver
s/10/2/
i don't think there's really any particular reason to make this test run
any longer than necessary.
https:/
state/apiserver
select {
case closed = <-closed:
case <-testing.LongWait:
    c.Fatalf("action never executed")
}
otherwise the test will hang forever if the action never
gets called.
https:/
state/apiserver
closed.
or:
closeDiff := closed.Sub(broken) / time.Millisecond
https:/
state/apiserver
<= 60, gc.Equals, true)
50 <= closeDiff && closeDiff <= 60
is a nice convention for range checks.
I'd also make the allowable range much larger to cater
for dubious scheduling in the cloud, so that this doesn't
break too often.
perhaps 50 <= closeDiff && closeDiff <= 100
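Taken together, the suggested test could take roughly the shape below. The suite name and channel plumbing are illustrative (gc is gocheck, coretesting stands for juju's testing package); the 50-100ms window and the LongWait guard follow the comments above, and newPingTimeout is the watcher sketched earlier.

    func (s *pingTimeoutSuite) TestTimeoutClosesConnection(c *gc.C) {
        closedCh := make(chan time.Time, 1)
        // The action just records when it fires; in the real code it
        // closes the API connection instead.
        pt := newPingTimeout(func() { closedCh <- time.Now() }, 50*time.Millisecond)
        defer pt.Stop()

        // Ping once to show the countdown restarts, then record the time
        // of the final ping and wait for the action.
        pt.Ping()
        time.Sleep(10 * time.Millisecond)
        broken := time.Now()
        pt.Ping()

        var closed time.Time
        select {
        case closed = <-closedCh:
        case <-time.After(coretesting.LongWait):
            c.Fatalf("action never executed")
        }
        closeDiff := closed.Sub(broken) / time.Millisecond
        c.Assert(50 <= closeDiff && closeDiff <= 100, gc.Equals, true)
    }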
Frank Mueller (themue) wrote:
Please take a look.
https:/
File state/apiserver
https:/
state/apiserver
On 2013/11/19 14:44:08, rog wrote:
> s/timeout/
> ?
> (more obvious what the timeout is)
> I think I'd define it inside state/api. Then the api client
> code can use it to calculate an appropriate ping interval.
Renamed it and added a comment. I would rather not move it before the
corresponding API changes, which could be done in another CL.
https:/
state/apiserver
is more
On 2013/11/19 14:44:08, rog wrote:
> s/action/action asynchronously/
Done.
https:/
state/apiserver
On 2013/11/19 14:44:08, rog wrote:
> I don't think we want to call this if stop has been called.
> when I said "below" I meant in the <-timer.C arm of the select
statement.
Ah, I was already wondering.
https:/
File state/apiserver
https:/
state/apiserver
On 2013/11/19 14:44:08, rog wrote:
> s/10/2/
> i don't think there's really any particular reason to make this test
run any
> longer than necessary.
Done.
https:/
state/apiserver
On 2013/11/19 14:44:08, rog wrote:
> select {
> case closed = <-closed:
> case <-testing.LongWait:
> c.Fatalf("action never executed")
> }
> otherwise the test will hang forever if the action never
> gets called.
I've been lucky so far, but that's better.
https:/
state/apiserver
closed.
On 2013/11/19 14:44:08, rog wrote:
> or:
> closeDiff := closed.Sub(broken) / time.Millisecond
Gnah! Sure. :)
https:/
state/apiserver
<= 60, gc.Equals, true)
On 2013/11/19 14:44:08, rog wrote:
> 50 <= closeDiff && closeDiff <= 60
> is a nice convention for range checks.
> I'd also make the allowable range much larger to cater
> for dubious scheduling in the cloud, so that this doesn't
> break too often.
> perhaps 50 <= closeDiff && closeDiff <= 100
Done.
Roger Peppe (rogpeppe) wrote:
On 2013/11/19 15:52:37, mue wrote:
> Please take a look.
>
> https:/
> File state/apiserver
>
> https:/
> state/apiserver
> On 2013/11/19 14:44:08, rog wrote:
> > s/timeout/
> > ?
> > (more obvious what the timeout is)
> >
> > I think I'd define it inside state/api. Then the api client
> > code can use it to calculate an appropriate ping interval.
> Renamed it and added a comment. I would rather not move it before the
> corresponding API changes, which could be done in another CL.
>
> https:/
> state/apiserver
> is more
> On 2013/11/19 14:44:08, rog wrote:
> > s/action/action asynchronously/
> Done.
>
> https:/
> state/apiserver
> On 2013/11/19 14:44:08, rog wrote:
> > I don't think we want to call this if stop has been called.
> > when I said "below" I meant in the <-timer.C arm of the select
> > statement.
> Ah, I was already wondering.
>
> https:/
> File state/apiserver
>
> https:/
> state/apiserver
> On 2013/11/19 14:44:08, rog wrote:
> > s/10/2/
> >
> > i don't think there's really any particular reason to make this test
> > run any longer than necessary.
> Done.
>
> https:/
> state/apiserver
> On 2013/11/19 14:44:08, rog wrote:
> > select {
> > case closed = <-closed:
> > case <-testing.LongWait:
> >     c.Fatalf("action never executed")
> > }
> >
> > otherwise the test will hang forever if the action never
> > gets called.
> I've been lucky so far, but that's better.
>
> https:/
> state/apiserver
> closed.
> 1000000
> On 2013/11/19 14:44:08, rog wrote:
> > or:
> > closeDiff := closed.Sub(broken) / time.Millisecond
> Gnah! Sure. :)
>
> https:/
> state/apiserver
> <= 60, gc.Equals, true)
> On 2013/11/19 14:44:08, rog wrote:
> > 50 <= closeDiff && closeDiff <= 60
> >
> > is a nice convention for range checks.
> >
> > I'd also make the allowable range much larger to cater
> > for dubious scheduling in the cloud, so that this doesn't
> > break too often.
> >
> > perhaps 50 <= closeDiff && closeDiff <= 100
> Done.
LGTM, thanks.
Reviewers: mp+194862_code.launchpad.net,
Message:
Please take a look.
Description:
apiserver: analyzing ping timeouts
This is a first fix for bug #1205451, handling externally
killed juju nodes. Currently the pings from clients to
the API server are not analyzed. With this change each
ping resets a timer that waits for 3 minutes; if the
timer is not reset (by a ping), its signal tells the
pinger to kill the connection.
The overall problem with the presence pinger, especially
with regard to HA environments, is not covered by this
change.
https://code.launchpad.net/~themue/juju-core/055-apiserver-heartbeat-analyzing/+merge/194862
(do not edit description out of merge proposal)
Please review this at https://codereview.appspot.com/24040044/
Affected files (+114, -8 lines):
  A [revision details]
  M state/apiserver/export_test.go
  M state/apiserver/root.go
  M state/apiserver/root_test.go