Merge lp://staging/~dave-cheney/pyjuju/go-provisioning-strawman into lp://staging/pyjuju/go
- go-provisioning-strawman
- Merge into go
Status: | Merged |
---|---|
Approved by: | Gustavo Niemeyer |
Approved revision: | 207 |
Merged at revision: | 201 |
Proposed branch: | lp://staging/~dave-cheney/pyjuju/go-provisioning-strawman |
Merge into: | lp://staging/pyjuju/go |
Diff against target: |
239 lines (+198/-4) 2 files modified
cmd/jujud/provisioning.go (+114/-3) cmd/jujud/provisioning_test.go (+84/-1) |
To merge this branch: | bzr merge lp://staging/~dave-cheney/pyjuju/go-provisioning-strawman |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Juju Engineering | Pending | ||
Review via email: mp+107714@code.staging.launchpad.net |
Commit message
Description of the change
cmd/jujud: strawman provisioning agent
This is a strawman provisioning agent proposal.
Following Gustavo's suggestion, this revision does not include
any functionality to respond to machines changes.
Roger Peppe (rogpeppe) wrote : | # |
Dave Cheney (dave-cheney) wrote : | # |
Thanks Rog. I'll address those in the pre req by adding machine.
d
On 29/05/2012, at 18:44, Roger Peppe <email address hidden> wrote:
> this is looking nice.
> a few minor remarks below.
>
>
>
> https:/
> File cmd/jujud/
>
> https:/
> cmd/jujud/
> "provider-
> s/PROVIDER_
>
> UNDERSCORED_CAPS aren't conventional.
>
> https:/
> cmd/jujud/
> reports a provider id %q, skipping", m.Id(), id)
> s/machine-
>
> https:/
> cmd/jujud/
> machine-%010d key not found: %q", m.Id(), PROVIDER_
> i think 'machine %d' would work better - the %010 stuff is detail
> internal to state.
>
> https:/
> cmd/jujud/
> rather than doing the type conversion twice, perhaps:
>
> if id, ok := id.(string); ok {
> return id, nil
> }
> return "", fmt.Errorf("machine %d has invalid value for %s: %#v",
> m.Id(), PROVIDER_
>
> https:/
> cmd/jujud/
> this can't happen.
>
> https:/
>
> --
> https:/
> You are the owner of lp:~dave-cheney/juju/go-provisioning-strawman.
Gustavo Niemeyer (niemeyer) wrote : | # |
This is going in a great direction Dave, and I'm excited to see this.
At the same time, I'm concerned about the approach. I'm not comfortable
with us knowingly getting in critical chunks of the functionality while
knowing that it "lacks robust
error checking and retry logic", as I can't help you fixing issues and
catching up bugs that you tell me are known but you don't care about
right now.
I'd prefer if this spike, which is a great test bed, was now broken down
into smaller chunks that you are actually *sure* about, so that we and
the other developers can jointly review and agree is a good step
forward.
I don't mind if each of those steps are not entirely functional, for
example. We can have a first one that is just about connecting and
reconnecting to the environment reliably, for instance, ignoring any
machine management, and so on.
What follows is a few additional ideas on it:
https:/
File cmd/jujud/
https:/
cmd/jujud/
"provider-
providerMachineId would be the convention, I think
https:/
File cmd/jujud/
https:/
cmd/jujud/
Shouldn't the suite be using the testing package's log helper instead?
https:/
cmd/jujud/
That's not great. That's the kind of reason that we need a proper Stop
method on the provisioning agent, so that we can be sure that it is
happy to terminate, rather than waiting a random amount of time for it
to finish running. Otherwise, we're getting into exactly the same kind
of issue we got into in the Python implementation.
Please have a look at the underlying infrastructure of watchers for
inspiration. They reliably terminate, both erroring or not. They also do
that synchronously so that it is possible to wait, and to check for
errors from them.
We need something like that for the provisioning agent.
Gustavo Niemeyer (niemeyer) wrote : | # |
Some more comments.
Also, it just occurred to me that, as a very first step getting just the
testing environment in place, with zero functionality, in a way that
reliably starts and stops, is already a fantastic task on itself.
https:/
File cmd/jujud/
https:/
cmd/jujud/
state.Open(
This is a rather extensive function that might be broken down further in
more manageable bits for clarity.
https:/
cmd/jujud/
environment or the machine topology and action both.
Step 1 needs to be able to be more resilient, and step 2 must be able to
fallback to step 1 in case of issues with the connection. It must also
be able to deal with Stop of the watcher and re-watching, but that
should be somewhat natural.
https:/
cmd/jujud/
reports a provider id %q, skipping", m.Id(), id)
We should never talk about internal ZooKeeper keys to the user. This
should be saying "machine %d" instead.
https:/
cmd/jujud/
in ZK
Let's please not reference ZooKeeper all the time. It is the environment
state that we have at hand, with a nice abstraction that we created
precisely so we don't think in terms of ZooKeeper all the time.
ZooKeeper is hopefully going away soon, but the state will remain as-is.
Gustavo Niemeyer (niemeyer) wrote : | # |
Thanks Dave. That is now looking like a great step forward that we can
polish for integration easily.
Some comments:
https:/
File cmd/jujud/
https:/
cmd/jujud/
What happens if the connection is broken? We'll need to reestablish it
and restart from the beginning, since all the assumptions are now wrong.
Maybe this has to be moved into the outer loop or something. Will leave
details with you at this point.
That said, I suggest keeping this for the branch coming right after this
one. It feels like what we have here is a solid step forward, and we can
comfortably iterate over the state reestablishment in a new branch
before continuing with everything else.
https:/
cmd/jujud/
Those two types feel unnecessary. It looks like your model would work
equally well if we had environWatcher and machinesWatcher both within
the Provisioner struct in place of the environment and machines fields,
plus stopMachinesWatcher and stopEnvironWatcher methods (the two
invalidate ones), with no further changes.
https:/
cmd/jujud/
This should return the error it finds. Call sites can selectively ignore
it, but methods such as Provisioner.Stop should return the problems it
finds, so we can have a handle on them if we want (testing should
definitely check, for instance).
https:/
cmd/jujud/
watcher exited: %v", e.watcher.Stop())
That Stop call is the most critical task done by this method. It
shouldn't be disguised at the end of a log call.
https:/
cmd/jujud/
Nice organization.
https:/
cmd/jujud/
configuartion applied")
s/configuartion
https:/
cmd/jujud/
I'm wondering if rather than "continue" in the problems above, we should
"break", and here have something that verifies if the state is still
sane? Otherwise this will be an infinite loop, won't it?
https:/
cmd/jujud/
I suggest logic like this here:
p.tomb.Kill(nil)
err1 := p.stopEnvironWa
err2 := p.stopMachinesW
err3 := p.tomb.Wait()
for err := []error{err3, err2, err1} {
if err != nil {
return err
}
}
return nil
- 204. By Dave Cheney
-
log stop errors
- 205. By Dave Cheney
-
merge from trunk
- 206. By Dave Cheney
-
responding to review comments
Gustavo Niemeyer (niemeyer) wrote : | # |
https:/
File cmd/jujud/
https:/
cmd/jujud/
On 2012/06/01 01:58:27, dfc wrote:
> > What happens if the connection is broken? We'll need to reestablish
it and
> > restart from the beginning, since all the assumptions are now wrong.
Maybe
> this
> > has to be moved into the outer loop or something. Will leave details
with you
> at
> > this point.
> I looked into this more this morning and it looks like if the
underlying
> connection to ZK is interrupted then the c client reconnects behind
the scenes.
> I tested this by running the PA, then shutting down zk, then starting
it again a
> minute later. None of the watchers exited.
> I'm going to keep playing with this, but at this point, if gozk never
reports an
> error when it cant talk to the zk server, I can't think of a way to
detect
> connection failure.
We've covered this on juju-dev@. A TODO would be good here as well.
https:/
File cmd/jujud/
https:/
cmd/jujud/
will receive the new *ConfigNode
As we discussed, that logic can be simplified as we can bubble up onto
the top from any hiccups, rather than retrying in place. I'm happy both
for this branch to be changed to cover that, or for it to be integrated
with these and then refactored. I'll comment in-place and let the
decision with you.
https:/
cmd/jujud/
// when a change in the environment configuration is detected.
https:/
cmd/jujud/
invalidates the current environWatcher.
s/and invalidates//
https:/
cmd/jujud/
reported error on Stop: %v", err)
This is Printf rather than Debugf, so we need to be a bit nicer to the
user (no method and field names). Something like this might do:
"environment configuration watcher: %v"
https:/
cmd/jujud/
receive the new *ConfigNode when a
Needs updating.
https:/
cmd/jujud/
reported error on Stop: %v", err)
"machines watcher: %v"
https:/
cmd/jujud/
state.IsValid() here to exit cleanly if
IsConnected?
https:/
- 207. By Dave Cheney
-
Responding to review comments, final tweaks
Gustavo Niemeyer (niemeyer) wrote : | # |
That last set of changes are awesome. LGTM.
this is looking nice.
a few minor remarks below.
https:/ /codereview. appspot. com/6250068/ diff/2001/ cmd/jujud/ provisioning. go provisioning. go (right):
File cmd/jujud/
https:/ /codereview. appspot. com/6250068/ diff/2001/ cmd/jujud/ provisioning. go#newcode17 provisioning. go:17: const PROVIDER_MACHINE_ID = machine- id" MACHINE_ ID/providerMach ineId/
cmd/jujud/
"provider-
s/PROVIDER_
UNDERSCORED_CAPS aren't conventional.
https:/ /codereview. appspot. com/6250068/ diff/2001/ cmd/jujud/ provisioning. go#newcode120 provisioning. go:120: return fmt.Errorf( "machine- %010d already %010d/machine %d/
cmd/jujud/
reports a provider id %q, skipping", m.Id(), id)
s/machine-
https:/ /codereview. appspot. com/6250068/ diff/2001/ cmd/jujud/ provisioning. go#newcode208 provisioning. go:208: return "", fmt.Errorf( "findProviderId : MACHINE_ ID)
cmd/jujud/
machine-%010d key not found: %q", m.Id(), PROVIDER_
i think 'machine %d' would work better - the %010 stuff is detail
internal to state.
https:/ /codereview. appspot. com/6250068/ diff/2001/ cmd/jujud/ provisioning. go#newcode210 provisioning. go:210: if _, ok := id.(string); !ok {
cmd/jujud/
rather than doing the type conversion twice, perhaps:
if id, ok := id.(string); ok { MACHINE_ ID, id)
return id, nil
}
return "", fmt.Errorf("machine %d has invalid value for %s: %#v",
m.Id(), PROVIDER_
https:/ /codereview. appspot. com/6250068/ diff/2001/ cmd/jujud/ provisioning. go#newcode221 provisioning. go:221: if len(insts) < 1 {
cmd/jujud/
this can't happen.
https:/ /codereview. appspot. com/6250068/