Merge lp://staging/~jason-hobbs/maas/run-single-process-single-thread into lp://staging/~maas-committers/maas/trunk

Proposed by Jason Hobbs
Status: Rejected
Rejected by: Andres Rodriguez
Proposed branch: lp://staging/~jason-hobbs/maas/run-single-process-single-thread
Merge into: lp://staging/~maas-committers/maas/trunk
Diff against target: 12 lines (+4/-1)
1 file modified
contrib/maas-http.conf (+4/-1)
To merge this branch: bzr merge lp://staging/~jason-hobbs/maas/run-single-process-single-thread
Reviewers:
  Andres Rodriguez (community): Disapprove
  Christian Reis (community): Needs Information
Review via email: mp+238757@code.staging.launchpad.net

Commit message

Only run a single wsgi process/thread for MAAS.

Description of the change

This is the 'big hammer' approach to solving the general case of this problem for now. I have another branch that fixes only this specific case: lp:~jason-hobbs/maas/lock-around-node-acquire.

Given where we are in the release cycle, I think it's safer to use the big hammer approach for now and then come back later and work on finer-grained locking that would allow multiple processes to run safely.
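
For illustration, this is roughly what such a change to contrib/maas-http.conf could look like, assuming MAAS is served through mod_wsgi's daemon mode. The directive values here are an assumption for illustration, not the actual diff (which is not shown on this page):

    # Illustrative sketch only, not the actual diff from this branch.
    # Pinning the WSGI daemon to a single process with a single thread
    # serializes all request handling: the "big hammer".
    WSGIDaemonProcess maas processes=1 threads=1
    WSGIProcessGroup maas

The trade-off, discussed in the review below, is that every request, including long-running ones, queues behind a single worker.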

Revision history for this message
Christian Reis (kiko) wrote :

Honest and thoughtful answer: what problems will this cause?

(Warning: if you say "nothing" I'll just say NAK!)

review: Needs Information
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I don't know of any problems specifically, but it would be good to consider performance and deadlock potential.

I don't think performance will be a big issue - I've done performance testing with MAAS before, for API request handling, and it didn't make a big difference how many processes or threads were running. I did a quick test for acquire performance just now:

Time to allocate 30 nodes:

With two processes running (incorrect results):
real 0m4.424s
user 0m6.404s
sys 0m0.618s

With one process running (correct results):
real 0m4.542s
user 0m3.220s
sys 0m0.399s

We could do more performance testing, such as UI load time, to see whether this change affects it much.

Deadlock potential: I don't think this is a big concern either, but there may be a situation where completing the processing of one HTTP request depends on another thread being available to handle a second request, like:

1) region gets request
2) region calls cluster to get some info
3) cluster calls region's HTTP interface for some reason.
4) deadlock, because the region can't process the cluster's request until the region's original request completes.

I don't know if that can actually happen, and if it can, it seems like it would already be buggy behavior, since handling two requests of that type at the same time would produce a deadlock.
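
To make that failure mode concrete, here is a small self-contained Python sketch (standard library only, not MAAS code): a single-threaded HTTP server whose handler makes a request back to itself, standing in for the region -> cluster -> region loop.

    import http.server
    import threading
    import urllib.request

    PORT = 8912  # arbitrary port, chosen for the demo

    class Handler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/outer":
                try:
                    # With only one thread, nothing is free to serve this
                    # inner request while the outer one is still being
                    # handled, so it can only time out.
                    urllib.request.urlopen(
                        "http://127.0.0.1:%d/inner" % PORT, timeout=3)
                    body = b"no deadlock"
                except OSError:  # URLError / socket timeout
                    body = b"deadlock: inner request timed out"
            else:
                body = b"inner ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the demo output quiet

    # HTTPServer handles one request at a time: the analogue of a single
    # WSGI process with a single thread.
    class QuietServer(http.server.HTTPServer):
        def handle_error(self, request, client_address):
            pass  # ignore cleanup errors from the abandoned inner request

    server = QuietServer(("127.0.0.1", PORT), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    with urllib.request.urlopen("http://127.0.0.1:%d/outer" % PORT) as resp:
        print(resp.read().decode())  # deadlock: inner request timed out
    server.shutdown()

With two or more threads the inner request is served normally; with exactly one, it times out every time.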

It would be good to do some more testing with this patch.

Revision history for this message
Graham Binns (gmb) wrote :

> Deadlock potential: I don't think this is a big concern either, but there
> may be a situation where completing the processing of one HTTP request
> depends on another thread being available to handle a second request,
> like:
>
> 1) region gets request
> 2) region calls cluster to get some info
> 3) cluster calls region's HTTP interface for some reason.
> 4) deadlock, because the region can't process the cluster's request until the
> region's original request completes.
>
> I don't know if that can actually happen, and if it can, it seems like it
> would already be buggy behavior, since handling two requests of that type
> at the same time would produce a deadlock.

A note of caution here: I *think* (without checking I can't confirm) that _some_ RPC methods still use the API to report back to or get data from the region / metadataserver. I remember we had some back-and-forth discussion about this because we couldn't convert everything that was API to use RPC in time for release, so we just stuck with the celery bits and passed credentials for the API around as necessary.

So this might be more likely than you'd think.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This could also cause the web UI/API to become unresponsive during long-running API jobs, like image uploads, or when images are syncing between the region and the cluster.

Revision history for this message
Gavin Panella (allenap) wrote :

> A note of caution here: I *think* (without checking I can't confirm)
> that _some_ RPC methods still use the API to report back to or get
> data from the region / metadataserver. I remember we had some
> back-and-forth discussion about this because we couldn't convert
> everything that was API to use RPC in time for release, so we just
> stuck with the celery bits and passed credentials for the API around
> as necessary.

- Tag calculations are still reported via the API, iirc.

- Nodes report commissioning results via the API.

- Nodes report deployment complete via the API.

There's probably more.

Revision history for this message
Gavin Panella (allenap) wrote :

> This could also cause the web UI/API to become unresponsive during
> long-running API jobs, like image uploads, or when images are syncing
> between the region and the cluster.

Image syncing is going to be a killer. In fact, that's going to be a
killer right now if anyone has more than one cluster. We need to
*increase* concurrency for that.

Another way to deal with this, instead of using an explicit lock or
increasing transaction isolation, is to use SELECT FOR UPDATE. Even
Django supports it:

https://docs.djangoproject.com/en/1.6/ref/models/querysets/#select-for-update

It's not a big-hammer-fix-everything, but it's something we can
introduce without (I hope) much refactoring, breaking tests, and so on.
It's something we can do right now.
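
As a sketch of that approach for the acquire path (Node, NODE_STATUS, and NoNodesAvailable are hypothetical stand-ins, not MAAS's actual models):

    from django.db import transaction

    def acquire_node(user):
        with transaction.atomic():
            # select_for_update() appends FOR UPDATE to the SQL, so the
            # row returned here stays locked until the transaction
            # commits. A concurrent acquire blocks on this query until
            # then, re-reads, and no longer sees the node as available,
            # so it cannot allocate the same node twice.
            node = (Node.objects
                    .select_for_update()
                    .filter(status=NODE_STATUS.READY)
                    .first())
            if node is None:
                raise NoNodesAvailable()
            node.status = NODE_STATUS.ALLOCATED
            node.owner = user
            node.save()
            return node

One caveat: the blocked acquire may come back empty after the first one commits and have to retry, but it can no longer hand out the same node twice.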

Longer term, I think increasing transaction isolation is going to protect
us more comprehensively.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Thinking about this more, I think a smaller, testable fix targeted at just this instance of the problem is a better approach for now.

Revision history for this message
Christian Reis (kiko) wrote :

That's not a bad idea, but are there races in other places that we will run into with NEC and OpenStack users in the short term? For instance, mass-start, mass-stop, mass-release, etc.? What other bulk operations suffer from this issue?

Revision history for this message
Andres Rodriguez (andreserl) wrote :

I'm disapproving this branch in favor of [1] because the bug attached to this branch was caused by two changes that switched the DB from ISOLATION_LEVEL_SERIALIZABLE to ISOLATION_LEVEL_READ_COMMITTED.

https://code.launchpad.net/~andreserl/maas/fix_lp1382575/+merge/238817

review: Disapprove
Revision history for this message
Gavin Panella (allenap) wrote :

> I'm disapproving this branch in favor of [1] because the bug attached
> to this branch was caused by two changes that switched the DB from
> ISOLATION_LEVEL_SERIALIZABLE to ISOLATION_LEVEL_READ_COMMITTED.
>
> https://code.launchpad.net/~andreserl/maas/fix_lp1382575/+merge/238817

Unfortunately this approach won't work either. See my comment on that
merge proposal for an explanation.

We never actually ran with SERIALIZABLE in production; it was only ever
enabled in development. A crucial change to maas_local_settings.py in
trunk was missed when SERIALIZABLE was first introduced.

I have an uncomfortable hunch that Django was running things as
AUTOCOMMIT before I changed it to READ COMMITTED, even though we were
using the TransactionMiddleware. That's something I'll try to confirm
(or not) tomorrow.
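
For anyone wanting to reproduce the difference, here is a self-contained sketch of the double-allocation race at the two isolation levels. It assumes a local PostgreSQL you can connect to; the DSN and table are placeholders, not MAAS's schema.

    import psycopg2
    from psycopg2 import extensions

    DSN = "dbname=test"  # placeholder connection string

    def setup():
        conn = psycopg2.connect(DSN)
        with conn, conn.cursor() as cur:
            cur.execute("DROP TABLE IF EXISTS node")
            cur.execute("CREATE TABLE node (id int PRIMARY KEY, owner text)")
            cur.execute("INSERT INTO node VALUES (1, NULL)")
        conn.close()

    def race(level, name):
        a, b = psycopg2.connect(DSN), psycopg2.connect(DSN)
        a.set_isolation_level(level)
        b.set_isolation_level(level)
        ca, cb = a.cursor(), b.cursor()
        # Both transactions see node 1 as unallocated...
        ca.execute("SELECT id FROM node WHERE owner IS NULL")
        cb.execute("SELECT id FROM node WHERE owner IS NULL")
        # ...and both try to allocate it.
        ca.execute("UPDATE node SET owner = 'a' WHERE id = 1")
        a.commit()
        try:
            # Under READ COMMITTED this silently overwrites the committed
            # allocation, so both acquires "succeed": the race. Under
            # SERIALIZABLE, PostgreSQL raises a serialization failure
            # (pgcode 40001) here instead, and the caller can retry.
            cb.execute("UPDATE node SET owner = 'b' WHERE id = 1")
            b.commit()
            print(name, "-> both allocations succeeded (race)")
        except psycopg2.Error as err:
            print(name, "-> second allocation rejected:", err.pgcode)
        a.close()
        b.close()

    setup()
    race(extensions.ISOLATION_LEVEL_READ_COMMITTED, "READ COMMITTED")
    setup()
    race(extensions.ISOLATION_LEVEL_SERIALIZABLE, "SERIALIZABLE")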

Revision history for this message
Christian Reis (kiko) wrote :

Yes, that's what I had indicated to Andres via email -- I had assumed
the changes you made were done to correct spurious failures in
operation (I had thought this was made evident or intensified by the
change to the cluster RPC code, which may not be the case).

Of course, we're now in an equally bad state, because concurrent changes
can produce inconsistent results.

Let's have a call tomorrow morning to figure out what the plan should
be.

Unmerged revisions

3269. By Jason Hobbs

Only run a single wsgi process/thread for MAAS.

Preview Diff

The diff is not available at this time.