Merge lp://staging/~gary/launchpad/bug553368 into lp://staging/launchpad

Proposed by Gary Poster
Status: Merged
Approved by: Gary Poster
Approved revision: no longer in the source branch.
Merged at revision: 13209
Proposed branch: lp://staging/~gary/launchpad/bug553368
Merge into: lp://staging/launchpad
Diff against target: 283 lines (+141/-9)
5 files modified
lib/canonical/launchpad/webapp/error.py (+20/-8)
lib/canonical/launchpad/webapp/tests/test_login.py (+40/-1)
lib/lp/app/browser/configure.zcml (+9/-0)
lib/lp/app/templates/launchpad-discoveryfailure.pt (+19/-0)
lib/lp/testing/fixture.py (+53/-0)
To merge this branch: bzr merge lp://staging/~gary/launchpad/bug553368
Reviewer Review Type Date Requested Status
Robert Collins (community) Approve
Brad Crittenden (community) code Approve
Review via email: mp+64108@code.staging.launchpad.net

Commit message

[r=bac,lifeless][bug=553368] Make more friendly error page for when openid server is down; add view test fixture.

Description of the change

This branch provides a simple fix for the linked bug: on an OpenID DiscoveryError we now return a 503 and a hopefully helpful error page, and no OOPS is logged.
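The approach described above can be sketched without the Launchpad machinery. This is an illustrative, framework-free sketch only; the class and exception names (`DiscoveryError`, `DiscoveryFailureView`) and the dict-based request are stand-ins, not the actual Launchpad or Zope APIs.

```python
# Hypothetical sketch: catch the OpenID discovery failure in an error
# view and return a 503 with a friendly page, rather than letting the
# OOPS machinery record it. All names here are illustrative.

class DiscoveryError(Exception):
    """Raised when the OpenID provider cannot be reached."""


class DiscoveryFailureView:
    """Render a friendly page with a 503 Service Unavailable status."""

    def __init__(self, context, request):
        self.context = context
        self.request = request

    def __call__(self):
        # 503 tells clients the failure is temporary and server-side.
        self.request["status"] = 503  # Service Unavailable.
        return (
            "Sorry, we could not reach the OpenID server that handles "
            "your login. Please try again in a few minutes."
        )


request = {}
view = DiscoveryFailureView(DiscoveryError("discovery timed out"), request)
body = view()
```

The real branch registers such a view for the exception type in ZCML (see the `configure.zcml` change in the diff summary), so the traversal machinery picks it up automatically.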

Writing a test for this was a bit tricky because I didn't want to use a monkeypatch, and because I wanted to show that the full reported error was squashed. I ended up with a test fixture that replaces one view with another in the Zope component registry, which seems to work pretty well. I suspect it has limited usefulness, but it is general and was easy enough to factor out as a fixture, so I did.
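The view-replacement idea can be illustrated with a minimal, framework-free sketch: swap one entry in a registry for the duration of a test and restore it on cleanup. The real fixture does this against the Zope component registry; the plain dict, the `setUp`/`cleanUp` method names, and the registry keys below are illustrative assumptions, not the branch's actual code.

```python
# Minimal sketch of a view-replacement fixture: substitute a registered
# view for a test double, and guarantee the original is restored.

class ViewReplacementFixture:
    """Temporarily replace a named entry in a registry."""

    def __init__(self, registry, name, replacement):
        self.registry = registry
        self.name = name
        self.replacement = replacement
        self._original = None

    def setUp(self):
        # Remember the real view, then install the replacement.
        self._original = self.registry[self.name]
        self.registry[self.name] = self.replacement

    def cleanUp(self):
        # Put the real view back so later tests are unaffected.
        self.registry[self.name] = self._original


registry = {"+login": "RealLoginView"}
fixture = ViewReplacementFixture(registry, "+login", "FakeFailingView")
fixture.setUp()
replaced = registry["+login"]
fixture.cleanUp()
restored = registry["+login"]
```

A test would install a view that raises DiscoveryError, exercise the login flow, and rely on the cleanup to restore the real view afterwards.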

Revision history for this message
Gary Poster (gary) wrote :

lint:

./lib/canonical/launchpad/webapp/login.py
     490: E302 expected 2 blank lines, found 1

This is for a function that has a comment in front of it; that apparently confuses our linter.

./lib/canonical/launchpad/webapp/tests/test_login.py
     626: E301 expected 1 blank line, found 0
     627: E301 expected 1 blank line, found 0

This is for the inline class, which I need because I have to subclass the dynamically generated class. I could do "type" games instead if you wanted, but I thought an inline class was simpler and easier to understand.

Revision history for this message
Robert Collins (lifeless) wrote :

We probably want to log the OOPS, but show a nice error page.

Why log the OOPS? Because our SSO should - never - be down, so if it's down or misbehaving, we want to respond promptly to that.

What do you think?

The code itself looks fine, modulo this conceptual question.

review: Needs Information
Revision history for this message
Gary Poster (gary) wrote :

On Jun 9, 2011, at 10:34 PM, Robert Collins wrote:

> Review: Needs Information
> We probably want to log the OOPS, but show a nice error page.
>
> Why log the OOPS? Because our SSO should - never - be down, so if it's down or misbehaving, we want to respond promptly to that.
>
> What do you think?
>
> The code itself looks fine, modulo this conceptual question.

Yeah, I was actually wondering about that too, earlier.

On consideration, here's my opinion. The uptime of our openid server should not be our responsibility to maintain or monitor. The network availability is somewhat more arguable...but first, isn't that more of an IS responsibility? And second, aren't we still planning on accepting openid tokens from arbitrary providers Sometime Soon? In that case, again, the availability of external providers is also not our business to monitor. Given all that, I am inclined to agree with the advice in the bug, and proceed as I have done here: no OOPS.

So, tell me what you think. This is a judgement call, and I'm happy to follow your opinion.

I did a bit of research leading up to my opinion. You can read it below if you like, but I don't find it to be conclusive.

Gary

Research notes:

I ran the following on devpad, in /srv/launchpad.net-logs.

find . -mindepth 3 -maxdepth 3 -name '2011-06-0?' -exec grep -lr 'DiscoveryFailure' {} \;

FWIW, the find statement without the -exec does show that we are looking in the oops directories of these machines:

./production/gac
./production/soybean
./production/mizuho
./production/chaenomeles
./production/wampee
./staging/asuka
./scripts/loganberry
./edge/soybean
./edge/wampee
./qastaging/asuka
./db/hackberry

This gives 286 OOPSes within the nine-day period. They are all on /staging/asuka. The search takes a little while to run, so I put them here if you are interested: https://pastebin.canonical.com/48379/. To sum, though:

46 happened on the 4th, between 2011-06-04T08:21:34.987245+00:00 and 2011-06-04T22:41:30.293582+00:00.

159 happened on the 5th, between 2011-06-05T01:38:14.127379+00:00 and 2011-06-05T23:10:07.419856+00:00.

81 happened on the 6th, between 2011-06-06T00:03:36.055124+00:00 and 2011-06-06T10:05:45.665406+00:00.

That's all of them.

I think we have a separate staging openid server, and I bet it has lower quality-of-service expectations. Maybe that's the cause. Are these OOPSes actionable? The zero-oops policy implies they should be. If so, what is the action? I suppose we could go ask the LOSAs about them, and see if they could explain it with some network change for staging? Or should we only squelch the OOPSes on staging and qastaging?

That said, the original OOPS for the bug was actually in production, back in August 2010 (https://lp-oops.canonical.com/oops.py/?oopsid=1691L1546). It would be interesting to do a search and see how frequent these problems are over a wider time period, but I didn't really want to consume the devpad resources necessary.

Revision history for this message
Robert Collins (lifeless) wrote :

On Fri, Jun 10, 2011 at 4:57 PM, Gary Poster <email address hidden> wrote:
>
> On Jun 9, 2011, at 10:34 PM, Robert Collins wrote:
>
>> Review: Needs Information
>> We probably want to log the OOPS, but show a nice error page.
>>
>> Why log the OOPS? Because our SSO should - never - be down, so if it's down or misbehaving, we want to respond promptly to that.
>>
>> What do you think?
>>
>> The code itself looks fine, modulo this conceptual question.
>
> Yeah, I was actually wondering about that too, earlier.
>
> On consideration, here's my opinion.  The uptime of our openid server should not be our responsibility to maintain or monitor.  The network availability is somewhat more arguable...but first, isn't that more of an IS responsibility?  And second, aren't we still planning on accepting openid tokens from arbitrary providers Sometime Soon?  In that case, again, the availability of external providers is also not our business to monitor.  Given all that, I am inclined to agree with the advice in the bug, and proceed as I have done here: no OOPS.

AIUI the zero-oops-policy only applies to production, at least for now.

> So, tell me what you think.  This is a judgement call, and I'm happy to follow your opinion.
>
> I did a bit of research leading up to my opinion.  You can read it below if you like, but I don't find it to be conclusive.

The research is interesting, thanks. On consideration I'd prefer us to
log these OOPSes: my reasoning is in a few parts...

There are many things that can go wrong with SSO which are not our
responsibility, but an SSO problem affects our users; so in terms of
delivering a high-quality service we need to measure and respond to
those issues. Concretely, right now SSO login times are degraded,
which makes logging into Launchpad slow - we know this because of
+login timeouts - and the ISD-run SSO service has only a small subset
of the operational polish we do: we're in a good position to detect
regressions and issues; they are not.

On top of that, there are a number of things that would be our
responsibility, such as networking glitches on our servers or badly
configured host firewalls, which would not show up on SSO uptime
reports but would impact our service levels.

In the future, when we support arbitrary openid providers, we
probably won't want OOPSes for non-Canonical openid providers, but
that's something to tackle when we tackle that bug in general.

-Rob

Revision history for this message
Robert Collins (lifeless) wrote :

On Fri, Jun 10, 2011 at 9:01 PM, Robert Collins
<email address hidden> wrote:
> On Fri, Jun 10, 2011 at 4:57 PM, Gary Poster <email address hidden> wrote:
>> On consideration, here's my opinion.  The uptime of our openid server should not be our ...
> The research is interesting, thanks. On consideration I'd prefer us to

I didn't mean to shadow your phrasing there - oops!

Revision history for this message
Gary Poster (gary) wrote :

I can buy that, Robert. Thanks.

I have updated the branch to simply deliver a nicer error page. I've also updated the associated bug with your comments.

Revision history for this message
Brad Crittenden (bac) wrote :

Hi Gary,

Thanks for working through the considerations for this branch with Robert. I agree with the outcome and reasoning: OOPS now for our (Canonical's) SSO service problems and address the issue of other SSOs when the time comes.

<rant>
I know you're just following existing convention but I'd much rather see:

response_code = httplib.SERVICE_UNAVAILABLE

than

response_code = 503 # Service Unavailable.

I wonder why we never adopted the use of the symbols? Because the codes are ubiquitous and everyone knows them? That can't be it, because I found errors in reviews of Leonard's code where he wrote the wrong number - and if he couldn't get them right, what chance do the rest of us have?

</rant>
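For what it's worth, the symbolic spelling Brad asks for survives into modern Python: the Python 2 `httplib.SERVICE_UNAVAILABLE` constant became `http.HTTPStatus` in Python 3. A quick sketch (standard library only):

```python
# HTTPStatus members are IntEnums: they compare equal to the bare
# integer code but carry a readable name and phrase, which avoids
# transposed-digit mistakes like writing 504 for "Service Unavailable".
from http import HTTPStatus

response_code = HTTPStatus.SERVICE_UNAVAILABLE
print(int(response_code), response_code.phrase)  # → 503 Service Unavailable
```

Because the members are integers, `response_code == 503` still holds, so existing numeric comparisons keep working while the source reads as intent rather than as a magic number.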

How about expanding:

You can also join <a href="irc://irc.freenode.net/launchpad">the #launchpad IRC support channel on irc.freenode.net</a> for further assistance.

to

You can also join <a href="irc://irc.freenode.net/launchpad">the #launchpad IRC support channel on irc.freenode.net</a> to ask for further assistance.

It is minor but indicates they have to go there and ask.

Can "This will not work with the AppServerLayer." be checked and enforced nicely?

review: Approve (code)
Revision history for this message
Robert Collins (lifeless) wrote :

Thanks for reworking this Gary, much appreciated. I'm now totally fine with what the branch does; I agree with Brad's code review comments too FWIW.

review: Approve
