Code review comment for lp://staging/~gary/launchpad/bug553368

Gary Poster (gary) wrote:

On Jun 9, 2011, at 10:34 PM, Robert Collins wrote:

> Review: Needs Information
> We probably want to log the OOPS, but show a nice error page.
>
> Why log the OOPS? Because our SSO should - never - be down, so if it's down or misbehaving, we want to respond promptly to that.
>
> What do you think?
>
> The code itself looks fine, modulo this conceptual question.

Yeah, I was actually wondering about that earlier, too.

On consideration, here's my opinion. The uptime of our OpenID server should not be our responsibility to maintain or monitor. The network availability is somewhat more arguable... but first, isn't that more of an IS responsibility? And second, aren't we still planning on accepting OpenID tokens from arbitrary providers Sometime Soon? In that case, again, the availability of external providers is also not our business to monitor. Given all that, I am inclined to agree with the advice in the bug, and proceed as I have done here: no OOPS.
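
For concreteness, here is a minimal sketch of the no-OOPS approach in terms of python-openid's API. It is not the code in this branch; the function name, return convention, and error message are mine.

    from openid.consumer.consumer import Consumer
    from openid.yadis.discover import DiscoveryFailure

    PROVIDER_DOWN_MESSAGE = (
        "The OpenID provider could not be contacted. "
        "Please try again in a few minutes.")

    def begin_openid_login(session, store, openid_identifier, realm, return_to):
        """Start OpenID discovery, degrading gracefully if the provider is down."""
        consumer = Consumer(session, store)
        try:
            auth_request = consumer.begin(openid_identifier)
        except DiscoveryFailure:
            # No OOPS: the provider's availability is not ours to monitor,
            # so the view just shows a friendly error page and lets the
            # user retry.
            return None, PROVIDER_DOWN_MESSAGE
        return auth_request.redirectURL(realm, return_to), None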

So, tell me what you think. This is a judgement call, and I'm happy to follow your opinion.

I did a bit of research leading up to my opinion. You can read it below if you like, but I don't find it to be conclusive.

Gary

Research notes:

I ran the following on devpad, in /srv/launchpad.net-logs.

find . -mindepth 3 -maxdepth 3 -name '2011-06-0?' -exec grep -lr 'DiscoveryFailure' {} \;

FWIW, the find statement without the -exec does show that we are looking in the OOPS directories of these machines:

./production/gac
./production/soybean
./production/mizuho
./production/chaenomeles
./production/wampee
./staging/asuka
./scripts/loganberry
./edge/soybean
./edge/wampee
./qastaging/asuka
./db/hackberry

This gives 286 OOPSes within the nine-day period, all of them under ./staging/asuka. The search takes a little while to run, so I put the full list here if you are interested: https://pastebin.canonical.com/48379/. To sum up, though:

46 happened on the 4th, between 2011-06-04T08:21:34.987245+00:00 and 2011-06-04T22:41:30.293582+00:00.

159 happened on the 5th, between 2011-06-05T01:38:14.127379+00:00 and 2011-06-05T23:10:07.419856+00:00.

81 happened on the 6th, between 2011-06-06T00:03:36.055124+00:00 and 2011-06-06T10:05:45.665406+00:00.

That's all of them.
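
If we ever want to watch this over a longer period, bucketing the matches by day is cheap once the grep output is saved to a file. A rough sketch (the oops_matches.txt filename is mine, and it assumes each matching OOPS sits under a .../YYYY-MM-DD/ directory, as in the paths above):

    import re
    from collections import Counter

    DATE_DIR = re.compile(r'(\d{4}-\d{2}-\d{2})')

    counts = Counter()
    for path in open('oops_matches.txt'):
        match = DATE_DIR.search(path)
        if match:
            counts[match.group(1)] += 1

    for day, total in sorted(counts.items()):
        print('%s: %s' % (day, total))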

I think we have a separate staging OpenID server, and I bet it has lower quality-of-service expectations. Maybe that's the cause. Are these OOPSes actionable? The zero-OOPS policy implies they should be. If so, what is the action? I suppose we could go ask the LOSAs about them and see whether they can explain it with some network change for staging. Or should we only squelch the OOPSes on staging and qastaging?
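
If we went the squelch-on-staging route, it would presumably come down to a small guard on whether to record the OOPS at all. A purely hypothetical sketch (the instance names and the idea of keying off the instance name are mine, not Launchpad's real config API):

    # Hypothetical: squelch DiscoveryFailure OOPSes only on staging-style
    # instances, whose OpenID server has weaker availability expectations.
    SQUELCHED_INSTANCES = frozenset(['staging', 'qastaging'])

    def should_record_discovery_failure_oops(instance_name):
        return instance_name not in SQUELCHED_INSTANCES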

That said, the original OOPS for the bug was actually in production, back in August 2010 (https://lp-oops.canonical.com/oops.py/?oopsid=1691L1546). It would be interesting to do a search and see how frequent these problems are over a wider time period, but I didn't really want to consume the devpad resources that would require.
