Code review comment for lp://staging/~gary/launchpad/bug553368

Robert Collins (lifeless) wrote :

On Fri, Jun 10, 2011 at 4:57 PM, Gary Poster <email address hidden> wrote:
>
> On Jun 9, 2011, at 10:34 PM, Robert Collins wrote:
>
>> Review: Needs Information
>> We probably want to log the OOPS, but show a nice error page.
>>
>> Why log the OOPS? Because our SSO should - never - be down, so if it's down or misbehaving, we want to respond promptly to that.
>>
>> What do you think?
>>
>> The code itself looks fine, modulo this conceptual question.
>
> Yeah, I was actually wondering about that too, earlier.
>
> On consideration, here's my opinion.  The uptime of our openid server should not be our responsibility to maintain or monitor.  The network availability is somewhat more arguable...but first, isn't that more of an IS responsibility?  And second, aren't we still planning on accepting openid tokens from arbitrary providers Sometime Soon?  In that case, again, the availability of external providers is also not our business to monitor.  Given all that, I am inclined to agree with the advice in the bug, and proceed as I have done here: no OOPS.

AIUI the zero-oops-policy only applies to production, at least for now.

> So, tell me what you think.  This is a judgement call, and I'm happy to follow your opinion.
>
> I did a bit of research leading up to my opinion.  You can read it below if you like, but I don't find it to be conclusive.

The research is interesting, thanks. On consideration I'd prefer us to
log these OOPSes: my reasoning is in a few parts...

There are many things that can go wrong with SSO which are not our
responsibility, but an SSO problem affects our users; so in terms of
delivering a high quality service we need to measure and respond to
those issues. Concretely right now SSO login times are degraded, which
makes logging into Launchpad slow - we know this because of +login
timeouts - and the ISD run SSO service has only a small subset of the
operational polish we do: we're in a good position to detect
regressions and issues, they are not.

On top of that there are a number of things that would be our
responsibility, such as networking glitches on our servers, or badly
configured host firewalls, which would not show up on SSO uptime
reports but would impact our service levels.

In the future when we support arbitrary openid providers we probably
won't want OOPSes for non-canonical openid providers, but that's
something to tackle when we tackle that bug in general.

-Rob
