Merge lp://staging/~danilo/launchpad/bug-408718 into lp://staging/launchpad

Proposed by Данило Шеган
Status: Merged
Merged at revision: not available
Proposed branch: lp://staging/~danilo/launchpad/bug-408718
Merge into: lp://staging/launchpad
Diff against target: None lines
To merge this branch: bzr merge lp://staging/~danilo/launchpad/bug-408718
Reviewer Review Type Date Requested Status
Henning Eggers (community) Approve
Review via email: mp+11682@code.staging.launchpad.net
To post a comment you must log in.
Revision history for this message
Данило Шеган (danilo) wrote :

= Bug #408717 =

Note: this is a cherrypick candidate so is kept as simple as possible.
It's critical for Launchpad Translations (poimport is blocked on this).

We had super-fast-imports query (which is a huge optimization for our
imports, resulting in 50x faster imports on average) cause Postgres to
eat up diskspace with repeated requests for more temp space. That
happened a month ago (when bug #408717 was filed and considered a
one-off), and it happened again on Friday morning. We all suspect a
Postgres bug, but that doesn't help us out.

However, since this is only an optimization, we can avoid this query
when we do hit a problem. So, how do we know when we'll hit a problem?

Simple: a timeout. 5 minute timeout (300s) should be enough to handle
all real query runs which do not exhibit the bad behaviour. If this
query takes more than 5 minutes, it's probably hitting a PG bug, so
let's kill it. Note that PG was completely operational when this
happened, so it should be able to fulfill this as well.

Note that exactly the same query that failed when it was ran manually
caused no issues: so, it's not a particular query that fails, it's a
particular set of circumstances that's out of our control (i.e.
something like a PG bug) that make the query fail.

= Pre-implementation notes =

Discussed this with Stuart: he thinks it's a great idea :)

He advised me about what I need to do to set the statement timeout and
how to best test it.

= Proposed fix =

So, introduce a statement timeout right around the potentially breaking
query. If it doesn't complete in the allotted time, kill it and go the
slow route for this particular import.

poimport.statement_timeout option allows one to set it to 'timeout'
which will trigger the timeout on purpose. This is used in a test.

= Tests =

bin/test -vvt poimport -t superfastimports

= Demo & QA =

Test poimport thoroughly.

Revision history for this message
Henning Eggers (henninge) wrote :
Download full text (5.5 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Am 14.09.2009 11:35, Данило Шеган schrieb:
> = Bug #408717 =
>
> Note: this is a cherrypick candidate so is kept as simple as possible.
> It's critical for Launchpad Translations (poimport is blocked on this).

Thanks for taking this on so quickly and finding a solution. If Stuart
thinks it's a great idea, who am I to argue? ;-)

 review approve

Just two comments below.

> === modified file 'lib/canonical/config/schema-lazr.conf'
> --- lib/canonical/config/schema-lazr.conf 2009-09-08 15:16:33 +0000
> +++ lib/canonical/config/schema-lazr.conf 2009-09-14 09:12:58 +0000
> @@ -1429,6 +1429,10 @@
> storm_cache: generational
> storm_cache_size: 500
>
> +# Statement timeout (in seconds), limited to super-fast-imports query.
> +# Set to 'timeout' to make it timeout every time (for tests).
> +statement_timeout: 300
> +
> [processmail]
> # The database user which will be used by this process.
> # datatype: string
>
> === modified file 'lib/lp/translations/utilities/tests/test_superfastimports.py'
> --- lib/lp/translations/utilities/tests/test_superfastimports.py 2009-07-17 02:25:09 +0000
> +++ lib/lp/translations/utilities/tests/test_superfastimports.py 2009-09-14 09:12:58 +0000
> @@ -5,8 +5,9 @@
>
> import unittest
>
> +from canonical.config import config
> from canonical.testing import ZopelessDatabaseLayer
> -from lp.testing import TestCaseWithFactory, verifyObject

Thanks for cleaning this.

> +from lp.testing import TestCaseWithFactory
> from lp.translations.utilities.translation_import import (
> ExistingPOFileInDatabase)
> from lp.translations.utilities.translation_common_format import (
> @@ -67,6 +68,27 @@
> self.assertFalse(cached_file.isAlreadyImportedTheSame(message_data))
> self.assertFalse(cached_file.isAlreadyTranslatedTheSame(message_data))
>
> + def test_query_timeout(self):
> + # Test that super-fast-imports doesn't cache anything when it hits
> + # a timeout.
> +
> + # Override the config option to force a timeout error.
> + new_config = ("""
> + [poimport]
> + statement_timeout: timeout
> + """)
> + config.push('super_fast_timeout', new_config)
> +
> + # Add a message that would otherwise be cached (see other tests).
> + current_message = self.factory.makeTranslationMessage(
> + pofile=self.pofile, is_imported=False)
> + message_data = self.getTranslationMessageData(current_message)
> + cached_file = ExistingPOFileInDatabase(self.pofile)
> + self.assertFalse(cached_file.isAlreadyTranslatedTheSame(message_data))
> +
> + # Restore the old configuration.
> + config.pop('super_fast_timeout')
> +
>
> def test_suite():
> return unittest.TestLoader().loadTestsFromName(__name__)
>
> === modified file 'lib/lp/translations/utilities/translation_import.py'
> --- lib/lp/translations/utilities/translation_import.py 2009-08-05 13:11:00 +0000
> +++ lib/lp/translations/utilities/translation_import.py 2009-09-14 09:12:58 +0000
> @@ -18,8 +18,13 @@
>
> from operator import attrgetter
>
> +import transaction
> +
> +from...

Read more...

review: Approve
Revision history for this message
Данило Шеган (danilo) wrote :

Hi Henning,

Thanks for the review. For documentation purposes, I am writing up what
we've discussed on IRC as well.

У пон, 14. 09 2009. у 10:24 +0000, Henning Eggers пише:
> > + except TimeoutError:
> > + # Restart the transaction and return empty SelectResults.
> > + transaction.abort()
> > + transaction.begin()
> > + cur.execute("SELECT 1 WHERE 1=0")
>
> As you mentioned on IRC, you should reset the statement_timeout back to
> what it was (if that is not done by abort() ?). Please extend the test
> to check that, too.

I've decided not to (and I don't know if transaction.abort() resets
statement_timeout value): this script run will finish in at most 10
minutes, so having a safeguard while it runs is not too bad (considering
how we just almost killed the DB server).

> Also, could you consider using EmptyResultSet instead of such useless
> queries, please? ;)

As discussed, this is actually using SQLObject, and since I am going to
ask for a CP, I am not converting this query to Storm. With
EmptyResultSet the following would not be possible:

> > rows = cur.fetchall()

And I'd have to worry a bit more about breaking something else.

Cheers,
Danilo

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/canonical/config/schema-lazr.conf'
2--- lib/canonical/config/schema-lazr.conf 2009-09-08 15:16:33 +0000
3+++ lib/canonical/config/schema-lazr.conf 2009-09-14 09:12:58 +0000
4@@ -1429,6 +1429,10 @@
5 storm_cache: generational
6 storm_cache_size: 500
7
8+# Statement timeout (in seconds), limited to super-fast-imports query.
9+# Set to 'timeout' to make it timeout every time (for tests).
10+statement_timeout: 300
11+
12 [processmail]
13 # The database user which will be used by this process.
14 # datatype: string
15
16=== modified file 'lib/lp/translations/utilities/tests/test_superfastimports.py'
17--- lib/lp/translations/utilities/tests/test_superfastimports.py 2009-07-17 02:25:09 +0000
18+++ lib/lp/translations/utilities/tests/test_superfastimports.py 2009-09-14 09:12:58 +0000
19@@ -5,8 +5,9 @@
20
21 import unittest
22
23+from canonical.config import config
24 from canonical.testing import ZopelessDatabaseLayer
25-from lp.testing import TestCaseWithFactory, verifyObject
26+from lp.testing import TestCaseWithFactory
27 from lp.translations.utilities.translation_import import (
28 ExistingPOFileInDatabase)
29 from lp.translations.utilities.translation_common_format import (
30@@ -67,6 +68,27 @@
31 self.assertFalse(cached_file.isAlreadyImportedTheSame(message_data))
32 self.assertFalse(cached_file.isAlreadyTranslatedTheSame(message_data))
33
34+ def test_query_timeout(self):
35+ # Test that super-fast-imports doesn't cache anything when it hits
36+ # a timeout.
37+
38+ # Override the config option to force a timeout error.
39+ new_config = ("""
40+ [poimport]
41+ statement_timeout: timeout
42+ """)
43+ config.push('super_fast_timeout', new_config)
44+
45+ # Add a message that would otherwise be cached (see other tests).
46+ current_message = self.factory.makeTranslationMessage(
47+ pofile=self.pofile, is_imported=False)
48+ message_data = self.getTranslationMessageData(current_message)
49+ cached_file = ExistingPOFileInDatabase(self.pofile)
50+ self.assertFalse(cached_file.isAlreadyTranslatedTheSame(message_data))
51+
52+ # Restore the old configuration.
53+ config.pop('super_fast_timeout')
54+
55
56 def test_suite():
57 return unittest.TestLoader().loadTestsFromName(__name__)
58
59=== modified file 'lib/lp/translations/utilities/translation_import.py'
60--- lib/lp/translations/utilities/translation_import.py 2009-08-05 13:11:00 +0000
61+++ lib/lp/translations/utilities/translation_import.py 2009-09-14 09:12:58 +0000
62@@ -18,8 +18,13 @@
63
64 from operator import attrgetter
65
66+import transaction
67+
68+from canonical.config import config
69 from canonical.database.sqlbase import cursor, quote
70
71+from storm.exceptions import TimeoutError
72+
73 from canonical.cachedproperty import cachedproperty
74 from lp.registry.interfaces.person import (
75 IPersonSet,
76@@ -119,7 +124,6 @@
77 # Pre-fill self.messages and self.imported with data.
78 self._fetchDBRows()
79
80-
81 def _fetchDBRows(self):
82 msgstr_joins = [
83 "LEFT OUTER JOIN POTranslation pt%d "
84@@ -168,8 +172,32 @@
85 TranslationTemplateItem.sequence,
86 TranslationMessage.potemplate NULLS LAST
87 ''' % substitutions
88+
89 cur = cursor()
90- cur.execute(sql)
91+ try:
92+ # XXX 2009-09-14 DaniloSegan (bug #408718):
93+ # this statement causes postgres to eat the diskspace
94+ # from time to time. Let's wrap it up in a timeout.
95+ timeout = config.poimport.statement_timeout
96+
97+ # We have to commit what we've got so far or we'll lose
98+ # it when we hit TimeoutError.
99+ transaction.commit()
100+
101+ if timeout == 'timeout':
102+ # This is used in tests.
103+ query = "SELECT pg_sleep(2)"
104+ timeout = '1s'
105+ else:
106+ timeout = 1000 * int(timeout)
107+ query = sql
108+ cur.execute("SET statement_timeout to %s" % quote(timeout))
109+ cur.execute(query)
110+ except TimeoutError:
111+ # Restart the transaction and return empty SelectResults.
112+ transaction.abort()
113+ transaction.begin()
114+ cur.execute("SELECT 1 WHERE 1=0")
115 rows = cur.fetchall()
116
117 assert TranslationConstants.MAX_PLURAL_FORMS == 6, (
118