Code review comment for lp://staging/~stub/charms/precise/postgresql/enable-integration-tests

Stuart Bishop (stub) wrote:

On 7 March 2015 at 06:36, Adam Israel <email address hidden> wrote:

> There were several tests failing that I've traced to problems in the test infrastructure. I've landed a patch to bundletester, which was running tests more than once, and I'll be sending patches to amulet, juju-deployer, and juju-wait (changing max_workers=6 to max_workers=1 fixed the lock timeout error).

OK. The multiple workers are of course there to speed things up, but
correctness comes first. I'll sort out juju-wait shortly and kick off
the package build.
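For reference, the contention is easy to reproduce in miniature: with a single worker the shared lock is never contended, which is presumably why max_workers=1 makes the timeout go away. A minimal sketch of the idea (names are illustrative, not juju-wait's actual code):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()
order = []

def check_unit(name):
    # With max_workers=1 the pool runs these calls serially, so the
    # lock is never contended and a lock timeout cannot occur.
    with lock:
        order.append(name)
    return name

with ThreadPoolExecutor(max_workers=1) as pool:
    results = list(pool.map(check_unit, ['postgresql/0', 'postgresql/1']))
```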

> With the above patches implemented locally, I've been able to run the 9.1 integration test with only one failure, that looks like it's ultimately erroneous:
>
> test_syslog initially failed on the deploy command. This line:
>
> self.deployment.add('rsyslog', 'cs:rsyslog', 2)
>
> should read:
>
> self.deployment.add('rsyslog', 'rsyslog', 2)
>
> "cs:rsyslog" returns a 404 - see https://manage.jujucharms.com/api/3/charm/trusty/cs:rsyslog

Aha... when to use the prefix and when not to has always confused me.
I think this has worked in the past.
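If it helps, a tiny normaliser would make the intent explicit either way. This is a hypothetical helper, not something amulet provides:

```python
def bare_charm_name(name):
    # Hypothetical helper: the charm store API expects a bare name
    # here ('rsyslog'), so strip an explicit 'cs:' prefix if present.
    return name[len('cs:'):] if name.startswith('cs:') else name
```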

> With that fixed, the test runs but fails its assert checks:
> _StringException: Traceback (most recent call last):
> File "/charms/trusty/postgresql/test.py", line 697, in test_syslog
> self.failUnless('master {}'.format(token) in out)
> File "/usr/lib/python2.7/unittest/case.py", line 614, in deprecated_func
> return original_func(*args, **kwargs)
> File "/usr/lib/python2.7/unittest/case.py", line 424, in assertTrue
> raise self.failureException(msg)
> AssertionError: False is not true
>
> I think the failure is spurious, though. I ssh'd into each rsyslog unit after the failure, ran `tail -n 100 /var/log/syslog | grep` for each string, and found them all.
>
> While running test_explicit_database, I ran into a case where the replication-relation-broken hook failed on tearDown, and reset_environment wasn't cycling through to resolve the failures. The test passed when I flipped the force flag in amuletfixture.py, but that might not be what you intended.

I consider this a bug, which is why the force flag isn't set.
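For clarity, the behaviour I intend is roughly this (a sketch of the semantics, not the actual amuletfixture.py code):

```python
def reset_environment(failed_units, resolve, force=False):
    # Sketch of the intended semantics: without force, a hook failure
    # during teardown is a charm bug and should abort loudly; with
    # force, each failed unit is marked resolved so teardown proceeds.
    if failed_units and not force:
        raise RuntimeError(
            'hook failed during reset: %s' % ', '.join(failed_units))
    for unit in failed_units:
        resolve(unit)
```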

> The stack trace of the failure:
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 Traceback (most recent call last):
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/replication-relation-broken", line 2701, in <module>
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 hooks.execute(sys.argv)
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/charmhelpers/core/hookenv.py", line 490, in execute
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 self._hooks[hook_name]()
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/replication-relation-broken", line 2263, in replication_relation_broken
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 config_changed()
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/replication-relation-broken", line 1140, in config_changed
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 generate_postgresql_hba(postgresql_hba)
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/replication-relation-broken", line 521, in generate_postgresql_hba
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 unit=hookenv.local_unit(), rid=relid)
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/charmhelpers/core/hookenv.py", line 44, in wrapper
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 res = func(*args, **kwargs)
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/charmhelpers/core/hookenv.py", line 289, in relation_get
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 return json.loads(subprocess.check_output(_args))
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 raise CalledProcessError(retcode, cmd, output=output)
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 subprocess.CalledProcessError: Command '['relation-get', '--format=json', '-r', u'db:11', '-', 'postgresql/0']' returned non-zero exit status 1

My first guess is that this is a race condition: the db relation was
torn down before the replication-relation-broken hook was invoked, and
the hook then attempted to introspect the db relation and failed. I'll
need to make the horrible mess that is generate_postgresql_hba smarter,
or somehow avoid calling it entirely.
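One likely shape of the fix is to treat a failing relation-get as "relation gone" rather than letting the CalledProcessError escape. A sketch only; the run parameter is injected here purely for illustration:

```python
import json
import subprocess

def safe_relation_get(rid, unit, run=subprocess.check_output):
    # Sketch: `relation-get` exits non-zero once the relation has been
    # torn down, so translate that into "no data" instead of crashing
    # the replication-relation-broken hook.
    cmd = ['relation-get', '--format=json', '-r', rid, '-', unit]
    try:
        return json.loads(run(cmd))
    except subprocess.CalledProcessError:
        return None
```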

> Other changes:
>
> - increased the default timeout in amuletfixture.py from 1200 to 3600 (may not be necessary)
>
> - in test.py:test_failover, I changed three instances of self.deployment.wait() to self.deployment.wait(timeout=self.timeout)
>
> Each unit test took around an hour to run. I wonder if there's anything to be done to shorten that without sacrificing the completeness of the unit tests. Perhaps limit calls to `apt-get update` in `hooks/hooks.py`; it's called twice, on line 32 and again on line 1848. If one of those calls could be eliminated, that would probably shave a few minutes off the test time.

An hour is ridiculous :-( I lost the ability to reuse machines when I
switched to amulet (correct, but slower). But still, provisioning 4
machines, doing the apt dance on them twice, running a dozen hooks...
if that takes an hour either the cloud is horrible or there are some
other delays still buried in there. Or do you mean a suite of tests,
such as the PG91Tests collection?

I do want to optimize the tests, but that will be quite an invasive
change. I want to get the existing suite green before any major
refactoring.
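On the apt point specifically: memoising the update is cheap and safe within a single hook invocation. A sketch of the idea, not the actual hooks.py code:

```python
import subprocess

_apt_updated = False

def apt_update(run=subprocess.check_call):
    # Run `apt-get update` at most once per hook invocation, so two
    # call sites in hooks.py don't both pay the network round trip.
    global _apt_updated
    if not _apt_updated:
        run(['apt-get', 'update'])
        _apt_updated = True
```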

> With the above few fixes, plus the merges I need to land in the tools, I think we'll be in good shape to land these changes. Thanks again for all of your work and patience on this!

No, thank you.

--
Stuart Bishop <email address hidden>
