On 7 March 2015 at 06:36, Adam Israel <email address hidden> wrote:
> There were several tests failing that I've determined to be failures in the test infrastructure. I've landed a patch to bundletester, which was running tests more than once. I'll be sending patches to amulet, juju-deployer, and juju-wait (the change from max_workers=6 to max_workers=1 fixed the lock timeout error).
Ok. Multiple workers are of course there to speed things up, but
correctness is better. I'll sort juju-wait shortly and kick off the
package build.
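For anyone following along, the juju-wait change is presumably just of
this shape (a sketch, assuming it drives concurrent.futures; the helper
and unit names are illustrative, not the real juju-wait internals):

    from concurrent.futures import ThreadPoolExecutor

    def check_unit(unit):
        """Stand-in for whatever juju-wait polls per unit (hypothetical)."""
        return unit, 'started'

    units = ['postgresql/0', 'postgresql/1']  # illustrative

    # Serialising the checks avoids the lock timeout Adam hit;
    # slower than max_workers=6, but correctness beats speed here.
    with ThreadPoolExecutor(max_workers=1) as pool:
        statuses = list(pool.map(check_unit, units))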
> With the above patches implemented locally, I've been able to run the 9.1 integration test with only one failure, which looks like it's ultimately erroneous:
>
> test_syslog initially failed on the deploy command. This line:
>
> self.deployment.add('rsyslog', 'cs:rsyslog', 2)
>
> should read:
>
> self.deployment.add('rsyslog', 'rsyslog', 2)
>
> "cs:rsyslog" returns a 404 - see https://manage.jujucharms.com/api/3/charm/trusty/cs:rsyslog
Aha... when to use prefixes or not has always confused me. I think
this has worked in the past.
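My current understanding, which may well be wrong, is that a bare charm
name lets juju-deployer fill in the series, while a 'cs:' URL is passed
to the store verbatim, so the series-less 'cs:rsyslog' 404s:

    # Bare name: deployer resolves the series for us (my assumption).
    self.deployment.add('rsyslog', 'rsyslog', 2)  # -> cs:trusty/rsyslog

    # A prefixed form needs the series spelled out too (assumption);
    # plain 'cs:rsyslog' is what triggered the 404 above.
    self.deployment.add('rsyslog', 'cs:trusty/rsyslog', 2)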
> With that fixed, the test runs but fails its assert checks:
> _StringException: Traceback (most recent call last):
> File "/charms/trusty/postgresql/test.py", line 697, in test_syslog
> self.failUnless('master {}'.format(token) in out)
> File "/usr/lib/python2.7/unittest/case.py", line 614, in deprecated_func
> return original_func(*args, **kwargs)
> File "/usr/lib/python2.7/unittest/case.py", line 424, in assertTrue
> raise self.failureException(msg)
> AssertionError: False is not true
>
> I think the failure is wrong, though. I ssh'd into each rsyslog unit after the failure and `tail -n 100 /var/log/syslog | grep`ped for each string and found it.
>
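If the lines really are there shortly afterwards, the assert is
probably just racing syslog propagation; polling would make the check
robust. A sketch (wait_for_log is a hypothetical helper, not in test.py
today):

    import time

    def wait_for_log(unit, needle, timeout=60, interval=5):
        """Poll a unit's syslog until `needle` shows up or we give up."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            # amulet sentries return (stdout, exit_code) from run().
            out, _ = unit.run('tail -n 100 /var/log/syslog')
            if needle in out:
                return True
            time.sleep(interval)
        return False

    # e.g. self.failUnless(wait_for_log(unit, 'master {}'.format(token)))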
> While running test_explicit_database, I ran into a case where the replication-relation-broken hook failed on tearDown, and reset_environment wasn't cycling through to resolve failures. The test passed when I flipped the force flag in amuletfixture.py, but that might not be what you intended.
I consider this a bug, which is why the force flag isn't set.
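For context, the intent of the reset loop is roughly the following;
forcing past hook failures would hide real bugs like this one (a
simplified sketch of the idea, not the actual amuletfixture.py code):

    import subprocess
    import time

    def resolve_failed_units(list_failed, attempts=3):
        """Nudge failed units with `juju resolved` until the env settles.
        list_failed() is a hypothetical callable returning unit names."""
        for _ in range(attempts):
            failed = list_failed()
            if not failed:
                return True
            for unit in failed:
                # --retry re-runs the failed hook rather than skipping it.
                subprocess.call(['juju', 'resolved', '--retry', unit])
            time.sleep(10)
        return False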
> The stack trace of the failure:
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 Traceback (most recent call last):
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/replication-relation-broken", line 2701, in <module>
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 hooks.execute(sys.argv)
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/charmhelpers/core/hookenv.py", line 490, in execute
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 self._hooks[hook_name]()
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/replication-relation-broken", line 2263, in replication_relation_broken
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 config_changed()
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/replication-relation-broken", line 1140, in config_changed
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 generate_postgresql_hba(postgresql_hba)
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/replication-relation-broken", line 521, in generate_postgresql_hba
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 unit=hookenv.local_unit(), rid=relid)
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/charmhelpers/core/hookenv.py", line 44, in wrapper
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 res = func(*args, **kwargs)
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/var/lib/juju/agents/unit-postgresql-0/charm/hooks/charmhelpers/core/hookenv.py", line 289, in relation_get
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 return json.loads(subprocess.check_output(_args))
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 raise CalledProcessError(retcode, cmd, output=output)
> 2015-03-26 15:17:01 INFO unit.postgresql/0.replication-relation-broken logger.go:40 subprocess.CalledProcessError: Command '['relation-get', '--format=json', '-r', u'db:11', '-', 'postgresql/0']' returned non-zero exit status 1
My first guess is that this is a race condition. The db relation was
torn down before the replication-relation-broken hook was invoked, and
this hook attempted to introspect the db relation and failed. I'll
need to make the horrible mess that is generate_postgresql_hba smarter
or somehow avoid calling it entirely.
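The smallest fix is probably to treat a relation that vanishes mid-hook
as having no data instead of letting relation-get explode; something
like this inside generate_postgresql_hba (a sketch; the real fix may
instead be to avoid the call entirely during -broken hooks):

    from subprocess import CalledProcessError

    from charmhelpers.core import hookenv

    def relation_get_or_empty(unit, rid):
        """relation-get exits non-zero while a relation is being torn
        down; treat that the same as an empty settings dict (sketch)."""
        try:
            return hookenv.relation_get(unit=unit, rid=rid)
        except CalledProcessError:
            hookenv.log('relation {} vanished mid-hook; skipping'.format(rid))
            return {}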
> Other changes:
>
> - increased the default timeout in amuletfixture.py from 1200 to 3600 (may not be necessary)
>
> - in test.py:test_failover, I changed three instances of self.deployment.wait() to self.deployment.wait(timeout=self.timeout)
>
> Each unit test took around an hour to run. I wonder if there's anything to be done to shorten that without sacrificing the completeness of the unit tests. Perhaps limit calls to `apt-get update` in `hooks/hooks.py`; it's called twice, on line 32 and again on line 1848. If one of those calls could be eliminated, that would probably shave a few minutes off the test time.
An hour is ridiculous :-( I lost the ability to reuse machines when I
switched to amulet (correct, but slower). But still, provisioning 4
machines, doing the apt dance on them twice, running a dozen hooks...
if that takes an hour either the cloud is horrible or there are some
other delays still buried in there. Or do you mean a suite of tests,
such as the PG91Tests collection?
I do want to optimize the tests, but this will be quite an invasive
change. I want to get the existing suite green before major
refactorings.
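That said, Adam's apt observation looks like a cheap, contained win:
assuming both calls land in the same hook process, a module-level memo
would drop the second `apt-get update` without touching anything else
(a sketch; the function name is illustrative):

    import subprocess

    _apt_updated = False

    def apt_update_once():
        """Run `apt-get update` at most once per hook invocation;
        hooks.py currently calls it on line 32 and again on line 1848."""
        global _apt_updated
        if not _apt_updated:
            subprocess.check_call(['apt-get', 'update'])
            _apt_updated = True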
> With the above few fixes, plus the merges I need to land in the tools, I think we'll be in good shape to land these changes. Thanks again for all of your work and patience on this!
No, thank you.
--
Stuart Bishop <email address hidden>