Merge lp://staging/~jameinel/bzr/2.1-client-reconnect-819604 into lp://staging/bzr/2.1

Proposed by John A Meinel
Status: Work in progress
Proposed branch: lp://staging/~jameinel/bzr/2.1-client-reconnect-819604
Merge into: lp://staging/bzr/2.1
Diff against target: 1035 lines (+589/-143)
7 files modified
bzrlib/help_topics/en/debug-flags.txt (+2/-0)
bzrlib/osutils.py (+40/-9)
bzrlib/smart/client.py (+170/-86)
bzrlib/smart/medium.py (+43/-24)
bzrlib/smart/protocol.py (+0/-3)
bzrlib/tests/test_osutils.py (+39/-0)
bzrlib/tests/test_smart_transport.py (+295/-21)
To merge this branch: bzr merge lp://staging/~jameinel/bzr/2.1-client-reconnect-819604
Reviewer Review Type Date Requested Status
bzr-core Pending
Review via email: mp+78597@code.staging.launchpad.net

Commit message

Start of bug #819604, allow bzr clients to reconnect if the connection is gone when we go to write a new request.

Description of the change

This is some initial work for having clients reconnect when their bzr+ssh (or bzr://) connection gets closed underneath them.

To start with, the trickiest bit is that it is really quite hard to detect when things are closed. Mostly because of buffering, etc, at lots of different levels (internal memory buffering, socket/pipe buffering, 'ssh.exe' buffering, latency of close message from server back to client, etc.)

I'll also clarify that this isn't intended to handle all possible sorts of connection failures. What we really are trying to handle is allowing a server to be upgraded 'live'. Such that the server will gently disconnect us only between complete requests. Versus having requests terminated randomly in the middle of content. It may help with some of those cases, and we certainly don't want to cause corruption, etc if one of those happens. But it isn't the goal of *this* change.

This specific patch has a few aspects, and does help some cases in real-world testing.

1) Backport a minimum amount of the SmartSSHClientMedium proxies to _real_medium patch from Andrew. In bzr-2.2 Andrew updated the code so that when spawning an 'ssh' subprocess, we use a socketpair when possible instead of pipes. (That way we can read without blocking, allowing us to use large buffers, etc.)

It also meant that he shared the logic from SmartSimplePipeStreamMedium and SmartTCPClientMedium, which I wanted to do, to avoid having to re-implement the logic multiple times.

This is a pretty small part of the patch.

2) Update some of the lower level code so that we get ConnectionReset rather than various IOError and ValueError when writing to a closed connection.

3) Update the _SmartClient._send_request logic. If we get a ConnectionReset while we are trying to write the request, then we can safely retry the request. On the assumption that as long as the server doesn't see the final 'e' terminal byte, it will reject the request, because it is incomplete.

4) The one caveat is if there is a 'body_stream'. body_stream is an iterable, so we can't just rewind it and resume it.

On the plus side, there is only one caller RemoteStreamSink.insert_stream. In theory we could update that caller. In practice, it *also* takes a 'stream' object, which can't simply be rewound. Though we do a 'no-op' stream just before the real content to determine that the remote server actually supports the RPC we want to use.

That would allow us to close the gap a little bit. So we can detect ConnectionReset all the way to the point that we actually start streaming the content.

The other bit that would be possible, is to update ProtocolThreeRequester.call_with_body_stream to allow it to set a flag to indicate whether it has actually started consuming the stream yet. Then update it to flush just before it iterates the body stream to help force detecting closed connections. (otherwise we are likely to have buffered at least the first part of the body_stream in local memory, since the local buffer is 1MB and the first flush is only called after the first chunk of the body_stream.)

This just closes a window where we won't reconnect during 'bzr push'.

5) The next step is that (especially for sockets), we write out the request successfully, and then notice it is closed when we try to read back the response.

Here I'm a lot more concerned about non-idempotent requests. Because the server might have fully read the request, and sent back a response, but we didn't get to see the response.

However, our RPC design is meant to be stateless (all state is passed in the request, not assumed to be stored on the server side). Which means things *should* be idempotent. For something like 'put_bytes', we may write the same content twice, but we shouldn't corrupt anything.

The only one I can think of is 'append_bytes', and the associated open_write_stream() code. (RemoteTransport.open_write_stream uses AppendBasedFileStream as its implementation.)

And that code shouldn't really be used with up-to-date bzr clients and servers. So I'm much more ok with it just dying if it isn't safe to retry.

To post a comment you must log in.
Revision history for this message
John A Meinel (jameinel) wrote :

I should note, I think it would be prudent to land this into bzr.dev first, but I wanted to make sure it was written against 2.1 since that is the oldest client that we want to backport support for.

Also, I expect there will be a modest amount of conflicts bringing this up from bzr-2.1 into bzr.dev. There has been a modest amount of changes inbetween, so I'll try to produce branches for each series.

Revision history for this message
John A Meinel (jameinel) wrote :

Small update, in working on the read-reconnect code, I realized passing all these arguments around was very clumsy. So I refactored the code into a helper class _SmartClientRequest. That way, the 6 arguments being passed around just end up as attributes, and the function call interplay is easier to follow.

Revision history for this message
Vincent Ladeuil (vila) wrote :

> I should note, I think it would be prudent to land this into bzr.dev first,
> but I wanted to make sure it was written against 2.1 since that is the oldest
> client that we want to backport support for.

+4 (+1 for each of 2.1, 2.2, 2.3, 2.4)

No need to ask anymore, good, keep reading my mind ;)

I don't know exactly how we should proceed but as much as possible, all the modifications related to this topic should be done in a separate branch which should be merged into whatever series we target.

Backport is hard, let's avoid it as much as possible (but no more ;)

> Also, I expect there will be a modest amount of conflicts bringing this up
> from bzr-2.1 into bzr.dev. There has been a modest amount of changes
> inbetween, so I'll try to produce branches for each series.

Yup, that sounds a lot like a loom with a thread by targeted series. Or any arrangement of branches that makes our life easier.

Revision history for this message
Vincent Ladeuil (vila) wrote :

> Small update, in working on the read-reconnect code, I realized passing all
> these arguments around was very clumsy. So I refactored the code into a helper
> class _SmartClientRequest. That way, the 6 arguments being passed around just
> end up as attributes, and the function call interplay is easier to follow.

From the description alone this part got my approval. If some helpers need to move there too, clarifying the class features, even better ( no pre-conceptions here, just saying).

Revision history for this message
John A Meinel (jameinel) wrote :

Some small code cleanups, and add -Dnoretry as a debugging flag that will disable the retry-on-write.

Revision history for this message
Martin Pool (mbp) wrote :

I'm going to (reluctantly) bump these out of the merge queue until this is landed and well tested in 2.5 - probably until after the 2.5.0 final release is out. Then we can come back and look at landing to earlier series.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/29/2011 7:32 AM, Martin Pool wrote:
> I'm going to (reluctantly) bump these out of the merge queue until
> this is landed and well tested in 2.5 - probably until after the
> 2.5.0 final release is out. Then we can come back and look at
> landing to earlier series.

I don't think it is something you should feel bad about :). That was
certainly the intent.

1) I wrote it against 2.1 so we would have a chance to land it there
   cleanly.

2) I tried to split it up into multiple patches to make it easier to
   review.

3) I merged it up through the stack because there are a fair number of
   changes to this code, and thus conflicts, etc.

4) It should certainly land on trunk and get a thorough shakeout there.
   I'm not sure how much time you think is enough. Certainly if enough
   time passes, 2.1 will be fully obsolete :).

Anyway, thanks for getting this landed in 2.5.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7UiEcACgkQJdeBCYSNAANSFwCgiwx+EeIpiYLJF84P0NPcLFd5
T90AoJDimVbLmeDiqn1FWFDmdWtMXaWq
=94bN
-----END PGP SIGNATURE-----

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
The diff is not available at this time. You can reload the page or download it.

Subscribers

People subscribed via source and target branches