Merge ~sergiodj/ubuntu/+source/bind9:bug1997375-segfault-isc-nm-tcp into ubuntu/+source/bind9:ubuntu/focal-devel

Proposed by Sergio Durigan Junior
Status: Merged
Approved by: git-ubuntu bot
Approved revision: not available
Merged at revision: cd357eac4a7c77737bb0422d00a90a4847756553
Proposed branch: ~sergiodj/ubuntu/+source/bind9:bug1997375-segfault-isc-nm-tcp
Merge into: ubuntu/+source/bind9:ubuntu/focal-devel
Diff against target: 238 lines (+216/-0)
3 files modified
debian/changelog (+8/-0)
debian/patches/lp1997375-segfault-isc-nm-tcp-send.patch (+207/-0)
debian/patches/series (+1/-0)
Reviewer Review Type Date Requested Status
git-ubuntu bot Approve
Andreas Hasenack (community) Approve
Canonical Server Reporter Pending
Review via email: mp+435153@code.staging.launchpad.net

Description of the change

This MP fixes the segmentation fault experienced by some users on bug #1997375.

Unfortunately, I wasn't able to reliably reproduce the crash locally. It involves a complex scenario where bind9 is put under stress by having to handle a lot of requests and then there's a race condition on the tcpdns code (which is accessing socket internal fields in an unsafe manner), all of that finally leading to the segmentation fault.

This issue is affecting several community members, and they have been very helpful in determining the right fix. After one of them provided me with a coredump, I was able to inspect it and pinpoint what seemed to be the commit that fixes the problem:

https://gitlab.isc.org/isc-projects/bind9/-/commit/4ea84740e64f44ff1d397f1a317682633f174b0d

I backported it, put the package in a PPA and ask the community to try it out. As you can see in the bug report, I've had 3 people confirming that the segfaults don't happen with the patched bind9 anymore.

I decided to go ahead with the SRU and, again, rely on the community's good will to help me with the test plan.

There's a PPA with the proposed changes here:

https://launchpad.net/~sergiodj/+archive/ubuntu/bind9-bugfix/+packages

autopkgtest results:

Results: (from http://autopkgtest.ubuntu.com/results/autopkgtest-focal-sergiodj-bind9-bugfix/?format=plain)
  bind9 @ amd64:
    04.01.23 20:24:52 Log 🗒️ ✅ Triggers: bind9/1:9.16.1-0ubuntu2.12~ppa1
  bind9 @ arm64:
    04.01.23 20:27:21 Log 🗒️ ✅ Triggers: bind9/1:9.16.1-0ubuntu2.12~ppa1
  bind9 @ armhf:
    04.01.23 19:58:53 Log 🗒️ ✅ Triggers: bind9/1:9.16.1-0ubuntu2.12~ppa1
  bind9 @ ppc64el:
    04.01.23 20:28:12 Log 🗒️ ✅ Triggers: bind9/1:9.16.1-0ubuntu2.12~ppa1
  bind9 @ s390x:
    04.01.23 20:34:05 Log 🗒️ ✅ Triggers: bind9/1:9.16.1-0ubuntu2.12~ppa1

To post a comment you must log in.
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I'll look at this

Revision history for this message
Andreas Hasenack (ahasenack) :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Hi Sergio,

I was following the trails of this bug in upstream gitlab.

The fix you are backporting comes from MR #3721: https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/3721/commits

The commit is from June 16th, 2020.

There is this other MR #3726: https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/3726/commits#882b07ab55f884856324e81fb113e650836a94d9

It shares the backported commit, but has two others on top:
commit 1cf65cd8829f01cc38f47b1180d7fbe3ab710d35
Author: Witold Kręcicki <email address hidden>
Date: Mon Jun 22 15:46:11 2020 -0700

    Fix a shutdown race in netmgr udp

    We need to mark the socket as inactive early (and synchronously)
    in the stoplistening process; otherwise we might destroy the
    callback argument before we actually stop listening, and call
    the callback on bad memory.

And

commit 3704c4fff2757ade6dda56865aa87935d0c447b9
Author: Evan Hunt <email address hidden>
Date: Sat Jun 20 15:03:05 2020 -0700

    clean up outerhandle when a tcpdns socket is disconnected

    this prevents a crash when some non-netmgr thread, such as a
    recursive lookup, times out after the TCP socket is already
    disconnected.

They also refer to the 1937 Issue (probably because of the shared commit), but also a new one, 1938: https://gitlab.isc.org/isc-projects/bind9/-/issues/1938

Issue 1938 says that because of the crash caused by 1937, bind was being restarted by systemd, and that was also crashing, which became issue 1938. Here is an interesting comment in 1938:

    The server that was repeatedly crashing on restart was eventually restarted successfully
    after stopping client query traffic and letting named load all zones undisturbed before
    allowing client queries again

It *looks* like being inundated with queries (what caused the 1937 crash) while starting up or recovering from the crash also causes problems, and this time it was in the new code.

These are complicated scenarios, and backports. I was wondering if you considered the 1938 issue, or if you hadn't seen it. The commits fixing 1938 are a few days later than the 1937 ones, in the git log.

review: Needs Information
Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Thanks for the review, Andreas.

I hadn't noticed MR #3726; interesting. It does seem to be a more complete fix for a more generic bug, but it's also a bit more complex and, as such, introduces more possibilities for regressions.

I'm now wondering whether it makes sense to bite the bullet and backport everything, or keep the fix more "contained" and just address the issue raised in our bug report. Something to be noted is that at least 3 people tested the changes I'm proposing here and so far haven't experienced negative side effects.

As a member of the SRU team, what's your opinion here? I'm inclined to be more conservative and not backport the other commits from MR #3726, but I'm open to suggestions.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

As a member of the SRU team, I normally look ahead in upstream commits whenever a big chunk of C code is refactoring something or introducing new functions, and this is such a case, which led me to find what I did. I don't know what it would look like when backported, and whether it's worth the risk. That's part of the risk analysis of the SRU. At a glance, it looks like whatever flood of queries that is causing the crash keeps incoming while systemd is restarting bind, and that is what causes the second crash.

If we start noticing the second crash after the SRU is released, then the second fix can still be applied I suppose. I would just confirm with those 3 people who tested the ppa package if they had the bind9 service unit configured in such a way as to restart bind whenever it crashed, I don't know if that's our default or not.

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

OK, back to this MP.

I talked to Andreas yesterday and decided to leave a comment upstream asking for their input regarding the two MRs. I've come to believe that we really just need MR 3721 (which is the one I backported here), because it is the upstream backport of MR 3726 (as can be seen in this comment: https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/3721#note_143992). In fact, you can confirm that MR 3721 targets the v9_16 branch, while MR 3726 targets the main branch. But in any case, I think it's a good idea to confirm with them.

I won't bother getting in touch with the folks who have been helping me with testing this package in Ubuntu for now. Hopefully upstream's answer will be enough to get us to move forward here. If they say that I should backport more commits, then I'll prepare another package, put it in the PPA and ask for more testing. Otherwise, I'll ask Andreas to finish his review of this MP.

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Upstream's reply wasn't really encouraging:

https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/3721#note_345089

While I agree with Ondřej's opinion, I also believe that we have to work with what's available for now. Hopefully Lena's effort to get an MRE for bind9 will bear fruit and we'll be able to have the proper, fuller fix for this problem shipped to our users, but meanwhile I believe it's justifiable to proceed with this MP as is.

Andreas, if you're OK with it I'd like to take my chances and upload this change to Focal.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Ok, +1. Tiny optional nit, the DEP3 headers are missing Last-Update and Applied-Upstream. There is also that inline comment about perhaps changing the bug url to https://gitlab.isc.org/isc-projects/bind9/-/issues/1937, I don't know if you saw that, but it's also a nit.

review: Approve
Revision history for this message
git-ubuntu bot (git-ubuntu-bot) wrote :

Approvers: sergiodj, ahasenack
Uploaders: sergiodj, ahasenack
MP auto-approved

review: Approve
Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

On Thursday, January 19 2023, Andreas Hasenack wrote:

> Ok, +1. Tiny optional nit, the DEP3 headers are missing Last-Update
> and Applied-Upstream. There is also that inline comment about perhaps
> changing the bug url to
> https://gitlab.isc.org/isc-projects/bind9/-/issues/1937, I don't know
> if you saw that, but it's also a nit.

Thanks, Andreas.

I've addressed the points above and uploaded the package:

$ dput bind9_9.16.1-0ubuntu2.12_source.changes
Trying to upload package to ubuntu
Checking signature on .changes
gpg: /home/sergio/work/bind9/bind9_9.16.1-0ubuntu2.12_source.changes: Valid signature from 106DA1C8C3CBBF14
Checking signature on .dsc
gpg: /home/sergio/work/bind9/bind9_9.16.1-0ubuntu2.12.dsc: Valid signature from 106DA1C8C3CBBF14
Uploading to ubuntu (via ftp to upload.ubuntu.com):
  Uploading bind9_9.16.1-0ubuntu2.12.dsc: done.
  Uploading bind9_9.16.1-0ubuntu2.12.debian.tar.xz: done.
  Uploading bind9_9.16.1-0ubuntu2.12_source.buildinfo: done.
  Uploading bind9_9.16.1-0ubuntu2.12_source.changes: done.
Successfully uploaded packages.

--
Sergio
GPG key ID: E92F D0B3 6B14 F1F4 D8E0 EB2F 106D A1C8 C3CB BF14

There was an error fetching revisions from git servers. Please try again in a few minutes. If the problem persists, contact Launchpad support.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
The diff is not available at this time. You can reload the page or download it.

Subscribers

People subscribed via source and target branches