#9102 [Tracker] Nightly test failure (rawhide) when restarting named
Closed: fixed 2 years ago by frenaud. Opened 2 years ago by frenaud.

Issue

The nightly tests are failing in rawhide during the server setup, in a step restarting named process. See PR #1473 with for instance the test test_installation_TestInstallMasterDNS (logs, report).

Steps to Reproduce

  1. ipa-server-install -n ipa.test -r IPA.TEST -p Secret.123 -a Secret.123 -U --setup-dns --auto-forwarder
  2. edit /et/named/ipa-logging-ext.conf and set severity to debug
  3. restart named with systemctl restart named.service

Actual behavior

The command fails and named produces a coredump.

Version/Release/Distribution

The issue happens since the bind/bind-dyndb-ldap update (bodhi):
freeipa-server-4.10.0.dev-0.fc36.x86_64
bind-9.16.25-2.fc36.x86_64
bind-dyndb-ldap-11.9-13.fc36.x86_64

Additional info:

The journal shows the following coredump:

Jan 29 12:07:02 master.ipa.test kernel: isc-net-0000[35641]: segfault at be ip 00007f42e0438796 sp 00007f42dc8cc020 error 6 in libisc-9.16.25-RH.so[7f42e042e000+53000]
Jan 29 12:07:02 master.ipa.test kernel: Code: 53 48 83 ec 08 eb 2a 0f 1f 44 00 00 66 48 8d 3d b8 b7 06 00 66 66 48 e8 d8 a9 ff ff 48 89 d9 48 63 10 49 8b 45 10 48 8b 04 d0 <48> 87 0c 28 48 89 d8 49 8b 1c 24 48 39 c3 75 d2 48 83 c4 08 5b 5d
Jan 29 12:07:02 master.ipa.test audit[35637]: ANOM_ABEND auid=4294967295 uid=25 gid=25 ses=4294967295 subj=system_u:system_r:named_t:s0 pid=35637 comm="isc-net-0000" exe="/usr/sbin/named" sig=11 res=1
Jan 29 12:07:02 master.ipa.test audit: BPF prog-id=1006 op=LOAD
Jan 29 12:07:02 master.ipa.test audit: BPF prog-id=1007 op=LOAD
Jan 29 12:07:02 master.ipa.test audit: BPF prog-id=1008 op=LOAD
Jan 29 12:07:02 master.ipa.test systemd[1]: Started systemd-coredump@3-35643-0.service - Process Core Dump (PID 35643/UID 0).
Jan 29 12:07:02 master.ipa.test audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@3-35643-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 29 12:07:03 master.ipa.test [35645]: [164B blob data]
Jan 29 12:07:03 master.ipa.test systemd-coredump[35644]: Process 35637 (named) of user 25 dumped core.

                                                         Module linux-vdso.so.1 with build-id b5939f660ab6f7577272c9a79485f98ea6b15ab7
                                                         Module libplain.so with build-id 0598867fdf736adfa73b06c49a3070322b163469
                                                         Metadata for module libplain.so owned by FDO found: {
                                                                 "type" : "rpm",
                                                                 "name" : "cyrus-sasl",
                                                                 "version" : "2.1.27-17.fc36",
                                                                 "architecture" : "x86_64",
                                                                 "osCpe" : "cpe:/o:fedoraproject:fedora:36"
                                                         }

                                                         Module liblogin.so with build-id a69a66defcd3cf7d630d56b27f0c33b73f553b84
                                                         Metadata for module liblogin.so owned by FDO found: {
                                                                 "type" : "rpm",
                                                                 "name" : "cyrus-sasl",
                                                                 "version" : "2.1.27-17.fc36",
                                                                 "architecture" : "x86_64",
                                                                 "osCpe" : "cpe:/o:fedoraproject:fedora:36"
                                                         }

                                                         Module libdigestmd5.so with build-id df4453af32021b83cdcc0c6f6d1e90d28c4cffe2
                                                         Metadata for module libdigestmd5.so owned by FDO found: {
                                                                 "type" : "rpm",
                                                                 "name" : "cyrus-sasl",
                                                                 "version" : "2.1.27-17.fc36",
                                                                 "architecture" : "x86_64",
                                                                 "osCpe" : "cpe:/o:fedoraproject:fedora:36"
                                                         }

                                                         Module libcrammd5.so with build-id 749e80aaa8d29954a873be2f931ff46b7d2c9272
                                                         Metadata for module libcrammd5.so owned by FDO found: {
                                                                 "type" : "rpm",
                                                                 "name" : "cyrus-sasl",
                                                                 "version" : "2.1.27-17.fc36",
                                                                 "architecture" : "x86_64",
                                                                 "osCpe" : "cpe:/o:fedoraproject:fedora:36"
                                                         }

                                                         Module libgssapiv2.so with build-id ae7445e0869bcc18ff0eeeb7bb01e9a673673ce2
                                                         Metadata for module libgssapiv2.so owned by FDO found: {
                                                                 "type" : "rpm",
                                                                 "name" : "cyrus-sasl",
                                                                 "version" : "2.1.27-17.fc36",
                                                                 "architecture" : "x86_64",
                                                                 "osCpe" : "cpe:/o:fedoraproject:fedora:36"
                                                         }

                                                         Module libgdbm.so.6 with build-id ef2551acd6629d3145f932451bbc9dab01e245ba
                                                         Metadata for module libgdbm.so.6 owned by FDO found: {
                                                                 "type" : "rpm",
                                                                 "name" : "gdbm",
                                                                 "version" : "1.22-2.fc36",
                                                                 "architecture" : "x86_64",
                                                                 "osCpe" : "cpe:/o:fedoraproject:fedora:36"
                                                         }

                                                         Module libsasldb.so with build-id 0454bdf8afc1f4d82f9e35dda7e82e4a29cb2bf8
                                                         Metadata for module libsasldb.so owned by FDO found: {
                                                                 "type" : "rpm",
                                                                 "name" : "cyrus-sasl",
                                                                 "version" : "2.1.27-17.fc36",
                                                                 "architecture" : "x86_64",
                                                                 "osCpe" : "cpe:/o:fedoraproject:fedora:36"
                                                         }

                                                         Module libanonymous.so with build-id 8d4f90407a2d2586608ab9a5bcda7aee827215f4
                                                         Metadata for module libanonymous.so owned by FDO found: {
                                                                 "type" : "rpm",
                                                                 "name" : "cyrus-sasl",
                                                                 "version" : "2.1.27-17.fc36",
                                                                 "architecture" : "x86_64",
                                                                 "osCpe" : "cpe:/o:fedoraproject:fedora:36"
                                                         }

                                                         Module libcrypt.so.2 with build-id 3238e1ae7275785819022aec6b4cd552e5b767b0
                                                         Stack trace of thread 35641:
                                                         #0  0x00007f42e0438796 isc_hp_protect (libisc-9.16.25-RH.so + 0x20796)
                                                         #1  0x00007f42e046422a isc_queue_enqueue (libisc-9.16.25-RH.so + 0x4c22a)
                                                         #2  0x00007f42e044f8d5 isc__nm_enqueue_ievent (libisc-9.16.25-RH.so + 0x378d5)
                                                         #3  0x00007f42e046c972 isc_task_sendto (libisc-9.16.25-RH.so + 0x54972)
                                                         #4  0x00007f42dca14988 syncrepl_update (ldap.so + 0x1c988)
                                                         #5  0x00007f42dca15695 ldap_sync_search_entry (ldap.so + 0x1d695)
                                                         #6  0x00007f42dc9d7361 ldap_sync_search_entry (libldap_r-2.4.so.2 + 0x43361)
                                                         #7  0x00007f42dc9d80bb ldap_sync_init (libldap_r-2.4.so.2 + 0x440bb)
                                                         #8  0x00007f42dca1645d ldap_sync_doit (ldap.so + 0x1e45d)
                                                         #9  0x00007f42dca168e6 ldap_syncrepl_watcher.lto_priv.0 (ldap.so + 0x1e8e6)
                                                         #10 0x00007f42e046bef9 isc__trampoline_run (libisc-9.16.25-RH.so + 0x53ef9)
                                                         #11 0x00007f42dfc4f0e7 start_thread (libc.so.6 + 0x910e7)
                                                         #12 0x00007f42dfcd4780 __clone3 (libc.so.6 + 0x116780)

                                                         Stack trace of thread 35637:
                                                         #0  0x00007f42dfc0179a __sigtimedwait (libc.so.6 + 0x4379a)
                                                         #1  0x00007f42dfc00e6c sigwait (libc.so.6 + 0x42e6c)
                                                         #2  0x00007f42e043d7a3 isc_app_ctxrun (libisc-9.16.25-RH.so + 0x257a3)
                                                         #3  0x00007f42e043da4c isc_app_run (libisc-9.16.25-RH.so + 0x25a4c)
                                                         #4  0x000055adde47846e main (named + 0x1d46e)
                                                         #5  0x00007f42dfbeb590 __libc_start_call_main (libc.so.6 + 0x2d590)
                                                         #6  0x00007f42dfbeb649 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2d649)
                                                         #7  0x000055adde478f65 _start (named + 0x1df65)

                                                         Stack trace of thread 35639:
                                                         #0  0x00007f42dfc4bca9 __futex_abstimed_wait_common (libc.so.6 + 0x8dca9)
                                                         #1  0x00007f42dfc4e759 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x90759)
                                                         #2  0x00007f42e047d5f0 isc_condition_waituntil (libisc-9.16.25-RH.so + 0x655f0)
                                                         #3  0x00007f42e046fa5b run (libisc-9.16.25-RH.so + 0x57a5b)
                                                         #4  0x00007f42e046bef9 isc__trampoline_run (libisc-9.16.25-RH.so + 0x53ef9)
                                                         #5  0x00007f42dfc4f0e7 start_thread (libc.so.6 + 0x910e7)
                                                         #6  0x00007f42dfcd4780 __clone3 (libc.so.6 + 0x116780)

                                                         Stack trace of thread 35640:
                                                         #0  0x00007f42dfcd3d5e epoll_wait (libc.so.6 + 0x115d5e)
                                                         #1  0x00007f42e0477434 netthread (libisc-9.16.25-RH.so + 0x5f434)
                                                         #2  0x00007f42e046bef9 isc__trampoline_run (libisc-9.16.25-RH.so + 0x53ef9)
                                                         #3  0x00007f42dfc4f0e7 start_thread (libc.so.6 + 0x910e7)
                                                         #4  0x00007f42dfcd4780 __clone3 (libc.so.6 + 0x116780)

                                                         Stack trace of thread 35638:
                                                         #0  0x00007f42dfcc8c94 __fcntl64_nocancel_adjusted (libc.so.6 + 0x10ac94)
                                                         #1  0x00007f42dfcc4428 __libc_fcntl64 (libc.so.6 + 0x106428)
                                                         #2  0x00007f42e047518e opensocket (libisc-9.16.25-RH.so + 0x5d18e)
                                                         #3  0x00007f42e04768d4 socket_create (libisc-9.16.25-RH.so + 0x5e8d4)
                                                         #4  0x00007f42e0561aa3 open_socket (libdns-9.16.25-RH.so + 0x5eaa3)
                                                         #5  0x00007f42e05637bd dns_dispatch_addresponse (libdns-9.16.25-RH.so + 0x607bd)
                                                         #6  0x00007f42e060e33d resquery_send.lto_priv.0 (libdns-9.16.25-RH.so + 0x10b33d)
                                                         #7  0x00007f42e060d0db fctx_query.lto_priv.0 (libdns-9.16.25-RH.so + 0x10a0db)
                                                         #8  0x00007f42e0610839 fctx_try.lto_priv.0 (libdns-9.16.25-RH.so + 0x10d839)
                                                         #9  0x00007f42e046dffe isc_task_run (libisc-9.16.25-RH.so + 0x55ffe)
                                                         #10 0x00007f42e04594be process_netievent (libisc-9.16.25-RH.so + 0x414be)
                                                         #11 0x00007f42e04596aa process_queue (libisc-9.16.25-RH.so + 0x416aa)
                                                         #12 0x00007f42e0459e77 async_cb (libisc-9.16.25-RH.so + 0x41e77)
                                                         #13 0x00007f42e01f3b3d uv__async_io.part.0 (libuv.so.1 + 0xab3d)
                                                         #14 0x00007f42e020f86d uv__io_poll.part.0 (libuv.so.1 + 0x2686d)
                                                         #15 0x00007f42e01f968a uv_run (libuv.so.1 + 0x1068a)
                                                         #16 0x00007f42e0459712 nm_thread (libisc-9.16.25-RH.so + 0x41712)
                                                         #17 0x00007f42e046bef9 isc__trampoline_run (libisc-9.16.25-RH.so + 0x53ef9)
                                                         #18 0x00007f42dfc4f0e7 start_thread (libc.so.6 + 0x910e7)
                                                         #19 0x00007f42dfcd4780 __clone3 (libc.so.6 + 0x116780)
                                                         ELF object binary architecture: AMD x86-64
Jan 29 12:07:03 master.ipa.test systemd[1]: systemd-coredump@3-35643-0.service: Deactivated successfully.
Jan 29 12:07:03 master.ipa.test audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@3-35643-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 29 12:07:03 master.ipa.test systemd[1]: named.service: Control process exited, code=exited, status=1/FAILURE
Jan 29 12:07:03 master.ipa.test systemd[1]: named.service: Failed with result 'exit-code'.
Jan 29 12:07:03 master.ipa.test systemd[1]: Failed to start named.service - Berkeley Internet Name Domain (DNS).

Metadata Update from @frenaud:
- Issue tagged with: tracker

2 years ago

It seems bind upstream has done refactoring to netmgr tasks, which affects way syncrepl thread tasks are handled. It crashes when zone loading task is running and syncrepl task is executed at the same time. Unfortunately I don't see the cause yet, is tasks are somehow complicated stuff. Would commit message explaining changes help to you?

https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/4918

Judging by

Stack trace of thread 35641:
#0  0x00007f42e0438796 isc_hp_protect (libisc-9.16.25-RH.so + 0x20796)
#1  0x00007f42e046422a isc_queue_enqueue (libisc-9.16.25-RH.so + 0x4c22a)
#2  0x00007f42e044f8d5 isc__nm_enqueue_ievent (libisc-9.16.25-RH.so + 0x378d5)
#3  0x00007f42e046c972 isc_task_sendto (libisc-9.16.25-RH.so + 0x54972)
#4  0x00007f42dca14988 syncrepl_update (ldap.so + 0x1c988)

I think, what happens is that we attempt to add an event into a half-initialized queue. isc_hp_protect() is fairly small. We get it called from isc_queue_enqueue() with ihp = 0 and atom = &queue->tail. Assuming queue->tail is not NULL, it might be hp->hp[tid()] NULL -- when hp is created in isc_hp_new(), there is no check that isc_get_mem() succeeded.

uintptr_t
isc_hp_protect(isc_hp_t *hp, int ihp, atomic_uintptr_t *atom) {
        uintptr_t n = 0;
        uintptr_t ret;
        while ((ret = atomic_load(atom)) != n) {
                atomic_store(&hp->hp[tid()][ihp], ret);
                n = ret;
        }
        return (ret);
}

I wonder if it is possible to get a coredump as without it it is hard to correlate the lines...

coredump is easy to obtain. problem is most of these variables are optimized out. I think isc_get_mem can not fail anymore. If it does, whole daemon crashes.

(gdb) p isc_tid_v
$1 = 4
(gdb) p hp->hp[isc_tid_v]
$2 = (atomic_uintptr_t *) 0xbe
(gdb) p isc__hp_max_threads
$3 = 4
(gdb) p hp->hp@4
$4 = {0x55c6072fa280, 0x55c6072fa2b0, 0x7f19ebdc7030 <node_destroy>, 0xbe}
(gdb) p hp->hp[0]
$5 = (atomic_uintptr_t *) 0x55c6072fa2e0
(gdb) p hp->hp[1]
$6 = (atomic_uintptr_t *) 0x55c6072fa4a0
(gdb) p hp->hp[2]
$7 = (atomic_uintptr_t *) 0x55c6072fa660
(gdb) p hp->hp[3]
$8 = (atomic_uintptr_t *) 0x55c6072fa820
(gdb) p hp->hp[4]
$9 = (atomic_uintptr_t *) 0xbe

It seems tid() returns one higher than isc__hp_max_threads, which means it selects uninitialized memory. Maybe isc ensures maximal count of threads and bind-dyndb-ldap breaks this assumption by spawning custom thread in addition.

That's a good assumption, I wonder how we can fix this. We absolutely need this separate thread for asynchronous processing of the updates coming through syncrepl.

One way is to call isc_hp_init() with high enough value. In v9_16 branch this is done in isc_managers_create() with 4workers. In v9_18 it uses 2+2workers. Workers picked up from number of CPUs detected (in create_managers()). This is hard-coded in bind's code:

In v9_16:

        /*
         * We have ncpus network threads, ncpus old network threads - make
         * it 4x just to be on the safe side.
         */
        isc_hp_init(4 * workers);

in v9_18:

        /*
         * Currently, there are:
         * - 1 main thread
         * - 1 timer thread
         * - n netmgr threads
         * - n threadpool threads
         */
        isc_hp_init(2 + 2 * workers);

In bind-dyndb-ldap we have number of LDAP connections defined in the configuration:

* connections (default 2)

        Number of connections the LDAP driver should try to establish to
        the LDAP server. It's best if this matches the number of threads
        BIND creates, for performance reasons. However, your LDAP server
        configuration might only allow certain number of connections per
        client.

So we need to affect the whole number of threads at the time HP is initialized. Alternatively, we can have HP autoscale if tid() is greater than the number of threads in HP. This would probably be best -- if isc_hp_*() functions that use tid() would fall back to reallocate the HP array in case tid() is higher than currently recorded. This would not need then any changes on plugins like bind-dyndb-ldap side.

I did not know number of connections is configurable. There might be solution possible.

isc_hp_init has max_threads parameter, isc_queue_new has it also. Interesting enough is those values are quite different. It seems isc_queue_new parameter is unused. There seems no way to increase number of worker threads from configuration, because it is initialized way too early.

Breakpoint 3, isc_hp_init (max_threads=4) at ../../../lib/isc/hp.c:86
86  isc_hp_init(int max_threads) {
(gdb) bt
#0  isc_hp_init (max_threads=4) at ../../../lib/isc/hp.c:86
#1  0x00007ffff7c80bad in isc_managers_create (mctx=0x5555556764d0, workers=<optimized out>, quantum=0, 
    netmgrp=0x5555555dbae8 <named_g_nm>, taskmgrp=0x5555555dbae0 <named_g_taskmgr>) at ../../../lib/isc/managers.c:32
#2  0x00005555555712fb in create_managers () at ../../../bin/named/main.c:927
#3  setup () at ../../../bin/named/main.c:1252
#4  main (argc=<optimized out>, argv=<optimized out>) at ../../../bin/named/main.c:1569
(gdb) cont
Continuing.

Breakpoint 2, isc_hp_new (mctx=0x5555556764d0, max_hps=1, deletefunc=0x7ffff7c98900 <node_destroy>) at ../../../lib/isc/hp.c:92
92  isc_hp_new(isc_mem_t *mctx, size_t max_hps, isc_hp_deletefunc_t *deletefunc) {
(gdb) bt
#0  isc_hp_new (mctx=0x5555556764d0, max_hps=1, deletefunc=0x7ffff7c98900 <node_destroy>) at ../../../lib/isc/hp.c:92
#1  0x00007ffff7c9a1b5 in isc_queue_new (mctx=0x5555556764d0, max_threads=128) at ../../../lib/isc/queue.c:122
#2  0x00007ffff7c855bd in isc__netmgr_create (mctx=<optimized out>, workers=<optimized out>, netmgrp=<optimized out>)
    at netmgr/../../../../lib/isc/netmgr/netmgr.c:348
#3  0x00007ffff7c80bd0 in isc_managers_create (mctx=0x5555556764d0, workers=<optimized out>, quantum=0, 
    netmgrp=0x5555555dbae8 <named_g_nm>, taskmgrp=0x5555555dbae0 <named_g_taskmgr>) at ../../../lib/isc/managers.c:35
#4  0x00005555555712fb in create_managers () at ../../../bin/named/main.c:927
#5  setup () at ../../../bin/named/main.c:1252
#6  main (argc=<optimized out>, argv=<optimized out>) at ../../../bin/named/main.c:1569

Update: Petr submitted a proposed change to bind upstream but it is pending review: https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/5795

The named BZ https://bugzilla.redhat.com/show_bug.cgi?id=2048235 has been fixed in bind-9.16.27-1.fc37 , the nightly tests do not display the crash anymore.
See for instance PR #1586 and the test simple_replication (Logs, report).

Closing as fixed.

Metadata Update from @frenaud:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Login to comment on this ticket.

Metadata