#48347 ns-slapd segfaults when removing a malfunctioning replica in disconnected topology
Closed: wontfix 6 years ago Opened 8 years ago by mbabinsk.

While working on new topology-related features in FreeIPA, I encountered a segfault in the dirsrv process. I had three machines in a linear topology:

{{{master1.ipa.test <-> replica1.ipa.test <-> client1.ipa.test}}}

replica1.ipa.test was malfunctioning and was therefore uninstalled. During forced removal of its master entry and leftover segments, dirsrv crashed:

{{{
# ipa-replica-manage del replica1.ipa.test --force
Checking connectivity in topology suffix 'ipaca'
'replica1.ipa.test' is not a part of topology suffix 'ipaca'
Not checking connectivity
Checking connectivity in topology suffix 'realm'
WARNING: Removal of 'replica1.ipa.test' will lead to disconnected topology in suffix 'realm'
Changes will not be replicated to all servers and data will become inconsistent.
You need to add segments to prevent disconnection of the topology.
Errors in topology after removal:
Topology does not allow server client1.ipa.test to replicate with servers:
    master1.ipa.test
Topology does not allow server master1.ipa.test to replicate with servers:
    client1.ipa.test
Forcing removal of replica1.ipa.test
Failed to cleanup replica1.ipa.test entries: cannot connect to 'ldaps://master1.ipa.test:636': 
You may need to manually remove them from the tree
unexpected error: cannot connect to 'ldapi://%2fvar%2frun%2fslapd-IPA-TEST.socket':
}}}

Relevant systemd journal entries:

{{{
Nov 13 12:56:57 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:56:57 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:56:57 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:09 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:57:09 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:57:09 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:10 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:57:10 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:57:10 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:16 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:57:16 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:57:16 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:19 master1.ipa.test systemd[1]: dirsrv@IPA-TEST.service: main process exited, code=killed, status=11/SEGV
Nov 13 12:57:19 master1.ipa.test systemd[1]: Unit dirsrv@IPA-TEST.service entered failed state.
Nov 13 12:57:19 master1.ipa.test systemd[1]: dirsrv@IPA-TEST.service failed.
Nov 13 13:15:46 master1.ipa.test systemd[1]: Stopped 389 Directory Server IPA-TEST..
}}}

The stacktrace can be found here: http://fpaste.org/290155/

389-ds version: 389-ds-base-1.3.4.4-1.fc22.x86_64


Crashing thread is:

{{{
Thread 1 (Thread 0x7fd06b7f6700 (LWP 30192)):
#0  slapi_sdn_copy (from=0x0, to=to@entry=0x7fd0782b7060) at ldap/servers/slapd/dn.c:2474
#1  0x00007fd0935bbd99 in agmt_get_replarea (ra=ra@entry=0x7fd0544ad550) at ldap/servers/plugins/replication/repl5_agmt.c:983
#2  0x00007fd0935bed48 in get_agmt_status (pb=<optimized out>, e=0x7fd0782af720, entryAfter=<optimized out>, returncode=<optimized out>, returntext=<optimized out>, arg=0x7fd0544ad550) at ldap/servers/plugins/replication/repl5_agmt.c:2692
#3  0x00007fd0a0755d29 in dse_call_callback (pb=pb@entry=0x7fd0782c0320, operation=operation@entry=4, flags=flags@entry=1, entryBefore=entryBefore@entry=0x7fd0782af720, entryAfter=entryAfter@entry=0x0, returncode=0x7fd06b7e86ac, returntext=0x7fd06b7e8730 "", pdse=<optimized out>) at ldap/servers/slapd/dse.c:2634
#4  0x00007fd0a0757a00 in dse_search (pb=0x7fd0782c0320) at ldap/servers/slapd/dse.c:1737
#5  0x00007fd0a078fcb8 in op_shared_search (pb=pb@entry=0x7fd0782c0320, send_result=send_result@entry=1) at ldap/servers/slapd/opshared.c:799
#6  0x00007fd0a07a00de in search_internal_callback_pb (pb=pb@entry=0x7fd0782c0320, callback_data=callback_data@entry=0x7fd06b7ef070, prc=prc@entry=0x7fd0a079f240 <internal_plugin_result_callback>, psec=psec@entry=0x7fd0a079f290 <internal_plugin_search_entry_callback>, prec=prec@entry=0x7fd0a079f250 <internal_plugin_search_referral_callback>) at ldap/servers/slapd/plugin_internal_op.c:783
#7  0x00007fd0a07a0368 in search_internal_pb (pb=0x7fd0782c0320) at ldap/servers/slapd/plugin_internal_op.c:636
#8  0x00007fd0a07a0713 in slapi_search_internal_get_entry (dn=dn@entry=0x7fd0782adbe0, attrs=attrs@entry=0x0, ret_entry=ret_entry@entry=0x7fd06b7ef138, component_identity=0x7fd0a2226f20) at ldap/servers/slapd/plugin_internal_op.c:919
#9  0x00007fd0931752c2 in mep_pre_op (pb=0x7fd078076220, modop=2) at ldap/servers/plugins/mep/mep.c:2135
#10 0x00007fd0a079afb9 in plugin_call_func (list=0x7fd0a2227dd0, operation=operation@entry=463, pb=pb@entry=0x7fd078076220, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:1920
#11 0x00007fd0a079b245 in plugin_call_list (pb=0x7fd078076220, operation=463, list=<optimized out>) at ldap/servers/slapd/plugin.c:1864
#12 plugin_call_plugins (pb=pb@entry=0x7fd078076220, whichfunction=whichfunction@entry=463) at ldap/servers/slapd/plugin.c:438
#13 0x00007fd0a0759370 in dse_delete (pb=0x7fd078076220) at ldap/servers/slapd/dse.c:2505
#14 0x00007fd0a074ef10 in op_shared_delete (pb=pb@entry=0x7fd078076220) at ldap/servers/slapd/delete.c:333
#15 0x00007fd0a074f0c2 in delete_internal_pb (pb=pb@entry=0x7fd078076220) at ldap/servers/slapd/delete.c:211
#16 0x00007fd0a074f373 in slapi_delete_internal_pb (pb=pb@entry=0x7fd078076220) at ldap/servers/slapd/delete.c:154
#17 0x00007fd094b3524b in ipa_topo_agmt_del_dn (dn=0x7fd0782b94b0) at topology_agmt.c:176
#18 0x00007fd094b352f8 in ipa_topo_agmt_del (hostname=<optimized out>, conf=<optimized out>, agmt=0x7fd08805d3d0) at topology_agmt.c:161
#19 0x00007fd094b37b88 in ipa_topo_post_del (pb=<optimized out>) at topology_post.c:253
#20 0x00007fd0a079afb9 in plugin_call_func (list=0x7fd0a21f5790, operation=operation@entry=523, pb=pb@entry=0x7fd07829d770, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:1920
#21 0x00007fd0a079b245 in plugin_call_list (pb=0x7fd07829d770, operation=523, list=<optimized out>) at ldap/servers/slapd/plugin.c:1864
#22 plugin_call_plugins (pb=pb@entry=0x7fd07829d770, whichfunction=whichfunction@entry=523) at ldap/servers/slapd/plugin.c:438
#23 0x00007fd0a074ef8d in op_shared_delete (pb=pb@entry=0x7fd07829d770) at ldap/servers/slapd/delete.c:358
#24 0x00007fd0a074f0c2 in delete_internal_pb (pb=pb@entry=0x7fd07829d770) at ldap/servers/slapd/delete.c:211
#25 0x00007fd0a074f373 in slapi_delete_internal_pb (pb=pb@entry=0x7fd07829d770) at ldap/servers/slapd/delete.c:154
#26 0x00007fd094b3ab4c in ipa_topo_util_segm_remove (tconf=tconf@entry=0x7fd07000b8d0, tsegm=tsegm@entry=0x7fd08805c990) at topology_util.c:1135
#27 0x00007fd094b3b753 in ipa_topo_util_delete_segments_for_host (repl_root=0x7fd0a2827490, delhost=delhost@entry=0x7fd07829bf60 ) at topology_util.c:1431
#28 0x00007fd094b3c1b8 in ipa_topo_util_delete_host (hostentry=<optimized out>) at topology_util.c:1558
#29 0x00007fd094b37b0a in ipa_topo_post_del (pb=<optimized out>) at topology_post.c:278
#30 0x00007fd0a079afb9 in plugin_call_func (list=0x7fd0a21b0fe0, operation=operation@entry=508, pb=pb@entry=0x7fd06b7f5b00, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:1920
#31 0x00007fd0a079b245 in plugin_call_list (pb=0x7fd06b7f5b00, operation=508, list=<optimized out>) at ldap/servers/slapd/plugin.c:1864
#32 plugin_call_plugins (pb=pb@entry=0x7fd06b7f5b00, whichfunction=whichfunction@entry=508) at ldap/servers/slapd/plugin.c:438
#33 0x00007fd0a074ef8d in op_shared_delete (pb=pb@entry=0x7fd06b7f5b00) at ldap/servers/slapd/delete.c:358
#34 0x00007fd0a074f1d3 in do_delete (pb=pb@entry=0x7fd06b7f5b00) at ldap/servers/slapd/delete.c:97
#35 0x00007fd0a0c6ceeb in connection_dispatch_operation (pb=0x7fd06b7f5b00, op=0x7fd0a274cd60, conn=0x7fd077538fd8) at ldap/servers/slapd/connection.c:609
#36 connection_threadmain () at ldap/servers/slapd/connection.c:1735
#37 0x00007fd09e96e5cb in _pt_root (arg=0x7fd0a2875450) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#38 0x00007fd09e30e555 in start_thread (arg=0x7fd06b7f6700) at pthread_create.c:333
#39 0x00007fd09e048f3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
}}}

When deleting a host, the topology plugin deletes the related segments/replication agreements.
When deleting a replication agreement (frame 13), a pre-op plugin (mep) first searches for it. The search finds the entry, and the registered callback tries to get its status.
The replica agreement structure has already been freed when the search (frame 4) tries to get its status:

{{{
(gdb) frame 1
#1 0x00007fd0935bbd99 in agmt_get_replarea (ra=ra@entry=0x7fd0544ad550) at ldap/servers/plugins/replication/repl5_agmt.c:983
983 slapi_sdn_copy(ra->replarea, return_value);

(gdb) print *ra
$15 = {hostname = 0x7fd0540c4710 "mum\nidns\261", port = 1409286264, transport_flags = 32720, binddn = 0x0, creds = 0x0,
bindmethod = 3, replarea = 0x0, frac_attrs = 0x7fd0544db4e0, frac_attrs_total = 0x7fd0880469c0, schedule = 0x7fd05448f0f0,
auto_initialize = 502, dn = 0x0, rdn = 0x0, long_name = 0x0, protocol = 0x0, changecounters = 0x0, num_changecounters = -1,
max_changecounters = 256, last_update_start_time = 1447419296, last_update_end_time = 1447419296,
last_update_status = "-1 Unable to acquire replicaLDAP error: Can't contact LDAP server\000to acquire replica", '\000' <repeats 939 times>, update_in_progress = 0, is_enabled = 1, last_init_start_time = 1447414868, last_init_end_time = 1447414871,
last_init_status = "0 Total update succeeded\000s\000eded\000s", '\000' <repeats 990 times>, lock = 0x7fd0540c4720,
consumerRUV = 0x7fd0800c60e0, consumerSchemaCSN = 0x0, consumerRID = 16, tmpConsumerRID = 0, timeout = 120,
stop_in_progress = 0, busywaittime = 0, pausetime = 0, priv = 0x0, attrs_to_strip = 0x7fd0544a72e0, agreement_type = 0,
protocol_timeout = 0x0, maxcsn = 0x0, flowControlWindow = 1000, flowControlPause = 2000, attr_lock = 0x7fd0540c3cf0,
WaitForAsyncResults = 100}

}}}

This is quite unexpected, as it occurs in a PREOP; by that time the RA should not have been deleted yet.
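
Frame 1 shows {{{agmt_get_replarea()}}} copying {{{ra->replarea}}}, which the dump above shows is already NULL. A minimal defensive sketch of that accessor, assuming only the fields visible in the dump ({{{replarea}}}, {{{lock}}}) and not claiming to be the upstream fix, would bail out instead of dereferencing the NULL pointer:

{{{
/* Hedged sketch only: the field names (replarea, lock) come from the gdb
 * dump above; this is not necessarily how upstream fixed it. The idea is
 * to return NULL instead of dereferencing a NULL/freed replarea, so the
 * cn=config search callback can report an error rather than segfault. */
Slapi_DN *
agmt_get_replarea(const Repl_Agmt *ra)
{
    Slapi_DN *return_value = NULL;

    PR_ASSERT(NULL != ra);

    PR_Lock(ra->lock);
    if (ra->replarea) {
        return_value = slapi_sdn_new();
        slapi_sdn_copy(ra->replarea, return_value);
    }
    PR_Unlock(ra->lock);

    return return_value;
}
}}}

Callers such as {{{get_agmt_status()}}} (frame 2) would then have to handle a NULL return value.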

Was this a one-time incident, or does it happen more frequently?

As Thierry noticed, the agreement should not be deleted while the mep plugin tries to access it. Has there been a parallel attempt to remove the replica?
Are the core file and the log files still available?

It also looks like there are problems in mep:
- the replication agreement in cn=config is not in the managed area, so there is no need to handle it (see the sketch after this list)
- the deleted entry could be retrieved from the pblock, so why do an internal search?
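
A minimal sketch of the first point, assuming a pre-op plugin may simply skip operations whose target sits under cn=config; the helper name below is illustrative only and this is not the shipped mep code:

{{{
#include "slapi-plugin.h"

/* Illustrative only, not the shipped mep code: decide whether to skip
 * pre-op processing for operations targeting cn=config, where the
 * replication agreements live and where managed entries are never
 * configured. */
static int
mep_preop_skip_config(Slapi_PBlock *pb)
{
    Slapi_DN *sdn = NULL;
    Slapi_DN *config_sdn = slapi_sdn_new_dn_byval("cn=config");
    int skip = 0;

    /* The target of the current operation is already in the pblock;
     * no internal search is needed just to find out where we are. */
    slapi_pblock_get(pb, SLAPI_TARGET_SDN, &sdn);
    if (sdn && slapi_sdn_issuffix(sdn, config_sdn)) {
        skip = 1;
    }
    slapi_sdn_free(&config_sdn);

    return skip;
}
}}}

If the mep pre-op callback returned success immediately when such a check reports 1, a delete of a replication agreement under cn=config would never trigger the internal search seen in frames 4-9.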

IIRC it happened to me once or twice when the central master in a linear IPA topology stopped working. I uninstalled it and ran {{{ipa-replica-manage del --force}}} on another master to clean up dangling segments. This caused DS on that master to crash.

I have a similar topology set up so I will try to reproduce this issue if time permits.

As stated in comment #3, it is unclear why the replication agreement is already freed when the mep plugin tries to access it.
For this part of the problem we need a reproducer, so we will keep this ticket open for a while.

The other point is that the mep plugin should ignore changes in cn=config; this can be fixed independently and would probably prevent the crash. I will open a separate ticket for this.

Per weekly triage:

OK to move to 1.3.6/backlog/future; just do not close it, as it is a valid case without a reproducer.

Metadata Update from @nhosoi:
- Issue set to the milestone: 1.3.6 backlog

7 years ago

@mbabinsk Hi there, it's been more than a year now and we still don't have a reproducer. Can we close this issue?

Metadata Update from @firstyear:
- Custom field component reset (from Replication - General)
- Issue close_status updated to: None

6 years ago

Metadata Update from @mreynolds:
- Issue set to the milestone: FUTURE (was: 1.3.6 backlog)

6 years ago

@firstyear Yes feel free to close the issue.

Metadata Update from @mreynolds:
- Issue close_status updated to: worksforme
- Issue status updated to: Closed (was: Open)

6 years ago

389-ds-base is moving from Pagure to GitHub. This means that new issues and pull requests
will be accepted only in 389-ds-base's GitHub repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1678

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click the Subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: worksforme)

3 years ago
