While working on new topology-related features in FreeIPA, I encountered a segfault in the dirsrv process. I had three machines in a linear topology:
{{{master1.ipa.test <-> replica1.ipa.test <-> client1.ipa.test}}}
replica1.ipa.test was malfunctioning and was therefore uninstalled. During the forced removal of its master entry and leftover segments, dirsrv crashed:
{{{
# ipa-replica-manage del replica1.ipa.test --force
Checking connectivity in topology suffix 'ipaca'
'replica1.ipa.test' is not a part of topology suffix 'ipaca'
Not checking connectivity
Checking connectivity in topology suffix 'realm'
WARNING: Removal of 'replica1.ipa.test' will lead to disconnected topology in suffix 'realm'
Changes will not be replicated to all servers and data will become inconsistent.
You need to add segments to prevent disconnection of the topology.
Errors in topology after removal:
Topology does not allow server client1.ipa.test to replicate with servers:
    master1.ipa.test
Topology does not allow server master1.ipa.test to replicate with servers:
    client1.ipa.test
Forcing removal of replica1.ipa.test
Failed to cleanup replica1.ipa.test entries: cannot connect to 'ldaps://master1.ipa.test:636':
You may need to manually remove them from the tree
unexpected error: cannot connect to 'ldapi://%2fvar%2frun%2fslapd-IPA-TEST.socket':
}}}
Relevant systemd journal entries:
{{{
Nov 13 12:56:57 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:56:57 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:56:57 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:09 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:57:09 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:57:09 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:10 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:57:10 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:57:10 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:57:15 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:16 master1.ipa.test ns-slapd[30163]: GSSAPI server step 1
Nov 13 12:57:16 master1.ipa.test ns-slapd[30163]: GSSAPI server step 2
Nov 13 12:57:16 master1.ipa.test ns-slapd[30163]: GSSAPI server step 3
Nov 13 12:57:19 master1.ipa.test systemd[1]: dirsrv@IPA-TEST.service: main process exited, code=killed, status=11/SEGV
Nov 13 12:57:19 master1.ipa.test systemd[1]: Unit dirsrv@IPA-TEST.service entered failed state.
Nov 13 12:57:19 master1.ipa.test systemd[1]: dirsrv@IPA-TEST.service failed.
Nov 13 13:15:46 master1.ipa.test systemd[1]: Stopped 389 Directory Server IPA-TEST..
}}}
The stacktrace can be found here: http://fpaste.org/290155/
389-ds version: 389-ds-base-1.3.4.4-1.fc22.x86_64
When a host is deleted, the topology plugin deletes the related segments/replication agreements. When a replication agreement is being deleted (frame 13), a preop plugin (mep) first searches for it. It finds the agreement and tries to get its status (via a registered callback). However, the replication agreement structure has already been freed by the time the search (frame 4) tries to get its status:
{{{
(gdb) frame 1
#1  0x00007fd0935bbd99 in agmt_get_replarea (ra=ra@entry=0x7fd0544ad550)
    at ldap/servers/plugins/replication/repl5_agmt.c:983
983             slapi_sdn_copy(ra->replarea, return_value);

(gdb) print *ra
$15 = {hostname = 0x7fd0540c4710 "mum\nidns\261", port = 1409286264, transport_flags = 32720,
  binddn = 0x0, creds = 0x0, bindmethod = 3, replarea = 0x0, frac_attrs = 0x7fd0544db4e0,
  frac_attrs_total = 0x7fd0880469c0, schedule = 0x7fd05448f0f0, auto_initialize = 502,
  dn = 0x0, rdn = 0x0, long_name = 0x0, protocol = 0x0, changecounters = 0x0,
  num_changecounters = -1, max_changecounters = 256, last_update_start_time = 1447419296,
  last_update_end_time = 1447419296,
  last_update_status = "-1 Unable to acquire replicaLDAP error: Can't contact LDAP server\000to acquire replica", '\000' <repeats 939 times>,
  update_in_progress = 0, is_enabled = 1, last_init_start_time = 1447414868,
  last_init_end_time = 1447414871,
  last_init_status = "0 Total update succeeded\000s\000eded\000s", '\000' <repeats 990 times>,
  lock = 0x7fd0540c4720, consumerRUV = 0x7fd0800c60e0, consumerSchemaCSN = 0x0,
  consumerRID = 16, tmpConsumerRID = 0, timeout = 120, stop_in_progress = 0,
  busywaittime = 0, pausetime = 0, priv = 0x0, attrs_to_strip = 0x7fd0544a72e0,
  agreement_type = 0, protocol_timeout = 0x0, maxcsn = 0x0, flowControlWindow = 1000,
  flowControlPause = 2000, attr_lock = 0x7fd0540c3cf0, WaitForAsyncResults = 100}
}}}
This is quite unexpected, as it occurs in a PREOP, so by that time the RA should not have been deleted yet.
Was this a one-time incident, or does it happen more frequently?
As Thierry noticed, the agreement should not be deleted while the mep plugin tries to access it. Has there been a parallel attempt to remove the replica? Are the core file and the log files still available?
It also looks like there are problems in mep:
- the replication agreement in cn=config is not in the managed area, so there is no need to handle it
- the deleted entry could be retrieved from the pblock, so why do an internal search?
IIRC it happened to me once or twice when the central master in a linear IPA topology stopped working. I uninstalled it and ran {{{ipa-replica-manage del --force}}} on another master to clean up the dangling segments. This caused DS on that master to crash.
I have a similar topology set up so I will try to reproduce this issue if time permits.
As stated in comment #3, it is unclear why the replication agreement has already been freed when the mep plugin tries to access it. For this part of the problem we need a reproducer, so we will keep this ticket open for a while.
The other point is that the mep plugin should ignore changes in cn=config. This can be fixed independently and would probably have prevented the crash. I will open a separate ticket for this.
Per weekly triage:
OK to move to 1.3.6/backlog/future/, just do not close it, as it is a valid case without a reproducer.
Metadata Update from @nhosoi: - Issue set to the milestone: 1.3.6 backlog
@mbabinsk Hi there, it's been more than a year now and we still don't have a reproducer. Can we close this issue?
Metadata Update from @firstyear: - Custom field component reset (from Replication - General) - Issue close_status updated to: None
Metadata Update from @mreynolds: - Issue set to the milestone: FUTURE (was: 1.3.6 backlog)
@firstyear Yes feel free to close the issue.
Metadata Update from @mreynolds: - Issue close_status updated to: worksforme - Issue status updated to: Closed (was: Open)
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1678
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: worksforme)