After replication is configured, issue 'start' via the master server's replication agreement attribute nsds5BeginReplicaRefresh.
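For reference, the total-update refresh is normally triggered with an ldapmodify against the replication agreement entry; a minimal sketch (the agreement DN and suffix below are made-up examples, not from this deployment):

```ldif
dn: cn=exampleAgreement,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
changetype: modify
replace: nsds5BeginReplicaRefresh
nsds5BeginReplicaRefresh: start
```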
On the slave server, the replication process keeps restarting because each pass takes too long; it is interrupted and restarted again.
On the master server, the ns-slapd daemon runs at 100% CPU usage.
389-ds-base-1.3.7.5-25.el7_5.x86_64 on RHEL 7.5
Configure replication from a master to a read-only replica.
Replication did not complete and the master LDAP daemon hangs.
Replication complete
<img alt="gdb-backtrace.txt" src="/389-ds-base/issue/raw/files/5422acec601193f3b8853aa675ac1afd6e31d33c4ae50aaf9065de0c8ced88f7-gdb-backtrace.txt" />
@dkt, do you mind upgrading to 389-ds-base-1.3.7.5-26.el7_5? Some issues with nunc-stans (deadlock, fd leak, and high CPU) were recently discovered, and the workaround was to disable it. You may also try, with your current version, setting:

dn: cn=config
nsslapd-enable-nunc-stans: off
And restart the server
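Applied with ldapmodify, the change above might look like this sketch (bind with whatever admin identity you normally use, e.g. the Directory Manager):

```ldif
dn: cn=config
changetype: modify
replace: nsslapd-enable-nunc-stans
nsslapd-enable-nunc-stans: off
```

followed by a restart of the instance, e.g. `systemctl restart dirsrv@<instance>` (instance name as on your system).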
Metadata Update from @tbordaz: - Custom field component adjusted to None - Custom field origin adjusted to None - Custom field reviewstatus adjusted to None - Custom field type adjusted to None - Custom field version adjusted to None
<img alt="master-errorlog.txt.xz" src="/389-ds-base/issue/raw/files/bcec29c407ba7fa86bfc73478d72a5441e6e41eb5bfc722817ec4db481cc541d-master-errorlog.txt.xz" />
This ticket is possibly a duplicate of #49871 (https://pagure.io/389-ds-base/issue/49871#comment-523649). You may use 'top -H -p <pid>' to confirm that the high-CPU thread is the event thread.
The new version (1.3.7.5-26) does not seem to appear in the rhel-7-server-rpms channel yet. I would try turning off the 'nsslapd-enable-nunc-stans' option and trying again.
Before turning off the nunc-stans:
This thread (pid 4120) has high CPU usage.
<img alt="pstack.zip" src="/389-ds-base/issue/raw/files/42ff31400dd6f1bf4e8bfc134e3d43e8fc0b47edf4f7ed04eb625ccae3c065ac-pstack.zip" />
<img alt="after-nsslapd-enable-nunc-stans-off.zip" src="/389-ds-base/issue/raw/files/9fa120582788519ea5335a70419f276956742dbbafc4dcfc74c9c6c0f732db52-after-nsslapd-enable-nunc-stans-off.zip" /> The problem remains after nsslapd-enable-nunc-stans is set to off; gdb and pstack output attached.
Indeed the problem is not connection related. The pstack shows the replication agreement building the set of entries to send (parentid>=1). It takes some time to build the candidate list. Does it fail? Does it fail on the master or the consumer? How large is the DB (number of entries)? I would expect the master to consume CPU while building the sorted list of entries to send, and then send the entries. The pstacks (both with and without nunc-stans) show consumption while preparing the list, so on the consumer side no entries have been received yet.
My DB is very small, about 50 entries. Before disabling nunc-stans, replication was stuck for about one hour; I didn't wait for it to finish because it made 12 passes and never completed. During that time the master had high CPU. But I had two extra schema files added: 60kerberos.ldif (from the MIT Kerberos git, because it has krbCanonicalName) and 60pam-plugin.ldif.
<img alt="slave-errorlog.txt.xz" src="/389-ds-base/issue/raw/files/e1d07ccae38a5644b7b4f6d907fd8fb885176daf31e315945b486883ddadb11c-slave-errorlog.txt.xz" /> slave-errorlog (for complete case study, the time should be about the same as in the master-errorlog)
@dkt, so far I fail to reproduce. Would it be possible to share your db2ldif file and dse.ldif ?
A redacted version of dse.ldif (master) was sent to your email account at Red Hat (I hope it's correct). Still working on the db2ldif file because there is Kerberos data inside it.
Thanks.
@dkt, that was the correct email address :), thank you sooooo much for your help. I will try to reproduce it locally .
A redacted version of the db2ldif LDIF dump was also sent. Thanks.
Thanks @tbordaz! I remember an issue like this ages ago where the list building needed to be sorted in a certain order, and a mistake from ASAN fixes caused it to never complete. I did resolve it, but maybe it's back? Maybe something was missed? Sadly I don't know the ticket number (it was back when @nhosoi was still with us, so it's old).
Looks like the replication problem is DB specific. I had a pair of 389ds that had replication working, I can export my data and import to that pair of instances without problem.
FYI, I grabbed some info when the CPU is at 100%:

Samples: 3K of event 'cpu-clock', Event count (approx.): 923906250
Overhead  Shared Object    Symbol
 73.71%   libback-ldbm.so  [.] idl_id_is_in_idlist
 18.00%   libback-ldbm.so  [.] idl_new_range_fetch
  0.83%   [kernel]         [k] vsnprintf
  0.52%   [kernel]         [k] format_decode
  0.51%   libback-ldbm.so  [.] idl_id_is_in_idlist@plt
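For anyone wanting to gather the same data, a profile like the one above can be captured with perf (a sketch; `<pid>` is a placeholder for the ns-slapd pid):

```shell
# live view of the hottest symbols in the busy process
perf top -p <pid>
# or record for 30s and inspect offline
perf record -p <pid> -- sleep 30 && perf report
```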
This seems a lot like an infinite loop in trying to build the list of entries to send for replication. I swear that I have seen this before, but also that the issue was resolved. I'm really curious though, because if you can db2ldif I think it uses the same code ....
For your backtrace, can you do "thread apply all bt full" instead of "bt full"? Thanks!
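A non-interactive way to capture that (a sketch; `<pid>` is a placeholder for the ns-slapd pid):

```shell
gdb -p <pid> -batch -ex "thread apply all bt full" > gdb-thread-apply-all-bt-full.txt
```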
<img alt="gdb-thread-apply-all-bt-full.txt.xz" src="/389-ds-base/issue/raw/files/7c848e80cd82f82a3a2de43cdee905f82469b68ee484ad1345ba2a3d50928824-gdb-thread-apply-all-bt-full.txt.xz" /> Attached is the output of gdb thread apply all bt full
@firstyear gdb output of 'thread apply all bt full' is attached.
I upgraded the 389-ds with RHEL 7.6 beta (RPM only, not full OS upgrade). Version is now 389-ds-base-1.3.8.4-9.el7.x86_64. Problem persisted for that particular instance.
Since the debug RPM is not available for individual download (EDIT: no debug package/ISO is available), I attached the gdb backtrace here. <img alt="gdb_trace_for_389-ds-base-1.3.8.4-9.el7.txt" src="/389-ds-base/issue/raw/files/cca6ae6840f68394bf0586186bde6d68afbe95ed1cb61f7ec3102d679b014ffa-gdb_trace_for_389-ds-base-1.3.8.4-9.el7.txt" />
BTW, I have trimmed down the DB/data inside the instance (keys/passwords removed). This db2bak data causes 100% CPU when doing the replication. The instance name is cms1, so that you can reproduce the problem.
db2bak output is in here: https://drive.google.com/open?id=1yzYEAfeoa5n5eYN-D92eIKgSi6RUu3DE
It seems the problem is in the Kerberos tree in the ldap data. If I don't include the Kerberos tree, the replication seems fine.
Something specific to the provided db2bak is that the RUV entry (nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,<suffix>) appears as the first entry of the db, instead of the suffix. Your ldif also contains the RUV in first position; I will try to reproduce with this specific order.
yes, I had the same suspicion and vaguely remember a problem with this
@lkrispen, I was initially thinking a crafted ldif could be responsible for this, but if the RUV is the first entry in the ldif, the RUV entry is skipped. So I think there must exist another scenario where the RUV ends up as the first entry.
You can create this scenario if you configure a suffix and enable replication for it before adding the suffix entry. Then the RUV gets id 1 and the suffix id 2.
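A hedged LDIF sketch of that ordering (DNs, objectClasses beyond the replica basics, and the replica id are illustrative examples): the replica configuration entry is added while the suffix entry does not yet exist, so the RUV tombstone is created first and receives entry-id 1.

```ldif
# 1) replica enabled on a still-empty suffix -> RUV tombstone gets entry-id 1
dn: cn=replica,cn=dc\3Dlocal\2Cdc\3Dnonet,cn=mapping tree,cn=config
changetype: add
objectClass: top
objectClass: nsDS5Replica
cn: replica
nsDS5ReplicaRoot: dc=local,dc=nonet
nsDS5ReplicaId: 2
nsDS5ReplicaType: 3

# 2) only afterwards, the suffix entry itself -> it gets entry-id 2
dn: dc=local,dc=nonet
changetype: add
objectClass: top
objectClass: domain
dc: local
```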
Using problem-db data:
id 1 (created 13:59:35 according to replicageneration)
 rdn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff
 objectClass: top
 objectClass: nsTombstone
 objectClass: extensibleobject
 nsUniqueId: ffffffff-ffffffff-ffffffff-ffffffff
 nsds50ruv: {replicageneration} 5b7d6c47000000020000
 nsds50ruv: {replica 2 ldap://<masterlab.local.nonet:389} 5b7d6c4f000000020000 5b7d6c77000200020000
 dc: local
 nscpEntryDN: dc=local,dc=nonet
 nsruvReplicaLastModified: {replica 2 ldap://cms1.lab.local.nonet:389} 5b7d6c77
 nsds5agmtmaxcsn: dc=local,dc=nonet;Replication Agreement cms1 to cms1b;cms1.lab.local.nonet;2389;unavailable

id 2 (created 13:59:43)
 rdn: dc=local,dc=nonet
 createTimestamp;vucsn-5b7d6c4f000000020000: 20180822135943Z
 entryid: 2

id 17 (created 14:00:23)
 rdn: uid=user2
 ...
 createTimestamp;vucsn-5b7d6c77000200020000: 20180822140023Z
 entryid: 17
This matches the scenario described by Ludwig. Next is checking whether such a scenario can lead to 100% CPU during total init.
I managed to reproduce the 100% CPU problem locally. The problem seems to appear under 2 conditions:
- The suffix entry is not entry-id 1, because the suffix was created after the RUV entry was created
- There exist intermediate (non-leaf) entries with entry-ids starting with 1* (e.g. 10, 11, 100...)
I am preparing a testcase
The problem can easily be worked around if the replica is created after the suffix entry has been created. To recover from the state where the replica was created before the suffix, you may export an ldif (without the -r option) and reinitialize the replica with that ldif file.
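With the classic server scripts, that recovery might look like this sketch (instance and suffix names are examples; the export deliberately omits -r so no replication/RUV data is carried over):

```shell
# export the suffix without replication data (no -r)
db2ldif -Z <instance> -s "dc=local,dc=nonet" -a /tmp/clean-export.ldif
# re-import the clean ldif
ldif2db -Z <instance> -s "dc=local,dc=nonet" -i /tmp/clean-export.ldif
# then re-initialize the consumer (nsds5BeginReplicaRefresh: start on the agreement)
```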
Metadata Update from @mreynolds: - Issue set to the milestone: 1.3.8
Patch under review https://pagure.io/389-ds-base/pull-request/49936
Metadata Update from @tbordaz: - Issue set to the milestone: None (was: 1.3.8)
Metadata Update from @gparente: - Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1626375
Metadata Update from @mreynolds: - Issue set to the milestone: 1.3.7.0
c73cd26..bdb8676 master
06d9fcc..b6c146f 389-ds-base-1.3.8 -> 389-ds-base-1.3.8
722a6f8..942890e 389-ds-base-1.3.7 -> 389-ds-base-1.3.7
Metadata Update from @tbordaz: - Issue close_status updated to: fixed - Issue status updated to: Closed (was: Open)
Fix compiler warning
2ce769e..a8c37ea master
b6c146f..8ca3fb8 389-ds-base-1.3.8 -> 389-ds-base-1.3.8
942890e..dd4b13c 389-ds-base-1.3.7 -> 389-ds-base-1.3.7
@tbordaz, you are still missing this change I think:
index f8b86d92e..5749e2676 100644
--- a/ldap/servers/slapd/back-ldbm/proto-back-ldbm.h
+++ b/ldap/servers/slapd/back-ldbm/proto-back-ldbm.h
@@ -305,6 +305,8 @@
 int index_buffer_init(size_t size, int flags, void **h);
 int index_buffer_flush(void *h, backend *be, DB_TXN *txn, struct attrinfo *a);
 int index_buffer_terminate(void *h);
+int get_suffix_key(Slapi_Backend *be, struct _back_info_index_key *info);
+int set_suffix_key(Slapi_Backend *be, struct _back_info_index_key *info);
Oops, sorry.

8ca3fb8..bcb97d4 389-ds-base-1.3.8 -> 389-ds-base-1.3.8
dd4b13c..cb21dc5 389-ds-base-1.3.7 -> 389-ds-base-1.3.7
Reminder to review https://pagure.io/389-ds-base/pull-request/49636 given its relation to this :)
Commit 0319ec0 fixes this issue
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/2974
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)