#49915 Master ns-slapd had 100% CPU usage after starting replication and replication cannot finish
Closed: wontfix 5 years ago. Opened 5 years ago by dkt.

Issue Description

After replication is configured, issue 'start' on the master by setting nsds5BeginReplicaRefresh in the replication agreement entry.

On the slave server, the replication process keeps restarting because each pass takes too long; it is interrupted and starts over again.

On the master server, the ns-slapd daemon has 100% CPU usage.

Package Version and Platform

389-ds-base-1.3.7.5-25.el7_5.x86_64 on RHEL 7.5

Steps to reproduce

Configure replication from master to read only replica.

Actual results

Replication did not complete and the master LDAP daemon hangs.

Expected results

Replication completes.


@dkt, do you mind upgrading to 389-ds-base-1.3.7.5-26.el7_5? Some issues with nunc-stans (deadlock, fd leak and high CPU) were recently discovered, and the workaround was to disable it.
You may also try it with your current version by setting:

dn: cn=config
nsslapd-enable-nunc-stans: off

And restart the server
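
If it is easier, the same setting can be applied over LDAP instead of editing dse.ldif by hand. A minimal sketch, assuming Directory Manager credentials and a local server (the file name is just a placeholder):

    # nunc-stans-off.ldif
    dn: cn=config
    changetype: modify
    replace: nsslapd-enable-nunc-stans
    nsslapd-enable-nunc-stans: off

    # apply it, then restart the instance:
    ldapmodify -x -D "cn=Directory Manager" -W -f nunc-stans-off.ldif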

Metadata Update from @tbordaz:
- Custom field component adjusted to None
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None
- Custom field type adjusted to None
- Custom field version adjusted to None

5 years ago

This ticket is possibly a duplicate of #49871 (https://pagure.io/389-ds-base/issue/49871#comment-523649). You may use 'top -H -p <pid>' to confirm the high CPU thread is the event-thread.
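
A minimal sketch of that check (looking the pid up via pidof is an assumption about the setup):

    # Show per-thread CPU usage for ns-slapd; the busy TID can then be
    # matched against the thread list in a gdb/pstack dump.
    top -H -p $(pidof ns-slapd)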

The new version (1.3.7.5-26) does not seem to have appeared in the rhel-7-server-rpms channel yet.
I will try turning off the 'nsslapd-enable-nunc-stans' option and run the test again.

Before turning off nunc-stans, I checked with 'top -H' as suggested: thread/PID 4120 is the one consuming high CPU.

Indeed the problem is not connection related. The pstack shows the replication agreement building a set of entries to send (parentid>=1). It takes some time to build the candidate list; does it fail? Does it fail on the master or the consumer? How large is the DB (number of entries)?
I would expect the master to consume CPU building the sorted list of entries to send, then send the entries. The pstacks (both with and without nunc-stans) show consumption while preparing the list, so on the consumer side no entries have been received yet.

My DB is very small, about 50 entries.
Before disabling nunc-stans, replication was stuck for about an hour. I didn't wait for it to finish because it took 12 passes and never completed; during that time the master had high CPU.
But I had two extra schema files added: 60kerberos.ldif (from the MIT Kerberos git, because it has krbCanonicalName) and 60pam-plugin.ldif.

@dkt, so far I have failed to reproduce. Would it be possible to share your db2ldif output and dse.ldif?
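
For reference, both files can usually be produced like this on RHEL 7 (instance name cms1 as used later in this ticket; paths are the stock ones):

    # Export the userRoot backend to LDIF
    db2ldif -Z cms1 -n userRoot -a /tmp/cms1.ldif

    # dse.ldif is simply the instance's configuration file
    cp /etc/dirsrv/slapd-cms1/dse.ldif /tmp/dse.ldif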

A redacted version of dse.ldif (master) was sent to your Red Hat email account (I hope it's correct). Still working on the db2ldif file because there is Kerberos data inside it.

Thanks.

@dkt, that was the correct email address :), thank you sooooo much for your help. I will try to reproduce it locally.

A redacted version of the db2ldif LDIF dump was also sent. Thanks.

Thanks @tbordaz! I remember an issue like this ages ago where the list building needed to be sorted in a certain order, and a mistake from the ASAN fixes caused it to never complete. I did resolve it, but maybe it's there again? Maybe something was missed? Sadly I don't know the ticket number (it was back when @nhosoi was still with us, so it's old).

Looks like the replication problem is DB specific. I had a pair of 389ds instances with replication working; I can export my data and import it into that pair without problem.

FYI, I grabbed some info while the CPU was at 100%:

    Samples: 3K of event 'cpu-clock', Event count (approx.): 923906250
    Overhead  Shared Object    Symbol
     73.71%   libback-ldbm.so  [.] idl_id_is_in_idlist
     18.00%   libback-ldbm.so  [.] idl_new_range_fetch
      0.83%   [kernel]         [k] vsnprintf
      0.52%   [kernel]         [k] format_decode
      0.51%   libback-ldbm.so  [.] idl_id_is_in_idlist@plt
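
That output looks like a 'perf top' profile; for reference, a typical capture sequence (pid 4120 taken from the earlier comment):

    # Live per-symbol profile of the busy process
    perf top -p 4120

    # Or record for ten seconds and inspect offline
    perf record -p 4120 -- sleep 10
    perf report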

This seems a lot like an infinite loop in trying to build the list of entries to send for replication. I swear that I have seen this before, but also that the issue was resolved. I'm really curious though, because if you can run db2ldif successfully, I think it uses the same code...

With your backtrace, can you do "thread apply all bt full" instead of "bt full"? Thanks!
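
For what it's worth, that can be captured non-interactively with standard gdb batch mode:

    gdb -p $(pidof ns-slapd) -batch \
        -ex "thread apply all bt full" > gdb_trace.txt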

@firstyear gdb output of 'thread apply all bt full' is attached.

I upgraded 389-ds using the RHEL 7.6 beta (RPMs only, not a full OS upgrade). The version is now 389-ds-base-1.3.8.4-9.el7.x86_64. The problem persists for that particular instance.

Since the debug RPM is not available to download individually (EDIT: no debug package/ISO is available), I attached the gdb backtrace here.
gdb_trace_for_389-ds-base-1.3.8.4-9.el7.txt

BTW, I have trimmed down the DB/data inside the instance (keys/passwords removed).
This db2bak data causes 100% CPU when doing the replication.
The instance name is cms1, so you can reproduce the problem.

db2bak output is in here:
https://drive.google.com/open?id=1yzYEAfeoa5n5eYN-D92eIKgSi6RUu3DE
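
To load that backup into a test instance, an offline restore along these lines should work (bak2db is the stock 1.3.x tool; the archive path is a placeholder):

    # With the target instance (cms1) stopped:
    bak2db /path/to/extracted/backup -Z cms1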

It seems the problem is in the Kerberos tree in the LDAP data. If I don't include the Kerberos tree, the replication is fine.

Something specific to the provided db2bak is that the RUV entry (nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,<suffix>) is the first entry of the DB, instead of the suffix.
Your LDIF also contains the RUV in first position; I will try to reproduce with this specific order.

Yes, I had the same suspicion and vaguely remember a problem with this.

@lkrispen, I was initially thinking a crafted LDIF could be responsible for this, but if the RUV is the first entry in the LDIF, the RUV entry is skipped on import. So there must be another scenario where the RUV ends up as the first entry.

You can create this scenario if you configure a suffix and enable replication for it before adding the suffix entry; then the RUV gets entry-id 1 and the suffix entry-id 2. A rough sketch follows.
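
For anyone trying to recreate that ordering, a minimal sketch (attribute set abbreviated; it assumes the backend and mapping tree for the suffix already exist, and a real replica also needs a bind DN):

    # replica-first.ldif -- add the replica config BEFORE the suffix entry
    dn: cn=replica,cn="dc=local,dc=nonet",cn=mapping tree,cn=config
    changetype: add
    objectClass: top
    objectClass: nsds5replica
    cn: replica
    nsDS5ReplicaRoot: dc=local,dc=nonet
    nsDS5ReplicaId: 2
    nsDS5ReplicaType: 3
    nsDS5Flags: 1

    # Only now add the suffix entry: the RUV tombstone already took
    # entry-id 1, so the suffix becomes entry-id 2.
    dn: dc=local,dc=nonet
    changetype: add
    objectClass: top
    objectClass: domain
    dc: local

    # apply with: ldapmodify -x -D "cn=Directory Manager" -W -f replica-first.ldif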

Using problem-db data:

id 1 (created 13:59:35 according to replicageneration)
    rdn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff
    objectClass: top
    objectClass: nsTombstone
    objectClass: extensibleobject
    nsUniqueId: ffffffff-ffffffff-ffffffff-ffffffff
    nsds50ruv: {replicageneration} 5b7d6c47000000020000
    nsds50ruv: {replica 2 ldap://cms1.lab.local.nonet:389} 5b7d6c4f000000020000 5b7d6c77000200020000
    dc: local
    nscpEntryDN: dc=local,dc=nonet
    nsruvReplicaLastModified: {replica 2 ldap://cms1.lab.local.nonet:389} 5b7d6c77
    nsds5agmtmaxcsn: dc=local,dc=nonet;Replication Agreement cms1 to cms1b;cms1.la
     b.local.nonet;2389;unavailable

id 2 (created 13:59:43)
    rdn: dc=local,dc=nonet
    createTimestamp;vucsn-5b7d6c4f000000020000: 20180822135943Z
    entryid: 2

id 17 (created 14:00:23)
    rdn: uid=user2
    ...
    createTimestamp;vucsn-5b7d6c77000200020000: 20180822140023Z                                                                     
    entryid: 17

This matches the scenario described by Ludwig.
Next is checking whether such a scenario can lead to 100% CPU during total init.

I managed to reproduce the 100% CPU problem locally.
The problem seems to appear under two conditions:
- The suffix entry is not entry-id 1, because the suffix was created after the RUV entry
- There exist intermediate (non-leaf) entries with entry-ids starting with 1 (e.g. 10, 11, 100...)

I am preparing a testcase
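
One speculative way to see why entry-ids "starting with 1" could matter (my guess at the mechanism, not taken from the fix): equality keys in the parentid index are stored as strings like "=1", "=10", "=2", so a range scan iterates in lexicographic rather than numeric order. A quick illustration:

    # numeric order would be 1 2 10 11 100; string-sorted keys differ:
    printf '=%s\n' 1 2 10 11 100 | LC_ALL=C sort
    # =1
    # =10
    # =100
    # =11
    # =2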

The problem can easily be worked around by creating the replica after the suffix entry has been created.
To recover from a state where the replica was created first, you may export the LDIF (without the -r option) and reinitialize the replica with that LDIF file.
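
In command form, the recovery would look roughly like this (instance and backend names follow this ticket; the tools must be run with the instance stopped):

    # Export WITHOUT -r, so no replication metadata (RUV) ends up in the LDIF
    db2ldif -Z cms1 -n userRoot -a /tmp/cms1-clean.ldif

    # Re-import; the suffix entry now gets entry-id 1 again
    ldif2db -Z cms1 -n userRoot -i /tmp/cms1-clean.ldif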

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.3.8

5 years ago

Metadata Update from @tbordaz:
- Issue set to the milestone: None (was: 1.3.8)

5 years ago

Metadata Update from @gparente:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1626375

5 years ago

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.3.7.0

5 years ago

c73cd26 .. bdb8676 master
06d9fcc..b6c146f 389-ds-base-1.3.8 -> 389-ds-base-1.3.8
722a6f8..942890e 389-ds-base-1.3.7 -> 389-ds-base-1.3.7

Metadata Update from @tbordaz:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

5 years ago

Fix compiler warning

2ce769e..a8c37ea master
b6c146f..8ca3fb8 389-ds-base-1.3.8 -> 389-ds-base-1.3.8
942890e..dd4b13c 389-ds-base-1.3.7 -> 389-ds-base-1.3.7

@tbordaz, you are still missing this change I think:

index f8b86d92e..5749e2676 100644
--- a/ldap/servers/slapd/back-ldbm/proto-back-ldbm.h
+++ b/ldap/servers/slapd/back-ldbm/proto-back-ldbm.h
@@ -305,6 +305,8 @@ int index_buffer_init(size_t size, int flags, void **h);
 int index_buffer_flush(void *h, backend *be, DB_TXN *txn, struct attrinfo *a);
 int index_buffer_terminate(void *h);

+int get_suffix_key(Slapi_Backend *be, struct _back_info_index_key *info);
+int set_suffix_key(Slapi_Backend *be, struct _back_info_index_key *info);

Oppsss.. sorry
8ca3fb8..bcb97d4 389-ds-base-1.3.8 -> 389-ds-base-1.3.8
dd4b13c..cb21dc5 389-ds-base-1.3.7 -> 389-ds-base-1.3.7

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/2974

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago
