#494 slapd entered to infinite loop during new index addition
Closed: Fixed None Opened 7 years ago by pspacek.

During IPA 2.2.0->3.0.0 upgrade various indexes are added. During index addition 389 locks in a infinite loop. Stacktraces etc. attached.


kill 6285 led to message "slapd shutting down - waiting for 2 threads to terminate" in errors log and slapd started to cycle on TWO cores. Before SIGTERM 389 cycled on single core.

I captured coredump after SIGTERM. It is too big for Trac, so there is a link:
http://people.redhat.com/~pspacek/a/2012-10-15/core.6285.bz2

GDB produced error message during core dump creation, see gcore.report file.

Slapd was killed after the message [15/Oct/2012:17:51:00 +0200]

Unfortunatelly DS was restarted by RPM scripts in few seconds, so database dumps are not exactly same (potentially).

I tried to export whole DB and again ended in infinite loop:
{{{

/var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot

Exported ldif file: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184047.ldif
[15/Oct/2012:18:40:47 +0200] - /etc/dirsrv/slapd-E-ORG/dse.ldif: nsslapd-maxdescriptors: nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
[15/Oct/2012:18:40:48 +0200] - Config Warning: - nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
ldiffile: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184047.ldif
^C
[root@unused-4-107 ~]# /var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot
Exported ldif file: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184158.ldif
[15/Oct/2012:18:41:58 +0200] - /etc/dirsrv/slapd-E-ORG/dse.ldif: nsslapd-maxdescriptors: nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
[15/Oct/2012:18:41:58 +0200] - Config Warning: - nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
ldiffile: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184158.ldif
}}}

At this moment ns-slapd uses 100 % CPU (again) and LDIF file don't change its size.

Core captured from ns-slapd during the export can be downloaded from ​http://people.redhat.com/~pspacek/a/2012-10-15/core.12285.bz2 . Stacks and produced LDIF are attached.

File ​http://people.redhat.com/~pspacek/a/2012-10-15/dirsrv_data.tar.bz2 contains /etc/dirsrv and /var/lib/dirsrv.

Looking at the attached stacktrace: "389stacktrace.txt​ (85.6 KB)", the main thread is waiting for the c_mutex lock in handle_pr_read_ready (Thread 12). And most likely, the lock is held by connection_enter_leave_turbo (Thread 1). I'm wondering the stacktrace is just a snapshot or it is showing a deadlock. To answer it, it'll be nice to take the stacktrace a couple of times and compare among them. Or attach gdb to the process which is hanging.

Do we still have the process? Is it difficult to reproduce the problem?

Well, CPU was under 100% load all the time. I killed the process after capturing stacktrace and coredump (see links in comment:1).

I'm able to reproduce the issue with current DB as described in comment:3. Can you download the http://people.redhat.com/~pspacek/a/2012-10-15/dirsrv_data.tar.bz2 and do the investigation?

Replying to [comment:5 pspacek]:

Well, CPU was under 100% load all the time. I killed the process after capturing stacktrace and coredump (see links in comment:1).

I'm able to reproduce the issue with current DB as described in comment:3. Can you download the http://people.redhat.com/~pspacek/a/2012-10-15/dirsrv_data.tar.bz2 and do the investigation?

Thanks, Petr. It looks you gave me a tar ball containing your environment. Could you tell me how to "reproduce" the problem? I have my f17 VM. I can override my IPA/DS env with yours. To cause the infinite loop, what operation should I send to the server?

Or could it be possible for you to make the infinite loop happen again. Then, generate multiple core files with gcore in gdb (probably, 5 or so?)? That way, it'll be clearer which loop cannot finish...

Thank you so much!

Hello Noriko,

comment:3 contains simple reproducer - run /var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot (with my database from the tarball).

I was forced to destroy old DB and start with the new one (because this instance was installed directly on my laptop).

I can't simply reproduce it again on my HW. However, I did LVM snapshot before destroying the DB, so I will go back in time if it's really inevitable.

Did you look into both attached stack traces? I captured two stack snapshots (but from two different runs).

Thank you, Petr! I could see the infinite loop. The problem looks like caused by an USN tombstone generation. I see an odd entry:
{{{
id 35547
rdn: nsuniqueid=b99eedd6-070411e2-ba7983cd-ea05094e,nsuniqueid=b99eedd6-070411e2-ba7983cd-ea05094e
nsUniqueId: b99eedd6-070411e2-ba7983cd-ea05094e
tXTRecord: "137"
dNSTTL: 666
objectClass: nsTombstone
[...]
idnsName: c178reproduce
parentid: 35028
entryid: 35547
nsParentUniqueId: 95dba6e0-070411e2-ba7983cd-ea05094e
nscpEntryDN: nsuniqueid=b99eedd6-070411e2-ba7983cd-ea05094e,idnsname=c178,idnsname=e.org,cn=dns,dc=e,dc=org
entryusn: 208228
}}}
This rdn should not be repeated nsuniqueId... I'm wondering if you see this entry (e.g., tXTRecord: "137"), could you tell what operation could generate this odd entry? I'm hoping that the tXTRecord value is associated with some test case...

Also, could there be more access logs which include the operations you run? The particular entry idnsName=c178 is repeatedly deleted. But I cannot find DEL in the access log you attached.

If it's easy to reproduce the problem, could it be possible to run the test with the audit log enabled in dse.ldif?
nsslapd-auditlog-logging-enabled: on
This way, all the update operations are logged in the audit log.

Objects idnsname=c<number>,idnsname=e.org,cn=dns,dc=e,dc=org were created by scripts loop.sh + nsupdate_add_del_gen.py + e.org.ldif (attached).

These scripts are targeted on stress-testing DNS dynamic update system, so objects and attributes are changed quickly in several parallel threads. As you can see, attributes with random value are added and deleted, so I can't point to exact case and time.

Is there some modifyTimestamp? If it is, you can search my logs:
http://people.redhat.com/~pspacek/a/2012-10-15/log-slapd-E-ORG.tar.bz2

I uploaded all logs which I have. Records from last two days are from the new DS installation (after wiping old DB). Stress tests were conducted 3-4 weeks ago I think, but I'm not 100 % sure.

Hi Petr,

Thanks for the info. I think I figured the cause of the problem. There
could be a small window in which it fails to check if the entry is already
tombston'ed or not. If it is, the operation to delete the entry should
back off. But the current code goes ahead and it tombstones the tombstone
entry again, which breaks the dn... I've built a scratch build with the
fix:
http://koji.fedoraproject.org/koji/taskinfo?taskID=4616851

Could you please run your stress test again using this scratch build?
Thank you so much, in advance...

(Just in case you wonder why I updated this link with the above one, this taskID=4616250 is based on the 389-ds-base-1.3.0.a1.
http://koji.fedoraproject.org/koji/taskinfo?taskID=4616250
I think the version you are testing is 389-ds-base1.2.11.16. Thus I rebuilt taskID=4616851 from the 1.2.11 branch.)

Bug Description: An entry which rdn is nsuniqueid=...,nsuniqueid=...
was accidentally generated. It broke the entryrdn tree and put
the reindex and export into the the infinite loop.

Fix Description: Added an missing check if the retrieved entry is
already an tombstone or not. This tombstone is the one converted
by the entryusn plugin. Please note that replication has an urp
code which handles such conflicts.

Hello,

I did some tests with 389-ds-base-1.2.11.17-0.1.scratch0.fc17.x86_64 - export to LDIF worked well. Sorry for late message, I forgot to reply after running the test.

P.S. It is a race condition, you know, my test results can't be taken as definitive ...

Replying to [comment:16 pspacek]:

Hello,

I did some tests with 389-ds-base-1.2.11.17-0.1.scratch0.fc17.x86_64 - export to LDIF worked well. Sorry for late message, I forgot to reply after running the test.

P.S. It is a race condition, you know, my test results can't be taken as definitive ...

Thank you so much for testing my scratch build, Petr! Right. It's hard to "prove" the patch fixes all the bad results from the race condition. But definitely, it is a great indication that it passed your tests! Let us check in the patch and close the ticket once. If you run into the issue, please feel free to reopen this ticket, again.

Once again, thanks a lot!

Reviewed by Rich and tested by Petr. Thank you, both of you!!

Pushed to master.

{{{
$ git merge trac494
Updating 927b0ef..b963576
Fast-forward
ldap/servers/slapd/back-ldbm/ldbm_delete.c | 7 +++++++
1 file changed, 7 insertions(+)
$ git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 999 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
927b0ef..b963576 master -> master
}}}

Cherry-picked and pushed to origin/389-ds-base-1.2.11.

{{{
$ git push origin 389-ds-base-1.2.11-ext:389-ds-base-1.2.11
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.03 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
56ebbb2..31d1578 389-ds-base-1.2.11-ext -> 389-ds-base-1.2.11
}}}

Metadata Update from @pspacek:
- Issue assigned to rmeggins
- Issue set to the milestone: 1.2.11.17

2 years ago

Login to comment on this ticket.

Metadata