Originally reported on the IPA mailing list: https://www.redhat.com/archives/freeipa-users/2013-December/msg00106.html
I am running IPA on Fedora 19 with one master and one replica. Originally I had an issue where the replica would hang while starting up. It appeared to be a deadlock in cos_cache.c which caused any LDAP request to hang. I was able to workaround this issue with Rich's suggestion: https://www.redhat.com/archives/freeipa-users/2013-December/msg00117.html
Now the server no longer hangs while starting up. However it does hang when adding a new user with ipa user-add joe-test --first=joe --last=test. The backtrace (attached) seems to still indicate a deadlock in cos_cache.c. Both the master and replica hang depending where the request is sent.
ipa user-add joe-test --first=joe --last=test
I was also apparently able to remedy the replica by running ipa-replica-manage re-initialize --from p-ipa-wd01.prod.the.flatiron.com. Now user add requests to the replica work fine while user add requests to the master hang.
ipa-replica-manage re-initialize --from p-ipa-wd01.prod.the.flatiron.com
attachment stacktrace.1387489013.txt
Would it be safe to also run ipa-replica-manage re-initialize on the master?
ipa-replica-manage re-initialize
Replying to [comment:1 jmou]:
Yes. In IPA, there really isn't a "master" master instance. All of the replicas are equally masters.
I ran ipa-replica-manage re-initialize --from p-ipa-wd02.prod.the.flatiron.com on the master but it still hangs.
ipa-replica-manage re-initialize --from p-ipa-wd02.prod.the.flatiron.com
We just ran into this issue again today. Adding users to the master (or primary replica, if you will) hangs, but adding users to the replica is fine. Any other thoughts on things I can try?
Could you try running this libdb utility and attach the result to this ticket?
Most likely the deadlock is caused by the combination of Thread 8 and 41 from stacktrace.1387489013.txt.
Thread 8 holds dblayer lock and has started transaction, then it is waiting for the COS change_lock released.
Thread 41 holds COS change_lock, then it is trying to read DB which is blocked by the transaction started by Thread 8.
But I'm not sure what Thread 1 and 15 are waiting for. Both look waiting for conn->c_mutex.
db_stat -h /var/lib/dirsrv/slapd-<MASTER>/db/ -C A db_stat.out
Same command after issuing ipa user-add (server is hung) db_stat.hung.out
ipa user-add
Uploaded two outputs of db_stat, one while the server is responding and one after it is hung.
At least one of those deadlocks is probably a later request from the Web UI that I did manually to check that the server is hung. Let me know if it would help to have a cleaner back trace immediately after ipa user-add.
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1049525
I'm trying to reproduce the problem on IPA, but so far no luck...
Could there be anything special in terms of the IPA installation? E.g., could you tell us what options did you give to the command line when you ran ipa-server-install to install IPA?
And could there be any knowledge you could share with such as how powerful your host is? How many CPUs, memory size, etc.? We are interested in the info since it might be a timing issue hidden for most cases...
Thanks! --noriko
The master was installed with:
ipa-server-install -r THE.FLATIRON.COM -n the.flatiron.com --mkhomedir --no-ntp --setup-dns --reverse-zone=102.10.in-addr.arpa. --forwarder=168.63.129.16
and the replica was set up with:
ipa-replica-install --no-ntp --setup-ca --setup-dns --forwarder=168.63.129.16 /var/lib/ipa/replica-info-p-ipa-wd02.prod.the.flatiron.com.gpg
The machines are virtual machines on Windows Azure. They are small, which means 1 core and 1.75 GB of memory. I'm not sure what else I can share that would be relevant but I'm happy to answer any questions. We have been running these IPA services for a few months with active usage, but I can't think of any unusual configurations. Mostly we have been adding/removing users, groups, and hosts. We first noticed the issue about a month ago, and I can't think of anything that we did around that time.
Thank you for your input. I'm still straggling with duplicating the problem. Reading the code and your stacktrace, it looks to me Password Policy is involved. Could there be any chance for you to have set up password policy?
Also, could you try running this command line? ldapsearch -LLLx -D 'cn=directory manager' -w Your_Directory_Manager_password -b "cn=costemplates,cn=accounts,dc=the,dc=flatiron,dc=com"
Are any costemplate entries (containing "objectclass: costemplate") returned?
Thanks, --noriko
Thanks noriko.
Yes, we have added a custom password policy. It applies to a group which contains a single user at the moment. The timing of this is in the same ballpark as when we started to see issues.
Yes, there is a costemplate entry. The full output:
{{{ $ ldapsearch -LLLx -D 'cn=directory manager' -W -b "cn=costemplates,cn=accounts,dc=the,dc=flatiron,dc=com" Enter LDAP Password: dn: cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com objectClass: top objectClass: nsContainer cn: cosTemplates
dn: cn=cn\3Dservice-users\2Ccn\3Dgroups\2Ccn\3Daccounts\2Cdc\3Dthe\2Cdc\3Dflat iron\2Cdc\3Dcom,cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com krbPwdPolicyReference: cn=service-users,cn=THE.FLATIRON.COM,cn=kerberos,dc=the ,dc=flatiron,dc=com objectClass: top objectClass: costemplate objectClass: extensibleobject objectClass: krbcontainer cosPriority: 50 cn: cn=service-users,cn=groups,cn=accounts,dc=the,dc=flatiron,dc=com }}}
Thank YOU, jmou. I was hoping to see more complicated costemplate entry, but it looks quite similar to mine. {{{ dn: cn=cn\3DG1\2Ccn\3Dgroups\2Ccn\3Daccounts\2Cdc\3Dkikif19\2Cdc\3Dexample\2Cdc\3Dcom,cn=cosTemplates,cn=accounts,dc=kikif19,dc=example,dc=com krbPwdPolicyReference: cn=G1,cn=KIKIF19.EXAMPLE.COM,cn=kerberos,dc=kikif19,dc=example,dc=com objectClass: top objectClass: costemplate objectClass: extensibleobject objectClass: krbcontainer cosPriority: 10 cn: cn=G1,cn=groups,cn=accounts,dc=kikif19,dc=example,dc=com }}} Now I'm confused more on the deadlock you are running into. Could you tell us how often it is happening? Every user-add hangs? Or once in a while?
The stacktrace you gave me first shows adding a user entry (e.g., uid=joe=test) is trying to call cos_cache_change_notify and hangs in it. The function is called only when the entry is cos related (costemplate, cosdefinition, or cossuperdefinition). And I don't think an ordinary user entry can be a cos entry...
Probably, I need more help from the IPA engineer here...
Hello jmou, if the deadlock occurs again, may we ask you to dump the entry in gdb? {{{ Using your stacktrace.1387489013.txt, please go to Thread 8, in which the add user is processed. (gdb) thread 8 Go to the frame 5 (cos_cache_change_notify). (gdb) frame 5 (gdb) print e $# = {e_sdn = {... Please repeat this until you see "a_type = "objectClass" (you are traversing the attribute linked list). (gdb) p e->e_attrs (gdb) p e->e_attrs->a_next (gdb) p e->e_attrs->a_next->a_next $8 = {a_type = 0x7f2f2c010850 "objectClass", a_present_values = {num = 11, max = 16, sorted = 0x7f2f2c0275e0, va = 0x7f2f2c00a700}, a_flags = 4, a_plugin = 0x7f2f6366cec0, a_deleted_values = {num = 0, max = 0, sorted = 0x0, va = 0x0}, a_listtofree = 0x0, a_next = 0x7f2f2c00b360, a_deletioncsn = 0x0, a_mr_eq_plugin = 0x7f2f6361d960, a_mr_ord_plugin = 0x0, a_mr_sub_plugin = 0x0} Once you reach the point, could you repeat this until you get 0x0? (you are dumping all the objectclass values). (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[0] $9 = {bv = {bv_len = 3, bv_val = 0x7f2f2c010870 "top"}, v_csnset = 0x0, v_flags = 0} (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[1] $10 = {bv = {bv_len = 6, bv_val = 0x7f2f2c007d30 "person"}, v_csnset = 0x0, v_flags = 0} (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[2] $11 = {bv = {bv_len = 20, bv_val = 0x7f2f2c007d50 "organizationalperson"}, v_csnset = 0x0, v_flags = 0} (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[3] $12 = {bv = {bv_len = 13, bv_val = 0x7f2f2c007d70 "inetorgperson"}, v_csnset = 0x0, v_flags = 0} (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[4] $13 = {bv = {bv_len = 8, bv_val = 0x7f2f2c007d90 "inetuser"}, v_csnset = 0x0, v_flags = 0} (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[5] $14 = {bv = {bv_len = 12, bv_val = 0x7f2f2c007db0 "posixaccount"}, v_csnset = 0x0, v_flags = 0} (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[6] $15 = {bv = {bv_len = 15, bv_val = 0x7f2f2c00ef20 "krbprincipalaux"}, v_csnset = 0x0, v_flags = 0} (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[7] $16 = {bv = {bv_len = 18, bv_val = 0x7f2f2c00ef70 "krbticketpolicyaux"}, v_csnset = 0x0, v_flags = 0} (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[8] $17 = {bv = {bv_len = 9, bv_val = 0x7f2f2c00e9f0 "ipaobject"}, v_csnset = 0x0, v_flags = 0} (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[9] $18 = {bv = {bv_len = 10, bv_val = 0x7f2f2c00ea40 "ipasshuser"}, v_csnset = 0x0, v_flags = 0} (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[10] $19 = {bv = {bv_len = 20, bv_val = 0x7f2f2c0275c0 "ipaSshGroupOfPubKeys"}, v_csnset = 0x0, v_flags = 0} (gdb) p e->e_attrs->a_next->a_next->a_present_values.va[11] Cannot access memory at address 0x0 }}}
If you could share the output from gdb, it'll be a great help for us. If you have any difficulties, we could go through with you on IRC #389.
objectClass gdb session typescript
The hang is reproducible 100% of the time when invoking user-add. I induced a hang and attached the gdb session.
Great info!! Thank you so much! May I ask you to run one more search and attach the result to this ticket?
ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -w YOUR_PW -b "dc=the,dc=flatiron,dc=com" "(objectclass=ldapsubentry)"
And one more thing... Could you share the command line you ran to set up your password policy?
ipa pwpolicy [options] ???
Thanks noriko. Here is the result of that command.
$ ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -W -b "dc=the,dc=flatiron,dc=com" "(objectclass=ldapsubentry)" Enter LDAP Password: dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com cn: Password Policy cosspecifier: memberOf cosAttribute: krbPwdPolicyReference override costemplatedn: cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com objectClass: top objectClass: ldapsubentry objectClass: cosSuperDefinition objectClass: cosClassicDefinition description: Password Policy based on group membership
Unfortunately I don't have the command for creating the password policy. I believe it was created using the web UI. The parameters are:
Group: service-users Max lifetime (days): 3650 Min lifetime (hours): 1 History size (number of passwords): 0 Character classes: Min length: 16 Max failures: Failure reset interval (seconds): Lockout duration (seconds): Priority: 50
Hmmm, I was expecting the bug is there, but it looks it's deeper than I thought... :(
Could you try this? At the same location you dumped the entry, (gdb) t 29 (gdb) f 5 Could you dump the cosCache and its contents? (gdb) p pCache ... (gdb) p pCache->pDefs ... (gdb) p *pCache->pDefs->pCosTemplateDn ... And repleat this until it returns 0x0? (gdb) p pCache->ppTemlateList[0] (gdb) p pCache->ppTemlateList[1] (gdb) p pCache->ppTemlateList[N] $# = 0x0
GREAT!! Because of this value, you are having the deadlock problem... (gdb) p pCache->ppTemplateList[0] $6 = 0x7f810802ab40 "cn=accounts,dc=the,dc=flatiron,dc=com"
My pCache only has your ppTemplateList[1] equivalent, which is supposed to be... Now we need to figure out from where ppTemplateList[0] value came... But this is a great clue. Thanks so much!!
Glad I could help! Let me know if there's any other information you need.
:) Could you try one more? Your pCosTemplateDn has more values... (gdb) p *pCache->pDefs->pCosTemplateDn $3 = {list = {pNext = 0x7f810805f750, index = 0x0}, val = 0x7f81080882d0 "cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com"}
Could you repeat this until you get 0x0? (gdb) p pCache->pDefs->pCosTemplateDn->list.pNext (gdb) p pCache->pDefs->pCosTemplateDn->list.pNext->list.pNext (gdb) p *pCache->pDefs->pCosTemplateDn->list.pNext->list.pNext->list.pNext
Thank you!!
Yes! This is it. (gdb) p (struct _cosAttrValue)pCache->pDefs->pCosTemplateDn->list.pNext $2 = {list = {pNext = 0x0, index = 0x0}, val = 0x7f03280363a0 "cn=accounts,dc=the,dc=flatiron,dc=com"} I'm now investigating why this value is set here... Thanks!!
Ticket #47649 - Server hangs in cos_cache when adding a user entry
Bug description: cos_dn_defs_cb reads cosDefinition and sets up the cos Definition part of cos cache. In the function, when processing cosAttribute, cosTargetTree and cosTemlpateDn are missing, it sets the parent dn of the cos definition dn. This parent setting is needed only when the 2 attributes are completely missing from the cos definition. But if the attributes are located below cosAttribute (see the Example cos definition), in addition to "cn=cosTemplates,ou=people,dc=example,dc=com", the parent of "cn=generate- PostalCode,ou=People,dc=example,dc=com" is added to the cos cache as cosTemplateDn. Example cos definition: dn: cn=generatePostalCode,ou=People,dc=example,dc=com description: generate postalCode attr based on location objectClass: top objectClass: ldapsubentry objectClass: cossuperdefinition objectClass: cosClassicDefinition cosAttribute: postalCode costemplatedn: cn=cosTemplates,ou=people,dc=example,dc=com cosSpecifier: l cn: generatePostalCode The mistakenly added cosTemplatedDn makes adding an entry under ou=People notify recreating the cos cache. The notification needs to be outside of backend transaction since it causes a deadlock with the cos_cache_wait_on_change thread which cannot read the DB due to the transaction but holds the lock that the notifier thread is waiting for.
Fix description: The parent of the cos definition dn is set to the cosTargetTree and the cosTemlpateDn, only when the attributes are completely missing.
git patch file (master) 0001-Ticket-47649-Server-hangs-in-cos_cache-when-adding-a.patch
in {{{ char *orig = slapi_dn_parent(pDn->val); }}} need to handle case where orig is NULL. Otherwise, ack
Nice work!
git patch file (master; take 2) -- added the case when (orig == NULL). 0001-Ticket-47649-Server-hangs-in-cos_cache-when-adding-a.2.patch
Thank you for the review, Rich! I've updated the patch...
Reviewed by Rich (Thank you!!)
Pushed to master: c52987d..1e52401 master -> master commit 1e52401
Pushed to 389-ds-base-1.3.2: 4266657..01c0794 389-ds-base-1.3.2 -> 389-ds-base-1.3.2 commit 01c0794
Pushed to 389-ds-base-1.3.1: d68dc32..1ebad4b 389-ds-base-1.3.1 -> 389-ds-base-1.3.1 commit 1ebad4b
Thank you so much, jmou. Without your help, we could not figure out the cause of this bug.
We will respin 389-ds-base-1.3.1.19 as soon as possible. But in the meantime, you may want to work around the deadlock problem as follows:
ldapmodify -x -D 'cn=directory manager' -w YOUR_PASSWORD << EOF dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com changetype: modify delete: cosAttribute - dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com changetype: modify add: cosAttribute cosAttribute: krbPwdPolicyReference override EOF
This delete-then-add operation should move the cosAttribute at the end of the entry, which is after costemplatedn. It'd stop adding this DN to the cos cache.
(gdb) p pCache->ppTemplateList[0] $6 = 0x7f810802ab40 "cn=accounts,dc=the,dc=flatiron,dc=com"
Please reopen the bug if it does not solve your problem...
Once again, thank you very much! --noriko
Thank you noriko. Unfortunately this workaround did not work for me. I ran this on my initial server (what I have been calling the master), which is the one that is hanging.
$ ldapmodify -x -D 'cn=directory manager' -W -f cos.ldif $ cat cos.ldif dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com changetype: modify delete: cosAttribute - add: cosAttribute cosAttribute: krbPwdPolicyReference override - $ ipa user-add joe-test --first=joe --last=test
The last command still hangs. Did I miss something?
Could you run this command line again? $ ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -W -b "dc=the,dc=flatiron,dc=com" "(objectclass=ldapsubentry)"
That hint got it to work for me! Thanks for all the help.
{{{ $ ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -W -b "dc=the,dc=flatiron,dc=com" "(objectclass=ldapsubentry)" Enter LDAP Password: dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com cosAttribute: krbPwdPolicyReference override cn: Password Policy cosspecifier: memberOf costemplatedn: cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com objectClass: top objectClass: ldapsubentry objectClass: cosSuperDefinition objectClass: cosClassicDefinition description: Password Policy based on group membership
$ cat cos2.ldif dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com changetype: modify delete: costemplatedn - add: costemplatedn costemplatedn: cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com - $ ldapmodify -x -D 'cn=directory manager' -W -f cos2.ldif Enter LDAP Password: modifying entry "cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com"
$ ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -W -b "dc=the,dc=flatiron,dc=com" "(objectclass=ldapsubentry)" Enter LDAP Password: dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com costemplatedn: cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com cosAttribute: krbPwdPolicyReference override cn: Password Policy cosspecifier: memberOf objectClass: top objectClass: ldapsubentry objectClass: cosSuperDefinition objectClass: cosClassicDefinition description: Password Policy based on group membership
User login: joe-test First name: joe Last name: test Full name: joe test Display name: joe test Initials: jt Home directory: /home/joe-test GECOS: joe test Login shell: /bin/bash Kerberos principal: joe-test@THE.FLATIRON.COM Email address: joe-test@flatiron.com UID: 749400051 GID: 749400051 Password: False Member of groups: ipausers Kerberos keys available: False }}}
Great news!! I'm sooooo happy to hear that...
Thank you so much for your help and patience! ;)
Pushed to 1.2.11
ad5314f..8f39dd5 389-ds-base-1.2.11 -> 389-ds-base-1.2.11 commit 8f39dd5
Metadata Update from @rmeggins: - Issue assigned to nhosoi - Issue set to the milestone: 1.3.1.18
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/986
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)
Login to comment on this ticket.