#47649 Server hangs in cos_cache when adding a user entry
Closed: Fixed None Opened 5 years ago by jmou.

Originally reported on the IPA mailing list: https://www.redhat.com/archives/freeipa-users/2013-December/msg00106.html

I am running IPA on Fedora 19 with one master and one replica. Originally I had an issue where the replica would hang while starting up. It appeared to be a deadlock in cos_cache.c which caused any LDAP request to hang. I was able to workaround this issue with Rich's suggestion: https://www.redhat.com/archives/freeipa-users/2013-December/msg00117.html

Now the server no longer hangs while starting up. However it does hang when adding a new user with ipa user-add joe-test --first=joe --last=test. The backtrace (attached) seems to still indicate a deadlock in cos_cache.c. Both the master and replica hang depending where the request is sent.

I was also apparently able to remedy the replica by running ipa-replica-manage re-initialize --from p-ipa-wd01.prod.the.flatiron.com. Now user add requests to the replica work fine while user add requests to the master hang.


Would it be safe to also run ipa-replica-manage re-initialize on the master?

Replying to [comment:1 jmou]:

Would it be safe to also run ipa-replica-manage re-initialize on the master?

Yes. In IPA, there really isn't a "master" master instance. All of the replicas are equally masters.

I ran ipa-replica-manage re-initialize --from p-ipa-wd02.prod.the.flatiron.com on the master but it still hangs.

We just ran into this issue again today. Adding users to the master (or primary replica, if you will) hangs, but adding users to the replica is fine. Any other thoughts on things I can try?

Could you try running this libdb utility and attach the result to this ticket?

db_stat -h /var/lib/dirsrv/slapd-<master>/db/ -C A

Most likely the deadlock is caused by the combination of Thread 8 and 41 from stacktrace.1387489013.txt​.

Thread 8 holds dblayer lock and has started transaction, then it is waiting for the COS change_lock released.

Thread 41 holds COS change_lock, then it is trying to read DB which is blocked by the transaction started by Thread 8.

But I'm not sure what Thread 1 and 15 are waiting for. Both look waiting for conn->c_mutex.

db_stat -h /var/lib/dirsrv/slapd-<master>/db/ -C A
db_stat.out

Same command after issuing ipa user-add (server is hung)
db_stat.hung.out

Uploaded two outputs of db_stat, one while the server is responding and one after it is hung.

At least one of those deadlocks is probably a later request from the Web UI that I did manually to check that the server is hung. Let me know if it would help to have a cleaner back trace immediately after ipa user-add.

I'm trying to reproduce the problem on IPA, but so far no luck...

Could there be anything special in terms of the IPA installation? E.g., could you tell us what options did you give to the command line when you ran ipa-server-install to install IPA?

And could there be any knowledge you could share with such as how powerful your host is? How many CPUs, memory size, etc.? We are interested in the info since it might be a timing issue hidden for most cases...

Thanks!
--noriko

The master was installed with:

ipa-server-install -r THE.FLATIRON.COM -n the.flatiron.com --mkhomedir --no-ntp --setup-dns --reverse-zone=102.10.in-addr.arpa. --forwarder=168.63.129.16

and the replica was set up with:

ipa-replica-install --no-ntp --setup-ca --setup-dns --forwarder=168.63.129.16 /var/lib/ipa/replica-info-p-ipa-wd02.prod.the.flatiron.com.gpg

The machines are virtual machines on Windows Azure. They are small, which means 1 core and 1.75 GB of memory. I'm not sure what else I can share that would be relevant but I'm happy to answer any questions. We have been running these IPA services for a few months with active usage, but I can't think of any unusual configurations. Mostly we have been adding/removing users, groups, and hosts. We first noticed the issue about a month ago, and I can't think of anything that we did around that time.

Thank you for your input. I'm still straggling with duplicating the problem. Reading the code and your stacktrace, it looks to me Password Policy is involved. Could there be any chance for you to have set up password policy?

Also, could you try running this command line?
ldapsearch -LLLx -D 'cn=directory manager' -w Your_Directory_Manager_password -b "cn=costemplates,cn=accounts,dc=the,dc=flatiron,dc=com"

Are any costemplate entries (containing "objectclass: costemplate") returned?

Thanks,
--noriko

Thanks noriko.

Yes, we have added a custom password policy. It applies to a group which contains a single user at the moment. The timing of this is in the same ballpark as when we started to see issues.

Yes, there is a costemplate entry. The full output:

{{{
$ ldapsearch -LLLx -D 'cn=directory manager' -W -b "cn=costemplates,cn=accounts,dc=the,dc=flatiron,dc=com"
Enter LDAP Password:
dn: cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com
objectClass: top
objectClass: nsContainer
cn: cosTemplates

dn: cn=cn\3Dservice-users\2Ccn\3Dgroups\2Ccn\3Daccounts\2Cdc\3Dthe\2Cdc\3Dflat
iron\2Cdc\3Dcom,cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com
krbPwdPolicyReference: cn=service-users,cn=THE.FLATIRON.COM,cn=kerberos,dc=the
,dc=flatiron,dc=com
objectClass: top
objectClass: costemplate
objectClass: extensibleobject
objectClass: krbcontainer
cosPriority: 50
cn: cn=service-users,cn=groups,cn=accounts,dc=the,dc=flatiron,dc=com
}}}

Thank YOU, jmou. I was hoping to see more complicated costemplate entry, but it looks quite similar to mine.
{{{
dn: cn=cn\3DG1\2Ccn\3Dgroups\2Ccn\3Daccounts\2Cdc\3Dkikif19\2Cdc\3Dexample\2Cdc\3Dcom,cn=cosTemplates,cn=accounts,dc=kikif19,dc=example,dc=com
krbPwdPolicyReference: cn=G1,cn=KIKIF19.EXAMPLE.COM,cn=kerberos,dc=kikif19,dc=example,dc=com
objectClass: top
objectClass: costemplate
objectClass: extensibleobject
objectClass: krbcontainer
cosPriority: 10
cn: cn=G1,cn=groups,cn=accounts,dc=kikif19,dc=example,dc=com
}}}
Now I'm confused more on the deadlock you are running into. Could you tell us how often it is happening? Every user-add hangs? Or once in a while?

The stacktrace you gave me first shows adding a user entry (e.g., uid=joe=test) is trying to call cos_cache_change_notify and hangs in it. The function is called only when the entry is cos related (costemplate, cosdefinition, or cossuperdefinition). And I don't think an ordinary user entry can be a cos entry...

Probably, I need more help from the IPA engineer here...

Hello jmou, if the deadlock occurs again, may we ask you to dump the entry in gdb?
{{{
Using your stacktrace.1387489013.txt, please go to Thread 8, in which the add user is processed.
(gdb) thread 8
Go to the frame 5 (cos_cache_change_notify).
(gdb) frame 5
(gdb) print e
$# = {e_sdn = {...
Please repeat this until you see "a_type = "objectClass" (you are traversing the attribute linked list).
(gdb) p
e->e_attrs
(gdb) p e->e_attrs->a_next
(gdb) p
e->e_attrs->a_next->a_next
$8 = {a_type = 0x7f2f2c010850 "objectClass", a_present_values = {num = 11, max = 16, sorted = 0x7f2f2c0275e0,
va = 0x7f2f2c00a700}, a_flags = 4, a_plugin = 0x7f2f6366cec0, a_deleted_values = {num = 0, max = 0,
sorted = 0x0, va = 0x0}, a_listtofree = 0x0, a_next = 0x7f2f2c00b360, a_deletioncsn = 0x0,
a_mr_eq_plugin = 0x7f2f6361d960, a_mr_ord_plugin = 0x0, a_mr_sub_plugin = 0x0}
Once you reach the point, could you repeat this until you get 0x0? (you are dumping all the objectclass values).
(gdb) p e->e_attrs->a_next->a_next->a_present_values.va[0]
$9 = {bv = {bv_len = 3, bv_val = 0x7f2f2c010870 "top"}, v_csnset = 0x0, v_flags = 0}
(gdb) p
e->e_attrs->a_next->a_next->a_present_values.va[1]
$10 = {bv = {bv_len = 6, bv_val = 0x7f2f2c007d30 "person"}, v_csnset = 0x0, v_flags = 0}
(gdb) p e->e_attrs->a_next->a_next->a_present_values.va[2]
$11 = {bv = {bv_len = 20, bv_val = 0x7f2f2c007d50 "organizationalperson"}, v_csnset = 0x0, v_flags = 0}
(gdb) p
e->e_attrs->a_next->a_next->a_present_values.va[3]
$12 = {bv = {bv_len = 13, bv_val = 0x7f2f2c007d70 "inetorgperson"}, v_csnset = 0x0, v_flags = 0}
(gdb) p e->e_attrs->a_next->a_next->a_present_values.va[4]
$13 = {bv = {bv_len = 8, bv_val = 0x7f2f2c007d90 "inetuser"}, v_csnset = 0x0, v_flags = 0}
(gdb) p
e->e_attrs->a_next->a_next->a_present_values.va[5]
$14 = {bv = {bv_len = 12, bv_val = 0x7f2f2c007db0 "posixaccount"}, v_csnset = 0x0, v_flags = 0}
(gdb) p e->e_attrs->a_next->a_next->a_present_values.va[6]
$15 = {bv = {bv_len = 15, bv_val = 0x7f2f2c00ef20 "krbprincipalaux"}, v_csnset = 0x0, v_flags = 0}
(gdb) p
e->e_attrs->a_next->a_next->a_present_values.va[7]
$16 = {bv = {bv_len = 18, bv_val = 0x7f2f2c00ef70 "krbticketpolicyaux"}, v_csnset = 0x0, v_flags = 0}
(gdb) p e->e_attrs->a_next->a_next->a_present_values.va[8]
$17 = {bv = {bv_len = 9, bv_val = 0x7f2f2c00e9f0 "ipaobject"}, v_csnset = 0x0, v_flags = 0}
(gdb) p
e->e_attrs->a_next->a_next->a_present_values.va[9]
$18 = {bv = {bv_len = 10, bv_val = 0x7f2f2c00ea40 "ipasshuser"}, v_csnset = 0x0, v_flags = 0}
(gdb) p e->e_attrs->a_next->a_next->a_present_values.va[10]
$19 = {bv = {bv_len = 20, bv_val = 0x7f2f2c0275c0 "ipaSshGroupOfPubKeys"}, v_csnset = 0x0, v_flags = 0}
(gdb) p
e->e_attrs->a_next->a_next->a_present_values.va[11]
Cannot access memory at address 0x0
}}}

If you could share the output from gdb, it'll be a great help for us. If you have any difficulties, we could go through with you on IRC #389.

The hang is reproducible 100% of the time when invoking user-add. I induced a hang and attached the gdb session.

Great info!! Thank you so much! May I ask you to run one more search and attach the result to this ticket?

ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -w YOUR_PW -b "dc=the,dc=flatiron,dc=com" "(objectclass=ldapsubentry)"

And one more thing... Could you share the command line you ran to set up your password policy?

ipa pwpolicy [options] ???

Thanks noriko. Here is the result of that command.

$ ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -W -b "dc=the,dc=flatiron,dc=com" "(objectclass=ldapsubentry)"
Enter LDAP Password:
dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com
cn: Password Policy
cosspecifier: memberOf
cosAttribute: krbPwdPolicyReference override
costemplatedn: cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com
objectClass: top
objectClass: ldapsubentry
objectClass: cosSuperDefinition
objectClass: cosClassicDefinition
description: Password Policy based on group membership

Unfortunately I don't have the command for creating the password policy. I believe it was created using the web UI. The parameters are:

Group: service-users
Max lifetime (days): 3650
Min lifetime (hours): 1
History size (number of passwords): 0
Character classes:
Min length: 16
Max failures:
Failure reset interval (seconds):
Lockout duration (seconds):
Priority: 50

Hmmm, I was expecting the bug is there, but it looks it's deeper than I thought... :(

Could you try this?
At the same location you dumped the entry,
(gdb) t 29
(gdb) f 5
Could you dump the cosCache and its contents?
(gdb) p pCache
...
(gdb) p
pCache->pDefs
...
(gdb) p *pCache->pDefs->pCosTemplateDn
...
And repleat this until it returns 0x0?
(gdb) p pCache->ppTemlateList[0]
(gdb) p pCache->ppTemlateList[1]
(gdb) p pCache->ppTemlateList[N]
$# = 0x0

Sure, here it is: {{{ (gdb) t 12 [Switching to thread 12 (Thread 0x7f80f17ea700 (LWP 22826))] #0 0x00007f8124a72ddd in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) f 5 #5 0x00007f811d8d9aec in cos_cache_change_notify (pb=pb@entry=0x7f810806ec80) at ldap/servers/plugins/cos/cos_cache.c:3366 3366 slapi_lock_mutex(change_lock); (gdb) p *pCache $1 = {pDefs = 0x7f8108055860, ppAttrIndex = 0x7f810806cbd0, attrCount = 1, ppTemplateList = 0x7f8108074ce0, templateCount = 1, refCount = 1, vattr_cacheable = 0} (gdb) p *pCache->pDefs $2 = {list = {pNext = 0x0, index = 0x0}, cosType = 1, pDn = 0x7f810803d540, pCosTargetTree = 0x7f810803d460, pCosTemplateDn = 0x7f81080356b0, pCosSpecifier = 0x7f81080592d0, pCosAttrs = 0x7f8108049c90, pCosOverrides = 0x7f8108014bc0, pCosOperational = 0x0, pCosOpDefault = 0x0, pCosMerge = 0x0, pCosTmps = 0x7f81080417a0} (gdb) p *pCache->pDefs->pCosTemplateDn $3 = {list = {pNext = 0x7f810805f750, index = 0x0}, val = 0x7f81080882d0 "cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com"} [...snip...] (gdb) p pCache->ppTemplateList[0] $6 = 0x7f810802ab40 "cn=accounts,dc=the,dc=flatiron,dc=com" (gdb) p pCache->ppTemplateList[1] $7 = 0x7f81080882d0 "cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com" (gdb) p pCache->ppTemplateList[2] $8 = 0x0 }}}

GREAT!! Because of this value, you are having the deadlock problem...
(gdb) p pCache->ppTemplateList[0]
$6 = 0x7f810802ab40 "cn=accounts,dc=the,dc=flatiron,dc=com"

My pCache only has your ppTemplateList[1] equivalent, which is supposed to be... Now we need to figure out from where ppTemplateList[0] value came... But this is a great clue. Thanks so much!!

Glad I could help! Let me know if there's any other information you need.

:) Could you try one more? Your pCosTemplateDn has more values...
(gdb) p *pCache->pDefs->pCosTemplateDn
$3 = {list = {pNext = 0x7f810805f750, index = 0x0}, val = 0x7f81080882d0 "cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com"}

Could you repeat this until you get 0x0?
(gdb) p pCache->pDefs->pCosTemplateDn->list.pNext
(gdb) p
pCache->pDefs->pCosTemplateDn->list.pNext->list.pNext
(gdb) p *pCache->pDefs->pCosTemplateDn->list.pNext->list.pNext->list.pNext

Thank you!!

I think this should be what you are looking for. Let me know. I decided to generate a core dump too, instead of hanging the server each time. Hopefully that will make subsequent debug sessions more consistent too. {{{ (gdb) t 35 [Switching to thread 35 (Thread 0x7f03127ec700 (LWP 24642))] #0 0x00007f0345216ddd in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) f 5 #5 0x00007f033e07daec in cos_cache_change_notify (pb=pb@entry=0x7f032805bd40) at ldap/servers/plugins/cos/cos_cache.c:3366 3366 slapi_lock_mutex(change_lock); (gdb) p *pCache->pDefs->pCosTemplateDn $1 = {list = {pNext = 0x7f0328049a50, index = 0x0}, val = 0x7f032804f510 "cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com"} (gdb) p *pCache->pDefs->pCosTemplateDn->list.pNext Attempt to dereference a generic pointer. (gdb) ptype pCache->pDefs->pCosTemplateDn type = struct _cosAttrValue { cosIndexedLinkedList list; char *val; } * (gdb) p *(struct _cosAttrValue)pCache->pDefs->pCosTemplateDn->list.pNext Structure has no component named operator*. (gdb) p *(struct _cosAttrValue*)pCache->pDefs->pCosTemplateDn->list.pNext $2 = {list = {pNext = 0x0, index = 0x0}, val = 0x7f03280363a0 "cn=accounts,dc=the,dc=flatiron,dc=com"} (gdb) p *(struct _cosAttrValue*)pCache->pDefs->pCosTemplateDn->list.pNext->list.pNext Attempt to dereference a generic pointer. (gdb) p *(struct _cosAttrValue*)((struct _cosAttrValue*)pCache->pDefs->pCosTemplateDn->list.pNext)->list.pNext Cannot access memory at address 0x0 }}}

Yes! This is it.
(gdb) p (struct _cosAttrValue)pCache->pDefs->pCosTemplateDn->list.pNext
$2 = {list = {pNext = 0x0, index = 0x0}, val = 0x7f03280363a0
"cn=accounts,dc=the,dc=flatiron,dc=com"}
I'm now investigating why this value is set here...
Thanks!!

Ticket #47649 - Server hangs in cos_cache when adding a user entry

Bug description: cos_dn_defs_cb reads cosDefinition and sets up the cos
Definition part of cos cache. In the function, when processing cosAttribute,
cosTargetTree and cosTemlpateDn are missing, it sets the parent dn of the
cos definition dn. This parent setting is needed only when the 2 attributes
are completely missing from the cos definition. But if the attributes are
located below cosAttribute (see the Example cos definition), in addition to
"cn=cosTemplates,ou=people,dc=example,dc=com", the parent of "cn=generate-
PostalCode,ou=People,dc=example,dc=com" is added to the cos cache as
cosTemplateDn.
Example cos definition:
dn: cn=generatePostalCode,ou=People,dc=example,dc=com
description: generate postalCode attr based on location
objectClass: top
objectClass: ldapsubentry
objectClass: cossuperdefinition
objectClass: cosClassicDefinition
cosAttribute: postalCode
costemplatedn: cn=cosTemplates,ou=people,dc=example,dc=com
cosSpecifier: l
cn: generatePostalCode
The mistakenly added cosTemplatedDn makes adding an entry under ou=People notify
recreating the cos cache. The notification needs to be outside of backend
transaction since it causes a deadlock with the cos_cache_wait_on_change thread
which cannot read the DB due to the transaction but holds the lock that the
notifier thread is waiting for.

Fix description: The parent of the cos definition dn is set to the cosTargetTree
and the cosTemlpateDn, only when the attributes are completely missing.

Please enter the commit message for your changes. Lines starting

in
{{{
char *orig = slapi_dn_parent(pDn->val);
}}}
need to handle case where orig is NULL. Otherwise, ack

Nice work!

git patch file (master; take 2) -- added the case when (orig == NULL).
0001-Ticket-47649-Server-hangs-in-cos_cache-when-adding-a.2.patch

Thank you for the review, Rich! I've updated the patch...

Reviewed by Rich (Thank you!!)

Pushed to master:
c52987d..1e52401 master -> master
commit 1e52401

Pushed to 389-ds-base-1.3.2:
4266657..01c0794 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit 01c0794

Pushed to 389-ds-base-1.3.1:
d68dc32..1ebad4b 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 1ebad4b

Thank you so much, jmou. Without your help, we could not figure out the cause of this bug.

We will respin 389-ds-base-1.3.1.19 as soon as possible. But in the meantime, you may want to work around the deadlock problem as follows:

ldapmodify -x -D 'cn=directory manager' -w YOUR_PASSWORD << EOF
dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com
changetype: modify
delete: cosAttribute
-
dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com
changetype: modify
add: cosAttribute
cosAttribute: krbPwdPolicyReference override
EOF

This delete-then-add operation should move the cosAttribute at the end of the entry, which is after costemplatedn. It'd stop adding this DN to the cos cache.

(gdb) p pCache->ppTemplateList[0]
$6 = 0x7f810802ab40 "cn=accounts,dc=the,dc=flatiron,dc=com"

Please reopen the bug if it does not solve your problem...

Once again, thank you very much!
--noriko

Thank you noriko. Unfortunately this workaround did not work for me. I ran this on my initial server (what I have been calling the master), which is the one that is hanging.

$ ldapmodify -x -D 'cn=directory manager' -W -f cos.ldif
$ cat cos.ldif
dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com
changetype: modify
delete: cosAttribute
-
add: cosAttribute
cosAttribute: krbPwdPolicyReference override
-
$ ipa user-add joe-test --first=joe --last=test

The last command still hangs. Did I miss something?

Could you run this command line again?
$ ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -W -b "dc=the,dc=flatiron,dc=com" "(objectclass=ldapsubentry)"

That hint got it to work for me! Thanks for all the help.

{{{
$ ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -W -b "dc=the,dc=flatiron,dc=com" "(objectclass=ldapsubentry)"
Enter LDAP Password:
dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com
cosAttribute: krbPwdPolicyReference override
cn: Password Policy
cosspecifier: memberOf
costemplatedn: cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com
objectClass: top
objectClass: ldapsubentry
objectClass: cosSuperDefinition
objectClass: cosClassicDefinition
description: Password Policy based on group membership

$ cat cos2.ldif
dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com
changetype: modify
delete: costemplatedn
-
add: costemplatedn
costemplatedn: cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com
-
$ ldapmodify -x -D 'cn=directory manager' -W -f cos2.ldif
Enter LDAP Password:
modifying entry "cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com"

$ ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -W -b "dc=the,dc=flatiron,dc=com" "(objectclass=ldapsubentry)"
Enter LDAP Password:
dn: cn=Password Policy,cn=accounts,dc=the,dc=flatiron,dc=com
costemplatedn: cn=cosTemplates,cn=accounts,dc=the,dc=flatiron,dc=com
cosAttribute: krbPwdPolicyReference override
cn: Password Policy
cosspecifier: memberOf
objectClass: top
objectClass: ldapsubentry
objectClass: cosSuperDefinition
objectClass: cosClassicDefinition
description: Password Policy based on group membership

$ ipa user-add joe-test --first=joe --last=test

Added user "joe-test"

User login: joe-test
First name: joe
Last name: test
Full name: joe test
Display name: joe test
Initials: jt
Home directory: /home/joe-test
GECOS: joe test
Login shell: /bin/bash
Kerberos principal: joe-test@THE.FLATIRON.COM
Email address: joe-test@flatiron.com
UID: 749400051
GID: 749400051
Password: False
Member of groups: ipausers
Kerberos keys available: False
}}}

Great news!! I'm sooooo happy to hear that...

Thank you so much for your help and patience! ;)

Pushed to 1.2.11

ad5314f..8f39dd5 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit 8f39dd5

Metadata Update from @rmeggins:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.3.1.18

2 years ago

Login to comment on this ticket.

Metadata