#48406 Avoid self deadlock by PR_Lock(conn->c_mutex)
Closed: wontfix None Opened 4 years ago by nhosoi.

Fixing ticket 48338 introduced a self deadlock.

To avoid the self deadlock, tried to remove PR_Lock(conn->c_mutex) which looked harmless, but it introduced a crash by memory corruption.


I don't understand. Is the intention that you don't try to lock the same mutex more than once in the same thread? If so, you should probably use a thread local variable, and it may be that PR_Monitor is implemented that way, so you might be getting into PR_Monitor territory.

Replying to [comment:3 rmeggins]:

I don't understand. Is the intention that you don't try to lock the same mutex more than once in the same thread? If so, you should probably use a thread local variable, and it may be that PR_Monitor is implemented that way, so you might be getting into PR_Monitor territory.

Well, I was going to send another email to ignore the review request. Sorry about the noise. To reduce the confusion, I'm going to delete the proposed patch.

How about we store the c_mutex locked status in the thread private by PR_SetThreadPrivate and refer it to determine to call PR_Lock or skip it for the re-entrant case? 1) Is it doable? 2) Accessing the thread private is as heavy as PR_Monitor?

I have used local thread storage for the same thing in dynamic plugins. There are helper functions in ds/ldap/servers/slapd/thread_data.c that you should check out.

git patch file (master; take 2) -- using thread data for managing the lock on conn->c_mutex to make it reentrant
0002-Ticket-48406-Avoid-self-deadlock-by-PR_Lock-conn-c_m.patch

The fix looks good to me, but isn't this a kind of reimplementation of PR_Monitor ?

I think Rich mentioned that PR_Monitor might be too heavy weighted, but I wonder if accessing thread local data isn't the same overhead than getting current_thread to compare to the lock ownwer

Replying to [comment:6 lkrispen]:

The fix looks good to me, but isn't this a kind of reimplementation of PR_Monitor ?

I think Rich mentioned that PR_Monitor might be too heavy weighted, but I wonder if accessing thread local data isn't the same overhead than getting current_thread to compare to the lock ownwer

Right. If we are going to implement a locking mechanism that works just like PR_Monitor, we should just use PR_Monitor instead.

Replying to [comment:7 rmeggins]:

Replying to [comment:6 lkrispen]:

The fix looks good to me, but isn't this a kind of reimplementation of PR_Monitor ?

I think Rich mentioned that PR_Monitor might be too heavy weighted, but I wonder if accessing thread local data isn't the same overhead than getting current_thread to compare to the lock ownwer

Right. If we are going to implement a locking mechanism that works just like PR_Monitor, we should just use PR_Monitor instead.

well, I'm fine replacing the changes with PR_Monitor if everyone agrees. I was just wondering PR_Monitor might be doing further and it might add more overheads. If not, I don't mind switching to PR_Monitor.

The background why this fix needed is 1) the current PR_Lock(conn->c_mutex) still generates self-deadlocks, and 2) skipping PR_Lock(conn->c_mutex) for conn->cin_addr (#48338 patch4) brought a memory corruption.

I cannot think of a better solution other than replacing PR_Lock with PR_Monitor-like function.

Whatever implementation we choose, we will need

  • tests to reproduce the original problem
  • performance tests

So I think it is a good idea to just go with PR_Monitor, which is the simplest approach, and test it to make sure it fixes the problem, doesn't introduce any regressions, and doesn't impact performance.

10K Entries [0000 - 9999]
{{{
dn: uid=test####,dc=example,dc=com
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: shadowAccount
cn: guest#
sn: guest#
uid: guest#
givenname: givenname#
description: description#
userPassword: password#
mail: uid#
uidnumber: #
gidnumber: #
shadowMin: 0
shadowMax: 99999
shadowInactive: 30
shadowWarning: 7
homeDirectory: /home/uid#
}}}
Operation:
{{{
Search1: ldapsearch ... -D 'uid=test9999,dc=example,dc=com' -w password9999 -b "dc=example,dc=com" -E "pr=1000/noprompt" -l 10000 "(&(objectClass=posixAccount)(uid=)(uidNumber=)(gidNumber=))" uid uidNumber userPassword
Search2: ldapsearch ... -b "dc=example,dc=com" -E "pr=1000/noprompt" -l 10000 "(&(objectClass=posixAccount)(uid=
)(uidNumber=)(gidNumber=))" uid uidNumber userPassword
}}}

Real Time (average of 10 searches)
{{{
- Search1 (sec) Search2 (sec)
Master Branch 0.8758 hung
Thread Data (take 2) 0.8511 0.7939
PR_Monitor (take 3) 0.8583 0.7945
}}}
User Time (average of 10 searches)
{{{
- Search1 (sec) Search2 (sec)
Master Branch 0.0163 hung
Thread Data (take 2) 0.0176 0.0162
PR_Monitor (take 3) 0.0181 0.0167
}}}
System Time (average of 10 searches)
{{{
- Search1 (sec) Search2 (sec)
Master Branch 0.0120 hung
Thread Data (take 2) 0.0089 0.0121
PR_Monitor (take 3) 0.0098 0.0096
}}}

Re-entrant approaches rather shorten the elapsed time.

git patch file (master; take 3) -- replacing PR_Lock/Unlock with PR_EnterMonitor/ExitMonitor, respectively
0002-Ticket-48406-Avoid-self-deadlock-by-PR_Lock-conn-c_m.2.patch

I think this may need a rebase to master. I cannot apply the patch.

{{{

git am ~/Downloads/0002-Ticket-48406-Avoid-self-deadlock-by-PR_Lock-conn-c_m.2.patch
Applying: Ticket #48406 - Avoid self deadlock by PR_Lock(conn->c_mutex)
/home/wibrown/development/389ds/ds/.git/rebase-apply/patch:1424: trailing whitespace.
PRMonitor c_mutex; / protect each conn structure; need to be re-entrant */
error: patch failed: ldap/servers/slapd/pblock.c:223
error: ldap/servers/slapd/pblock.c: patch does not apply
Patch failed at 0001 Ticket #48406 - Avoid self deadlock by PR_Lock(conn->c_mutex)
The copy of the patch that failed is found in:
/home/wibrown/development/389ds/ds/.git/rebase-apply/patch

223 · · · memset( value, 0, sizeof( PRNetAddr ));
224 · · · break;
225 · · }
226 · · / For fields with atomic access, remove the PR_Lock(c_mutex) /
227 · · if ( pblock->pb_conn->cin_addr == NULL ) {
228 · · · memset( value, 0, sizeof( PRNetAddr ));
229 · · } else {
230 · · · ((PRNetAddr )value) = *(pblock->pb_conn->cin_addr);
231 · · }

}}}

Looks like none of the approaches causes a performance regression, so I'm in favor of PR_Monitor.

The next step would be to do a performance test that stresses the mutex - hundreds of simultaneous connections, and many operations that have multiple threads per connection. That should tell us if PR_Monitor causes a performance regression.

I'm also in favour of PR_Monitor, although the thread private approach looks good, the PR_Monitor is widely in use and tested, so we should be on the safe side for functionality.

Regarding performance, it would be good to have no performance impact, but we have a deadlock to solve and as long as we have no other idea how to do it, we will probably have to accept the impact if it exists.

Compared the 3 versions {master, take 2 - thread data, take 3 - PR_Monitor} by the intensive random binds by ldclt [1]
{{{
- binds/thread binds/sec total binds
Master 81769.00 817.69 163538
Thread data 83811.00 838.11 167622
PR_Monitor 82997.00 829.97 165994
}}}
I would say the results are almost the same. Per discussion in the team, I'm thinking to push take 3 PR_Monitor patch.

[1]: ldclt ... -e bindeach,bindonly -D uid=testXXXX,dc=example,dc=com -w passwordXXXX -e randombinddn,randombinddnlow=0,randombinddnhigh=9999 -n 2 -N 20

I agree, pr_monitor is the best solution.

Reviewed by Rich, Ludwig, and William (Thank you!!!)

Pushed to master:
dd90d19..f25f804 master -> master
commit f25f804

Pushed to 389-ds-base-1.3.4:
b31b676..84da7d0 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit 84da7d0

Pushed to 389-ds-base-1.3.3:
e62ef7f..ad08d1f 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit ad08d1f

Pushed to 389-ds-base-1.2.11:
44413ce..f69d19a 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit f69d19a

Metadata Update from @nhosoi:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.2.11.33

3 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1737

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

2 months ago

Login to comment on this ticket.

Metadata