#47599 Reduce lock scope in retro changelog plug-in
Closed: Fixed None Opened 6 years ago by nkinder.

Looks like the locking in the retro changelog code is too big - write_replog_db - the retrocl_internal_lock lock scope is very large. Or perhaps even get rid of locking and use atomics.

We are seeing deadlock issues related to this as described in the following FreeIPA ticket:

https://fedorahosted.org/freeipa/ticket/3967

I thought PRUint64 operations were not atomic? I think it would be better to use the slapi_counter API - that way the slapi_counter API will take care of locking

Replying to [comment:6 rmeggins]:

I thought PRUint64 operations were not atomic?

I was under the impression of the exact opposite, and this is exactly how slapi_counters work as well(please take a look). I'd prefer to keep what I have, as it's much simpler than bringing in all the slapi_counter stuff. No? Thoughts?

Replying to [comment:7 mreynolds]:

Replying to [comment:6 rmeggins]:

I thought PRUint64 operations were not atomic?

I was under the impression of the exact opposite, and this is exactly how slapi_counters work as well(please take a look).
{{{
/
* slapi_counter_add()

* Atomically add a value to a Slapi_Counter.
/
PRUint64 slapi_counter_add(Slapi_Counter
counter, PRUint64 addvalue)
{
PRUint64 newvalue = 0;
...

ifdef LINUX

newvalue = __sync_add_and_fetch(&(counter->value), addvalue);

...
return newvalue;
}}}
So no, PRUint64 newvalue++ is not atomic.

I'd prefer to keep what I have, as it's much simpler than bringing in all the slapi_counter stuff. No? Thoughts?

Replying to [comment:8 rmeggins]:

Replying to [comment:7 mreynolds]:

Replying to [comment:6 rmeggins]:

I thought PRUint64 operations were not atomic?

I was under the impression of the exact opposite, and this is exactly how slapi_counters work as well(please take a look).
{{{
/
* slapi_counter_add()

* Atomically add a value to a Slapi_Counter.
/
PRUint64 slapi_counter_add(Slapi_Counter
counter, PRUint64 addvalue)
{
PRUint64 newvalue = 0;
...

ifdef LINUX

newvalue = __sync_add_and_fetch(&(counter->value), addvalue);

...
return newvalue;
}}}
So no, PRUint64 newvalue++ is not atomic.

Yeah, you're absolutely right. I really thought a PRUint64 was atomic, it comes up all the time when there's talk about lock-free access(hazard pointers, etc). I obviously misunderstood some of it. Anyway I'm reworking the patch.

I'd prefer to keep what I have, as it's much simpler than bringing in all the slapi_counter stuff. No? Thoughts?

Replying to [comment:9 mreynolds]:

Replying to [comment:8 rmeggins]:

Replying to [comment:7 mreynolds]:

Replying to [comment:6 rmeggins]:

I thought PRUint64 operations were not atomic?

I was under the impression of the exact opposite, and this is exactly how slapi_counters work as well(please take a look).
{{{
/
* slapi_counter_add()

* Atomically add a value to a Slapi_Counter.
/
PRUint64 slapi_counter_add(Slapi_Counter
counter, PRUint64 addvalue)
{
PRUint64 newvalue = 0;
...

ifdef LINUX

newvalue = __sync_add_and_fetch(&(counter->value), addvalue);

...
return newvalue;
}}}
So no, PRUint64 newvalue++ is not atomic.

Yeah, you're absolutely right. I really thought a PRUint64 was atomic, it comes up all the time when there's talk about lock-free access(hazard pointers, etc). I obviously misunderstood some of it. Anyway I'm reworking the patch.

I'd prefer to keep what I have, as it's much simpler than bringing in all the slapi_counter stuff. No? Thoughts?

New patch is attached.

Thanks,
Mark

With slapi_counter, operations on single counters are atomic. However, operations on multiple counters are not.

{{{
void retrocl_commit_changenumber(void)
{
if ( slapi_counter_get_value(retrocl_first_cn) == 0) {
slapi_counter_set_value(retrocl_first_cn, slapi_counter_get_value(retrocl_internal_cn));
}
}
}}}

This is not atomic, because the value of retrocl_first_cn could change after testing it for 0.

Same with this:
{{{
if(slapi_counter_get_value(retrocl_internal_cn) <= slapi_counter_get_value(retrocl_first_cn)){
}}}
The value of retrocl_internal_cn could change before doing slapi_counter_get_value(retrocl_first_cn) and the <= test.

Replying to [comment:11 rmeggins]:

With slapi_counter, operations on single counters are atomic. However, operations on multiple counters are not.

{{{
void retrocl_commit_changenumber(void)
{
if ( slapi_counter_get_value(retrocl_first_cn) == 0) {
slapi_counter_set_value(retrocl_first_cn, slapi_counter_get_value(retrocl_internal_cn));
}
}
}}}

This is not atomic, because the value of retrocl_first_cn could change after testing it for 0.

Same with this:
{{{
if(slapi_counter_get_value(retrocl_internal_cn) <= slapi_counter_get_value(retrocl_first_cn)){
}}}
The value of retrocl_internal_cn could change before doing slapi_counter_get_value(retrocl_first_cn) and the <= test.

Hmm, I see. Looks like we need to use a RW lock instead of using the slapi counter approach. I'm not sure how else to do it.

This adds additional locking to retrocl_assign_changenumber(). The only place where retrocl_assign_changenumber is called is from write_replog_db(), inside of the PR_Lock(retrocl_internal_lock). I'm not sure if this will solve the ipa deadlock, and not introduce any additional deadlocks.

Hi,

If this ticket is for reducing a critical section but not the hang, please skip the following.

Just a dummy remark regarding the hang (https://fedorahosted.org/freeipa/attachment/ticket/3967/ds.backtrace). It looks to me that the thread 28 (ssl handshake) is waiting for a lock aquired by thread 15 (applying a IPA set password in the retrocl). Thread 15 is waiting for a db lock (used by a cursor). I did not see any others threads in db. One possibility of the hang could be an aborted task (export/import/index..).

Replying to [comment:14 tbordaz]:

Hi,

If this ticket is for reducing a critical section but not the hang, please skip the following.

Just a dummy remark regarding the hang (https://fedorahosted.org/freeipa/attachment/ticket/3967/ds.backtrace). It looks to me that the thread 28 (ssl handshake) is waiting for a lock aquired by thread 15 (applying a IPA set password in the retrocl). Thread 15 is waiting for a db lock (used by a cursor). I did not see any others threads in db. One possibility of the hang could be an aborted task (export/import/index..).

This is a definitely a possibility, but we don't have any DS logs to see if any tasks were issued/aborted. So for now this patch is just to improve the locking. Once all of the retrocl patches are done, another round of testing will be done, and we'll go from there.

Replying to [comment:13 rmeggins]:

This adds additional locking to retrocl_assign_changenumber(). The only place where retrocl_assign_changenumber is called is from write_replog_db(), inside of the PR_Lock(retrocl_internal_lock). I'm not sure if this will solve the ipa deadlock, and not introduce any additional deadlocks.

Reworked the RW locking around retrocl_assign_changenumber() && retrocl_update_lastchangenumber(). Now I release the write lock before calling ldbm_back_seq(), and lock it after returning. It is safe to release the lock without fear of the CN changing, because we are now inside the the big internal lock, so everything(CN updates) are serialized at this stage.

So now the RW lock is only held while making an update to the changenumber, nothing else. There should not be any risk of deadlock now.

New patch attached.

git merge retro
Updating 2971ca1..e2c42bc
Fast-forward
ldap/servers/plugins/retrocl/retrocl.c | 3 +-
ldap/servers/plugins/retrocl/retrocl.h | 1 +
ldap/servers/plugins/retrocl/retrocl_cn.c | 42 ++++++++++++++++++++---------
ldap/servers/plugins/retrocl/retrocl_po.c | 2 +-
4 files changed, 33 insertions(+), 15 deletions(-)

git push origin master
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.37 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
2971ca1..e2c42bc master -> master

commit e2c42bc
Author: Mark Reynolds mreynolds@redhat.com
Date: Wed Nov 20 09:08:50 2013 -0500

1.3.2
d197eca..a87c3ee 389-ds-base-1.3.2 -> 389-ds-base-1.3.2

1.3.1
bb658b8..03f6347 389-ds-base-1.3.1 -> 389-ds-base-1.3.1

89ae932..b19239f 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit b19239f
Author: Rich Megginson rmeggins@redhat.com
Date: Fri Nov 22 16:51:55 2013 -0700
d1d90fd..f4d5900 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit f4d5900
Author: Rich Megginson rmeggins@redhat.com
Date: Fri Nov 22 16:51:55 2013 -0700
111b807..b330876 master -> master
commit b330876
Author: Rich Megginson rmeggins@redhat.com
Date: Fri Nov 22 16:51:55 2013 -0700

git merge ticket47599
Updating 51b1b83..a16bf1b
Fast-forward
ldap/servers/slapd/back-ldbm/seq.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 678 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
51b1b83..a16bf1b master -> master

commit a16bf1b
Author: Mark Reynolds mreynolds@redhat.com
Date: Mon Nov 25 09:36:25 2013 -0500

git push origin 389-ds-base-1.3.2
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 727 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
cbac612..c7e7c68 389-ds-base-1.3.2 -> 389-ds-base-1.3.2

0001-Ticket-47599-fix-memory-leak.patch: Pushed to 389-ds-base-1.3.1:
5c649dd..08dc37d 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 08dc37d

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.3.1.14

3 years ago

Login to comment on this ticket.

Metadata