#48298 ns-slapd crash during ipa-replica-manage del
Closed: wontfix None Opened 8 years ago by nhosoi.

Description of problem:

During some ipa-replica-manage del tests, I see ns-slapd crash.

Core was generated by `/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-TEST -i
/var/run/dirsrv/slapd-'.
Program terminated with signal 11, Segmentation fault.
#0  slapi_sdn_copy (from=0x0, to=to@entry=0x7f526809b390) at
ldap/servers/slapd/dn.c:2474
2474            if (from->udn)

How reproducible:
unknown

Steps to Reproduce:
Exact steps unknown.

I saw the crash in a 5 node IPA server environment setup in a simple chain:

R1-M-R2-R3-R4

When I saw the crash it was on R3 doing ipa-replica-manage del R4.

Actual results:
ns-slapd crashes.

Expected results:
no crash

In agmt_get_replarea(const Repl_Agmt *ra) - all access to ra or ra->replarea or replarea must be done inside the lock - once you copy slapi_sdn_copy(ra->replarea, return_value) you can release the lock.
For example:
{{{

986         replarea = ra->replarea;

984 987 PR_Unlock(ra->lock);
988 if (replarea) {
}}}
It is unsafe to access replarea like this because replarea points at ra->replarea and may be altered or freed outside of the ra->lock. Same with get_agmt_status().

Thank you, Rich. For the code in your comment:3 (#comment:3), I meant this... :p
{{{
diff --git a/ldap/servers/plugins/replication/repl5_agmt.c b/ldap/servers/plugins/replication/repl5_a
index 7188e40..33a8427 100644
--- a/ldap/servers/plugins/replication/repl5_agmt.c
+++ b/ldap/servers/plugins/replication/repl5_agmt.c
@@ -987,7 +987,7 @@ agmt_get_replarea(const Repl_Agmt *ra)
PR_Unlock(ra->lock);
if (replarea) {
return_value = slapi_sdn_new();
- slapi_sdn_copy(ra->replarea, return_value);
+ slapi_sdn_copy(replarea, return_value);
}
return return_value;
}
}}}

Do you think get_agmt_status is still unsafe even though replarea is checked in agmt_get_replarea and is not accessed in the function after that?

Replying to [comment:4 nhosoi]:

Thank you, Rich. For the code in your comment:3 (#comment:3), I meant this... :p
{{{
diff --git a/ldap/servers/plugins/replication/repl5_agmt.c b/ldap/servers/plugins/replication/repl5_a
index 7188e40..33a8427 100644
--- a/ldap/servers/plugins/replication/repl5_agmt.c
+++ b/ldap/servers/plugins/replication/repl5_agmt.c
@@ -987,7 +987,7 @@ agmt_get_replarea(const Repl_Agmt *ra)
PR_Unlock(ra->lock);
if (replarea) {
return_value = slapi_sdn_new();
- slapi_sdn_copy(ra->replarea, return_value);
+ slapi_sdn_copy(replarea, return_value);
}
return return_value;
}
}}}

Do you think get_agmt_status is still unsafe even though replarea is checked in agmt_get_replarea and is not accessed in the function after that?

If is it possible that ra->replarea could be modified or freed outside of ra->lock, then it doesn't matter if you use ra->replarea or replarea since replarea is just a pointer to ra->replarea and not a copy. You have to do the copy inside the lock, as in the original code.

Replying to [comment:5 rmeggins]:

If is it possible that ra->replarea could be modified or freed outside of ra->lock, then it doesn't matter if you use ra->replarea or replarea since replarea is just a pointer to ra->replarea and not a copy. You have to do the copy inside the lock, as in the original code.

Well, that's true. How about the revised patch?

Replying to [comment:6 nhosoi]:

Replying to [comment:5 rmeggins]:

If is it possible that ra->replarea could be modified or freed outside of ra->lock, then it doesn't matter if you use ra->replarea or replarea since replarea is just a pointer to ra->replarea and not a copy. You have to do the copy inside the lock, as in the original code.

Well, that's true. How about the revised patch?

agmt_get_replarea looks good now.

However, get_agmt_status is problematic. There are several places where ra-> is accessed without being locked. I don't know if it will have the same race condition. Ideally, there would be no access of ra-> at all outside of the lock. But you can't just lock ra->lock for the entire function call - you would have to release it to call agmt_get_replarea(), then acquire the lock again if that function returns non-null.

You have to be careful about having such a big lock in agmt_delete - it should be fine as long as none of the other functions it calls locks ra->lock.

get_agmt_status is better - but you still cannot access replarea outside of the lock. Maybe call agmt_get_replarea() as the very first thing in that function, outside of the lock? Then you will either have a Slapi_DN *replarea_sdn copy of ra->replarea, which is safe to use anywhere, or NULL, in which case you can just skip all of the other processing.

Replying to [comment:8 rmeggins]:

You have to be careful about having such a big lock in agmt_delete - it should be fine as long as none of the other functions it calls locks ra->lock.
slapi_config_remove_callback(get_agmt_status) and repl_session_plugin_call_destroy_agmt_cb locks in it, but the rest looks ok...

get_agmt_status is better - but you still cannot access replarea outside of the lock.
Maybe call agmt_get_replarea() as the very first thing in that function, outside of the lock? Then you will either have a Slapi_DN *replarea_sdn copy of ra->replarea, which is safe to use anywhere, or NULL, in which case you can just skip all of the other processing.

Okay, one more revised patch is coming...

{{{

2691 2702 PR_Lock(ra->lock);
2692 replarea = ra->replarea;
2703 replarea_sdn = agmt_get_replarea(ra);
}}}
This won't work - agmt_get_replarea will also do PR_Lock(ra->lock);

Replying to [comment:10 rmeggins]:

{{{

2691 2702 PR_Lock(ra->lock);
2692 replarea = ra->replarea;
2703 replarea_sdn = agmt_get_replarea(ra);
}}}
This won't work - agmt_get_replarea will also do PR_Lock(ra->lock);
Sigh... Then, I'd like to resume the previous code in terms of ra->replarea. The address of "replarea = ra->replarea;" was just for checking whether the core part of get_agmt_status should be skipped or not. And ra->replarea was re-examined in agmt_get_replarea in PR_Lock(ra->lock). So, I think it's safe...

git patch file (master) -- merged the 2 patches and revised to avoid the self deadlock.
0001-Ticket-48298-ns-slapd-crash-during-ipa-replica-manag.3.patch

Replying to [comment:12 rmeggins]:

Looks good!
Thank you soooo much for your help, Rich!

Unfortunately, it did not pass the IPA tests.

{{{
699 if (!repl_sdn) {
700 return -1;
701 }
}}}
need to prot_free(&prot) here before the return.

{{{
123 if (ro) {
124 exists = 1;
125 }
}}}
Need to object_release(ro) in the if clause.

{{{

2371 2371 val.bv_len = PR_snprintf(data, sizeof(data), "CLEANRUV%d", clean_data->rid);
2372 2372 sdn = agmt_get_replarea(agmt);
2373 if (!sdn) {
2374 conn_delete_internal_ext(conn);
2375 return;
2376 }
}}}
Need to free val.bv_len. Or, get replarea and check for !sdn before allocating val.bv_len.

{{{

410     area_sdn = agmt_get_replarea(prp->agmt); 
411     if (!area_sdn) { 
412         conn_disconnect(prp->conn); 
413         goto done; 
414     } 
415  
416     pb = slapi_pblock_new ();

}}}

I think you should check for !area_sdn very early in this function, before allocating a bunch of stuff, and before actually acquiring the consumer and setting the status. Same with windows_update_local_entry() and windows_search_local_entry_by_uniqueid() - better to stop processing as early as possible.

I ran tests with valgrind. No new leaks on the agreement/protocol were reported.

{{{
"failed to get local sbtree from agreement\n");
}}}
subtree

Otherwise, ack

git patch file (master) -- different approach -- revised following the reviews by Rich (Thank you!!) including fixing "sbtree" :)
0001-Ticket-48298-ns-slapd-crash-during-ipa-replica-manag.4.patch

Reviewed by Rich (Thanks!!)

Pushed to master:
6e45391..3cbdfa6 master -> master
commit 3cbdfa6

Pushed to 389-ds-base-1.3.4:
96b9b67..f09eb8c 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit f09eb8c

Metadata Update from @nhosoi:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.3.4 backlog

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1629

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago

Login to comment on this ticket.

Metadata