#48298 ns-slapd crash during ipa-replica-manage del
Closed: Fixed None Opened 3 years ago by nhosoi.

Description of problem:

During some ipa-replica-manage del tests, I see ns-slapd crash.

Core was generated by `/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-TESTRELM-TEST -i
/var/run/dirsrv/slapd-'.
Program terminated with signal 11, Segmentation fault.
#0  slapi_sdn_copy (from=0x0, to=to@entry=0x7f526809b390) at
ldap/servers/slapd/dn.c:2474
2474            if (from->udn)

How reproducible:
unknown

Steps to Reproduce:
Exact steps unknown.

I saw the crash in a 5 node IPA server environment setup in a simple chain:

R1-M-R2-R3-R4

When I saw the crash it was on R3 doing ipa-replica-manage del R4.

Actual results:
ns-slapd crashes.

Expected results:
no crash

In agmt_get_replarea(const Repl_Agmt *ra) - all access to ra or ra->replarea or replarea must be done inside the lock - once you copy slapi_sdn_copy(ra->replarea, return_value) you can release the lock.
For example:
{{{

986         replarea = ra->replarea;

984 987 PR_Unlock(ra->lock);
988 if (replarea) {
}}}
It is unsafe to access replarea like this because replarea points at ra->replarea and may be altered or freed outside of the ra->lock. Same with get_agmt_status().

Thank you, Rich. For the code in your comment:3 (#comment:3), I meant this... :p
{{{
diff --git a/ldap/servers/plugins/replication/repl5_agmt.c b/ldap/servers/plugins/replication/repl5_a
index 7188e40..33a8427 100644
--- a/ldap/servers/plugins/replication/repl5_agmt.c
+++ b/ldap/servers/plugins/replication/repl5_agmt.c
@@ -987,7 +987,7 @@ agmt_get_replarea(const Repl_Agmt *ra)
PR_Unlock(ra->lock);
if (replarea) {
return_value = slapi_sdn_new();
- slapi_sdn_copy(ra->replarea, return_value);
+ slapi_sdn_copy(replarea, return_value);
}
return return_value;
}
}}}

Do you think get_agmt_status is still unsafe even though replarea is checked in agmt_get_replarea and is not accessed in the function after that?

Replying to [comment:4 nhosoi]:

Thank you, Rich. For the code in your comment:3 (#comment:3), I meant this... :p
{{{
diff --git a/ldap/servers/plugins/replication/repl5_agmt.c b/ldap/servers/plugins/replication/repl5_a
index 7188e40..33a8427 100644
--- a/ldap/servers/plugins/replication/repl5_agmt.c
+++ b/ldap/servers/plugins/replication/repl5_agmt.c
@@ -987,7 +987,7 @@ agmt_get_replarea(const Repl_Agmt *ra)
PR_Unlock(ra->lock);
if (replarea) {
return_value = slapi_sdn_new();
- slapi_sdn_copy(ra->replarea, return_value);
+ slapi_sdn_copy(replarea, return_value);
}
return return_value;
}
}}}

Do you think get_agmt_status is still unsafe even though replarea is checked in agmt_get_replarea and is not accessed in the function after that?

If is it possible that ra->replarea could be modified or freed outside of ra->lock, then it doesn't matter if you use ra->replarea or replarea since replarea is just a pointer to ra->replarea and not a copy. You have to do the copy inside the lock, as in the original code.

Replying to [comment:5 rmeggins]:

If is it possible that ra->replarea could be modified or freed outside of ra->lock, then it doesn't matter if you use ra->replarea or replarea since replarea is just a pointer to ra->replarea and not a copy. You have to do the copy inside the lock, as in the original code.

Well, that's true. How about the revised patch?

Replying to [comment:6 nhosoi]:

Replying to [comment:5 rmeggins]:

If is it possible that ra->replarea could be modified or freed outside of ra->lock, then it doesn't matter if you use ra->replarea or replarea since replarea is just a pointer to ra->replarea and not a copy. You have to do the copy inside the lock, as in the original code.

Well, that's true. How about the revised patch?

agmt_get_replarea looks good now.

However, get_agmt_status is problematic. There are several places where ra-> is accessed without being locked. I don't know if it will have the same race condition. Ideally, there would be no access of ra-> at all outside of the lock. But you can't just lock ra->lock for the entire function call - you would have to release it to call agmt_get_replarea(), then acquire the lock again if that function returns non-null.

You have to be careful about having such a big lock in agmt_delete - it should be fine as long as none of the other functions it calls locks ra->lock.

get_agmt_status is better - but you still cannot access replarea outside of the lock. Maybe call agmt_get_replarea() as the very first thing in that function, outside of the lock? Then you will either have a Slapi_DN *replarea_sdn copy of ra->replarea, which is safe to use anywhere, or NULL, in which case you can just skip all of the other processing.

Replying to [comment:8 rmeggins]:

You have to be careful about having such a big lock in agmt_delete - it should be fine as long as none of the other functions it calls locks ra->lock.
slapi_config_remove_callback(get_agmt_status) and repl_session_plugin_call_destroy_agmt_cb locks in it, but the rest looks ok...

get_agmt_status is better - but you still cannot access replarea outside of the lock.
Maybe call agmt_get_replarea() as the very first thing in that function, outside of the lock? Then you will either have a Slapi_DN *replarea_sdn copy of ra->replarea, which is safe to use anywhere, or NULL, in which case you can just skip all of the other processing.

Okay, one more revised patch is coming...

{{{

2691 2702 PR_Lock(ra->lock);
2692 replarea = ra->replarea;
2703 replarea_sdn = agmt_get_replarea(ra);
}}}
This won't work - agmt_get_replarea will also do PR_Lock(ra->lock);

Replying to [comment:10 rmeggins]:

{{{

2691 2702 PR_Lock(ra->lock);
2692 replarea = ra->replarea;
2703 replarea_sdn = agmt_get_replarea(ra);
}}}
This won't work - agmt_get_replarea will also do PR_Lock(ra->lock);
Sigh... Then, I'd like to resume the previous code in terms of ra->replarea. The address of "replarea = ra->replarea;" was just for checking whether the core part of get_agmt_status should be skipped or not. And ra->replarea was re-examined in agmt_get_replarea in PR_Lock(ra->lock). So, I think it's safe...

git patch file (master) -- merged the 2 patches and revised to avoid the self deadlock.
0001-Ticket-48298-ns-slapd-crash-during-ipa-replica-manag.3.patch

Replying to [comment:12 rmeggins]:

Looks good!
Thank you soooo much for your help, Rich!

Unfortunately, it did not pass the IPA tests.

{{{
699 if (!repl_sdn) {
700 return -1;
701 }
}}}
need to prot_free(&prot) here before the return.

{{{
123 if (ro) {
124 exists = 1;
125 }
}}}
Need to object_release(ro) in the if clause.

{{{

2371 2371 val.bv_len = PR_snprintf(data, sizeof(data), "CLEANRUV%d", clean_data->rid);
2372 2372 sdn = agmt_get_replarea(agmt);
2373 if (!sdn) {
2374 conn_delete_internal_ext(conn);
2375 return;
2376 }
}}}
Need to free val.bv_len. Or, get replarea and check for !sdn before allocating val.bv_len.

{{{

410     area_sdn = agmt_get_replarea(prp->agmt); 
411     if (!area_sdn) { 
412         conn_disconnect(prp->conn); 
413         goto done; 
414     } 
415  
416     pb = slapi_pblock_new ();

}}}

I think you should check for !area_sdn very early in this function, before allocating a bunch of stuff, and before actually acquiring the consumer and setting the status. Same with windows_update_local_entry() and windows_search_local_entry_by_uniqueid() - better to stop processing as early as possible.

I ran tests with valgrind. No new leaks on the agreement/protocol were reported.

{{{
"failed to get local sbtree from agreement\n");
}}}
subtree

Otherwise, ack

git patch file (master) -- different approach -- revised following the reviews by Rich (Thank you!!) including fixing "sbtree" :)
0001-Ticket-48298-ns-slapd-crash-during-ipa-replica-manag.4.patch

Reviewed by Rich (Thanks!!)

Pushed to master:
6e45391..3cbdfa6 master -> master
commit 3cbdfa6

Pushed to 389-ds-base-1.3.4:
96b9b67..f09eb8c 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit f09eb8c

Metadata Update from @nhosoi:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.3.4 backlog

2 years ago

Login to comment on this ticket.

Metadata