#50394 Ticket 50389 - ns-slapd craches while two threads are polling the same connection
Closed 3 years ago by spichugi. Opened 4 years ago by tbordaz.
tbordaz/389-ds-base Ticket-50389  into  master

@@ -931,7 +931,7 @@ 

  #define CONN_DONE 3

  #define CONN_TIMEDOUT 4

  

- #define CONN_TURBO_TIMEOUT_INTERVAL 400 /* milliseconds */

+ #define CONN_TURBO_TIMEOUT_INTERVAL 100 /* milliseconds */

  #define CONN_TURBO_TIMEOUT_MAXIMUM 5 /* attempts * interval IE 2000ms with 400 * 5 */

  #define CONN_TURBO_CHECK_INTERVAL 5      /* seconds */

  #define CONN_TURBO_PERCENTILE 50         /* proportion of threads allowed to be in turbo mode */
@@ -1187,7 +1187,9 @@ 

                  pr_pd.fd = (PRFileDesc *)conn->c_prfd;

                  pr_pd.in_flags = PR_POLL_READ;

                  pr_pd.out_flags = 0;

+                 PR_Lock(conn->c_pdumutex);

                  ret = PR_Poll(&pr_pd, 1, timeout);

+                 PR_Unlock(conn->c_pdumutex);

                  waits_done++;

                  /* Did we time out ? */

                  if (0 == ret) {

@@ -1940,6 +1940,8 @@ 

   * or something goes seriously wrong.  Otherwise, return 0.

   * If -1 is returned, PR_GetError() explains why.

   * Revision: handle changed to void * to allow 64bit support

+  *

+  * Caller (flush_ber) must hold conn->c_pdumutex

   */

  static int

  slapd_poll(void *handle, int output)

Bug Description:
nspr IO is not multi-threaded safe.
389-ds should not be in a situation where several threads are polling
a same connection at the same time.
The scenario is a worker send back an operation result at the same time
another worker wants to read an incoming request.

Fix Description:
The fix consist in synchonizing polling with c_pdumutex.

The thread that sends data (flush_ber) hold c_pdumutex.

The thread that reads the data does a non blocking read. It then
enforce ioblocktimeout with iteration of poll.
The reading thread must hold c_pdumutex during poll to synchronize
with the reader thread.
The reading thread must poll with a small timeout
(CONN_TURBO_TIMEOUT_INTERVAL). In order to not block
the thread that send back data, the fix reduces the delay to 0.1s.

https://pagure.io/389-ds-base/issue/50389

Reviewed by: ?

Platforms tested: F28

Flag Day: no

Doc impact: no

LGTM as well! But we should not forget to add a test case once we have a reliable reproducer -- I would keep the associated issue open or file a new one solely for the test case. Thanks.

@mhonek It's hard to get reproducers for connection code :S especially where timing and races are involved :(

Anyway, ack from me too. Looks like you're popular @tbordaz :)

rebased onto 847b65e01ffe0f776e045e07a46ce494a253c260

4 years ago

@mreynolds , @mhonek , @firstyear thank you all for your review and ACK.
I think connection handling is regular popular component :) . I hope it will be less popular with nunc stans re-enabled by default !

It is possible there is a reproducer and waiting some feedback.
If there is no reproducer I will merge the PR.
If there is a reproducer the merge will be delayed to get the result.

@tbordaz nunc-stans is certainly on my todo list, I would like that cleaned up and merged, but I remember there were some issues with replication I would like to investigate.

rebased onto 2886ba7

4 years ago

Finally there is no identified reproducer. Merging the PR. Thanks for the reviews

Pull-Request has been merged by tbordaz

4 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This pull request has been cloned to Github as issue and is available here:
- https://github.com/389ds/389-ds-base/issues/3452

If you want to continue to work on the PR, please navigate to the github issue,
download the patch from the attachments and file a new pull request.

Thank you for understanding. We apologize for all inconvenience.

Pull-Request has been closed by spichugi

3 years ago