#50493 connection_is_free to trylock
Closed a year ago by spichugi. Opened 2 years ago by firstyear.
firstyear/389-ds-base 50459-c_table-trylock  into  master

@@ -749,10 +749,13 @@ 

  int

  connection_is_free(Connection *conn, int use_lock)

  {

-     int rc;

+     int rc = 0;

  

      if (use_lock) {

-         pthread_mutex_lock(&(conn->c_mutex));

+         /* If the lock is held, someone owns this! */

+         if (pthread_mutex_trylock(&(conn->c_mutex)) != 0) {

+             return 0;

+         }

      }

      rc = conn->c_sd == SLAPD_INVALID_SOCKET && conn->c_refcnt == 0 &&

           !(conn->c_flags & CONN_FLAG_CLOSING);

@@ -115,12 +115,24 @@ 

  connection_table_get_connection(Connection_Table *ct, int sd)

  {

      Connection *c = NULL;

-     int index, count;

+     size_t index = 0;

+     size_t count = 0;

  

      PR_Lock(ct->table_mutex);

  

+     /*

+      * We attempt to loop over the ct twice, because connection_is_free uses trylock

+      * and some resources *might* free in the time needed to loop around.

+      */

+     size_t ct_max_loops = ct->size * 2;

+ 

+     /*

+      * This uses sd as entropy to randomly start inside the ct rather than

+      * always head-loading the list. Not my idea, but it's probably okay ...

+      */

      index = sd % ct->size;

-     for (count = 0; count < ct->size; count++, index = (index + 1) % ct->size) {

+ 

+     for (count = 0; count < ct_max_loops; count++, index = (index + 1) % ct->size) {

          /* Do not use slot 0, slot 0 is head of the list of active connections */

          if (index == 0) {

              continue;
@@ -132,7 +144,8 @@ 

          }

      }

  

-     if (count < ct->size) {

+     /* If count exceeds max loops, we didn't find something into index. */

+     if (count < ct_max_loops) {

          /* Found an available Connection */

          c = &(ct->c[index]);

          PR_ASSERT(c->c_next == NULL);

Bug Description: Due to the nature of the connection table
being single threaded, in connection_is_free, we would iterate
over the CT attempting to lock and check connection free states.
However, because this required the lock, if the connection was
currently in io, or other operations, the ct would delay behind
the c_mutex until it was released, then we would check the free
state.

Fix Description: Change the connection_is_free to use trylock
instead of lock - this means if the connection is locked it's
probably inuse and we can skip over it directly. We also change the
fn to iterate over the ct twice to check for possible connections
incase something frees up.

2 new commits added

  • Ticket 50493 - connection_is_free to trylock
  • Ticket 50459 - Correct issue with allocation state
2 years ago

Here are benchmarks:

Previous:

ldclt[14209]: Average rate: 4307.20/thr  (4307.20/sec), total:  43072
ldclt[14209]: Average rate: 4165.60/thr  (4165.60/sec), total:  41656
ldclt[14209]: Average rate: 3940.70/thr  (3940.70/sec), total:  39407
ldclt[14209]: Average rate: 3775.90/thr  (3775.90/sec), total:  37759
ldclt[14209]: Average rate: 3757.50/thr  (3757.50/sec), total:  37575
ldclt[14209]: Average rate: 3750.50/thr  (3750.50/sec), total:  37505
ldclt[14209]: Average rate: 3644.60/thr  (3644.60/sec), total:  36446
ldclt[14209]: Average rate: 3734.90/thr  (3734.90/sec), total:  37349
ldclt[14209]: Average rate: 3700.70/thr  (3700.70/sec), total:  37007
ldclt[14209]: Average rate: 3679.80/thr  (3679.80/sec), total:  36798
ldclt[14209]: Number of samples achieved. Bye-bye...
ldclt[14209]: All threads are dead - exit.
ldclt[14209]: Global average rate: 38457.40/thr  (3845.74/sec), total: 384574

With trylock instead

ldclt[68258]: Average rate: 4835.70/thr  (4835.70/sec), total:  48357
ldclt[68258]: Average rate: 6399.50/thr  (6399.50/sec), total:  63995
ldclt[68258]: Average rate: 6320.50/thr  (6320.50/sec), total:  63205
ldclt[68258]: Average rate: 6176.90/thr  (6176.90/sec), total:  61769
ldclt[68258]: Average rate: 6196.40/thr  (6196.40/sec), total:  61964
ldclt[68258]: Average rate: 6072.40/thr  (6072.40/sec), total:  60724
ldclt[68258]: Average rate: 6139.20/thr  (6139.20/sec), total:  61392
ldclt[68258]: Average rate: 5905.30/thr  (5905.30/sec), total:  59053
ldclt[68258]: Average rate: 5991.50/thr  (5991.50/sec), total:  59915
ldclt[68258]: Average rate: 5976.10/thr  (5976.10/sec), total:  59761
ldclt[68258]: Number of samples achieved. Bye-bye...
ldclt[68258]: All threads are dead - exit.
ldclt[68258]: Global average rate: 60013.50/thr  (6001.35/sec), total: 600135

Note no thread failures, and the increase in performance by 35%.

This will be of great interest to @tbordaz and @mreynolds. Note this does not use nunc-stans, this is just handling the conntable better in conn allocation.

@firstyear, this is a nice idea. Just some clarification about the performance gain.
My understanding is that the fix will accelerate the new connection setup. So I suspect the ldclt param establish a new connection for each operation. Right ?

In the loop, connection_is_free is a kind of fallback, when the connection is not CONN_STATE_FREE. This fallback checks if the connection slot is about to be free.
Don't you think we can get rid of this fallback. i.e. just iterate testing for a CONN_STATE_FREE ?

Yes, I have this configured for new connection per operation in my ldclt setup. So this improves new connection performance.

No, conn_state_free is actually what lead to my confusion in https://pagure.io/389-ds-base/pull-request/50492. Previously with the PR_Monitor, if the monitor was allocated, the memory for the connection was "setup", and then could be reused. So conn_state_free is about that conn-table slot's allocation state, not about it's "usage" state. Conn_state_free means "never setup piece of 0 memory" and conn_state_init means "connection has been allocated, and now can be used.". So connection_is_free works on != CONN_STATE_FREE and is checking if the allocated connection is ready to be reused or not.

Of course, because the connection is != STATE_FREE, it may or may not be locked by another thread doing work on the connection - that's why the trylock vs lock is important because if the connection is locked, and we wait on trying to lock it, we delay new connection allocation behind any io-event of the connection we are waiting to lock. However, because the lock is locked, that is evidence the connection is in use, so trylock-ing means we can say "hey, that lock is already taken, someone is probably using it, lets move on". And that's what yields the insane performance improvement.

In the future I'd like to make the CONN_STATE flag indicate the connection state as a whole, not just it's allocation state, but that seemed like "too big" of a change for something that should be smaller incremental parts.

Hope that helps,

@firstyear, the more I read this part of code the more I loose the overall logic :)

My understanding is that we have a connection table. At startup slots have connection structure but without allocated c_mutex and c_pdumutex, it is then flagged CONN_STATE_FREE. When this c_mutex and c_pdumutex will be allocated it is for the life duration of the server, not for the life duration of the connection.

Of course state==CONN_STATE_FREE means it is a free slot has it has never been used. For slot already used, the slot remains for ever CONN_STATE_INIT but we use a combinaison of status (in connection_is_free) to know if the slot is available.

I guess, pdumutex and c_mutex are allocated on demand for space saving reason. But it looks very minor saving regarding the size of 'struct conn'.

If I understand correctly that part of code, I find c_state useless. pdumutex and c_mutex should be allocated for the duration of the connection table (server life). Removing the test of c_state would simplify code. The availability of connection slot being only tested with connection_is_free.
If you agree with this cleanup we can open a separated ticket.

Your patch looks good to me (but likely needs a rebase). ACK

Check the patch again - c_state is needed, because the pthread_mutex struct is not a pointer, but allocated inside of the conn struct. So c_mutex can never be != NULL checked, because it's:

struct slapi_conn {
    pthread_mutex c_mutex,
}

Not

struct slapi_conn {
    pthread_mutex *c_mutex,
}

PS: I'm on PTO for two weeks now, but I'll check reviews and merge stuff as needed :)

rebased onto 9529cfc

2 years ago

Pull-Request has been merged by firstyear

2 years ago

Right I missed that #50459 change it from a pointer to a struct.
I opened https://pagure.io/389-ds-base/issue/50502 to evaluate the interest of c_state

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This pull request has been cloned to Github as issue and is available here:
- https://github.com/389ds/389-ds-base/issues/3550

If you want to continue to work on the PR, please navigate to the github issue,
download the patch from the attachments and file a new pull request.

Thank you for understanding. We apologize for all inconvenience.

Pull-Request has been closed by spichugi

a year ago