#50788 Ticket 50786 - connection table freelist
Closed 3 years ago by spichugi. Opened 4 years ago by firstyear.
firstyear/389-ds-base 50786-conntable-freelist  into  master

@@ -156,20 +156,6 @@ 

  

      /* Finally, flag that we are clean - basically write a 0 ...*/

      conn->c_state = CONN_STATE_FREE;

- 

-     /*

-      * WARNING: There is a memory leak here! During a shutdown, connections

-      * can still have events in ns add io timeout job because of post connection

-      * or closing. The issue is that we can track the *jobs*, we only have the

-      * connection, and we can have 1:N connection:jobs. So we can lose IO jobs

-      * here. Thankfully, it's only for existing connections, and they are closed

-      * anyway, so it's just a mem / mutex leak.

-      *

-      * To fix it, involves the rewrite of connection handling, which will happen

-      * soon anyway, so please be patient while I undertake this!

-      *

-      * - wibrown December 2016.

-      */

  }

  

  /*

file modified
+147 -36
@@ -1,6 +1,7 @@ 

  /** BEGIN COPYRIGHT BLOCK

   * Copyright (C) 2001 Sun Microsystems, Inc. Used by permission.

   * Copyright (C) 2005 Red Hat, Inc.

+  * Copyright (C) 2019 William Brown <william@blackhats.net.au>

   * All rights reserved.

   *

   * License: GPL (version 3 or any later version).
@@ -11,7 +12,111 @@ 

  #include <config.h>

  #endif

  

- /* Connection Table */

+ /*

+  * Connection Table

+  *

+  * The connection table exists to serve a few purposes:

+  * 1. To prevent memory allocations of (reasonably) large structures in a high-churn part

+  *    of the codebase. (Connection from slap.h).

+  * 2. To allow monitoring to iterate over the current set of active connections that exist

+  *    in the server.

+  * 3. To provide free connection slots to connections that have just been accepted.

+  *

+  * Requirement 2 is the most interesting one, as it means that we need a central location

+  * to store the connections to allow metric gathering. Requirement 1 means less today in 2019

+  * due to the improvements in malloc, and to aid tools like ASAN.

+  *

+  * == history ==

+  * The connection table previously used a simple method to allocate. Allocation from the table

+  * can only be performed by one acceptor at a time, which is protected by the connection table lock.

+  * The table was iterated over and each connection was locked to determine if it was in use. With

+  * high CT sizes this iteration could take very long, especially as:

+  *

+  * 1. The CT was always iterated from 0 -> size, meaning the "bottom end" of the table was always

+  *    likely to be full, causing delays.

+  * 2. The connection lock itself is what is used to protect that socket's IO, so if a connection was

+  *    writing at the time, the CT would delay waiting on the connection IO, only to find that the

+  *    connection was allocated anyway.

+  *

+  * Clearly this is an issue. In mid 2019 this behaviour was subtly altered, such that when we

+  * attempted to lock we used pthread try_lock (nspr locks did not have this capability). If the try

+  * lock fails, the connection must be in use, so we can skip it. This yielded a 30%

+  * improvement in throughput.

+  *

+  * == current ==

+  * The new design is to create a parallel freelist of open slots into the table, which is protected

+  * under the connection table lock. This should move the allocation algorithm from O(n) worst case

+  * to O(1) worst case, as we always receive an empty slot *or* a "ct full" result. It also reduces

+  * lock/atomic contention on the CPU.

+  *

+  * The freelist is a ringbuffer of pointers to the connection table. On a small scale it looks like:

+  *

+  *  |--------------------------------------------|

+  *  | slot 1 | slot 2 | slot 3 | slot 4 | slot 5 |

+  *  | _ ptr  | _ ptr  | _ ptr  | _ ptr  | _ ptr  |

+  *  |--------------------------------------------|

+  *     ^  ^- conn_next

+  *     |

+  *     \-- conn_free

+  *

+  * As we allocate, we shift conn_next through the list, yielding the ptr that was stored (and

+  * setting it to NULL as we proceed).

+  *

+  *  |--------------------------------------------|

+  *  | slot 1 | slot 2 | slot 3 | slot 4 | slot 5 |

+  *  | _NULL  | _NULL  | _ ptr  | _ ptr  | _ ptr  |

+  *  |--------------------------------------------|

+  *     ^                  ^- conn_next

+  *     |

+  *     \-- conn_free

+  *

+  * When a connection is "freed" we return it to conn_free, which is then also slid up.

+  *

+  *  |--------------------------------------------|

+  *  | slot 1 | slot 2 | slot 3 | slot 4 | slot 5 |

+  *  | _ ptr  | _NULL  | _ ptr  | _ ptr  | _ ptr  |

+  *  |--------------------------------------------|

+  *              ^         ^- conn_next

+  *              |

+  *              \-- conn_free

+  *

+  * If all connections are exhausted, conn_next will == conn_free, as conn_next must have proceeded

+  * to the end of the ring, and then wrapped back allocating all previous slots until it meets conn_free.

+  *

+  *  |--------------------------------------------|

+  *  | slot 1 | slot 2 | slot 3 | slot 4 | slot 5 |

+  *  | _NULL  | _NULL  | _NULL  | _NULL  | _NULL  |

+  *  |--------------------------------------------|

+  *              ^ ^- conn_next

+  *              |

+  *              \-- conn_free

+  *

+  * This means allocations of conns will keep failing until a connection is returned.

+  *

+  *  |--------------------------------------------|

+  *  | slot 1 | slot 2 | slot 3 | slot 4 | slot 5 |

+  *  | _NULL  | _ ptr  | _NULL  | _NULL  | _NULL  |

+  *  |--------------------------------------------|

+  *             ^- conn_next ^

+  *                          |

+  *                          \-- conn_free

+  *

+  * And now conn_next can begin to allocate again.

+  *

+  *

+  *  -- invariants

+  * * when conn_free is slid back to meet conn_next, there can be no situation where another

+  *   connection is returned, as none can be allocated - if they were allocated, conn_free would have

+  *   moved along.

+  * * the ring buffer must be as large as the conntable.

+  * * We do not check conn_next == conn_free (that's the starting state), but we check if the

+  *   slot at conn_next is NULL, which implies that nothing has yet been returned via conn_free.

+  * * connection_table_move_connection_out_of_active_list is the only function able to return a

+  *   connection to the freelist, as it is the function that is called when the event system has

+  *   determined all IOs are complete, or unable to complete. This function is what prepares the

+  *   connection for re-use, which is why it's the only place the freelist can be appended to.

+  *

+  */

  

  #include "fe.h"

  
@@ -27,6 +132,11 @@ 

      ct->c = (Connection *)slapi_ch_calloc(1, table_size * sizeof(Connection));

      ct->fd = (struct POLL_STRUCT *)slapi_ch_calloc(1, table_size * sizeof(struct POLL_STRUCT));

      ct->table_mutex = PR_NewLock();

+     /* Allocate the freelist */

+     ct->c_freelist = (Connection **)slapi_ch_calloc(1, table_size * sizeof(Connection *));

+     /* NEVER use slot 0, this is a list pointer */

+     ct->conn_next_offset = 1;

+     ct->conn_free_offset = 1;

  

      /* We rely on the fact that we called calloc, which zeros the block, so we don't

       * init any structure element unless a zero value is troublesome later
@@ -56,15 +166,20 @@ 

           * careful with things like this ....

           */

          ct->c[i].c_state = CONN_STATE_FREE;

+         /* Place the connection into the freelist. */

+         ct->c_freelist[i] = &(ct->c[i]);

      }

+ 

      return ct;

  }

  

  void

  connection_table_free(Connection_Table *ct)

  {

-     int i;

-     for (i = 0; i < ct->size; i++) {

+     /* Release the freelist */

+     slapi_ch_free((void **)&ct->c_freelist);

+     /* Now release the connections. */

+     for (size_t i = 0; i < ct->size; i++) {

          /* Free the contents of the connection structure */

          Connection *c = &(ct->c[i]);

          connection_done(c);
@@ -78,8 +193,7 @@ 

  void

  connection_table_abandon_all_operations(Connection_Table *ct)

  {

-     int i;

-     for (i = 0; i < ct->size; i++) {

+     for (size_t i = 0; i < ct->size; i++) {

          if (ct->c[i].c_state != CONN_STATE_FREE) {

              pthread_mutex_lock(&(ct->c[i].c_mutex));

              connection_abandon_operations(&ct->c[i]);
@@ -114,40 +228,20 @@ 

  Connection *

  connection_table_get_connection(Connection_Table *ct, int sd)

  {

-     Connection *c = NULL;

-     size_t index = 0;

-     size_t count = 0;

- 

      PR_Lock(ct->table_mutex);

  

-     /*

-      * We attempt to loop over the ct twice, because connection_is_free uses trylock

-      * and some resources *might* free in the time needed to loop around.

-      */

-     size_t ct_max_loops = ct->size * 2;

- 

-     /*

-      * This uses sd as entropy to randomly start inside the ct rather than

-      * always head-loading the list. Not my idea, but it's probably okay ...

-      */

-     index = sd % ct->size;

- 

-     for (count = 0; count < ct_max_loops; count++, index = (index + 1) % ct->size) {

-         /* Do not use slot 0, slot 0 is head of the list of active connections */

-         if (index == 0) {

-             continue;

-         } else if (ct->c[index].c_state == CONN_STATE_FREE) {

-             break;

-         } else if (connection_is_free(&(ct->c[index]), 1 /*use lock */)) {

-             /* Connection must be allocated, check if it's okay */

-             break;

+     PR_ASSERT(ct->conn_next_offset != 0);

+     Connection *c = ct->c_freelist[ct->conn_next_offset];

+     if (c != NULL) {

+         /* We allocated it, so now NULL the slot and move forward. */

+         ct->c_freelist[ct->conn_next_offset] = NULL;

+         /* Handle overflow. */

+         ct->conn_next_offset = (ct->conn_next_offset + 1) % ct->size;

+         if (ct->conn_next_offset == 0) {

+             /* Never use slot 0 */

+             ct->conn_next_offset += 1;

          }

-     }

- 

-     /* If count exceeds max loops, we didn't find something into index. */

-     if (count < ct_max_loops) {

-         /* Found an available Connection */

-         c = &(ct->c[index]);

+         /* Now prep the slot for usage. */

          PR_ASSERT(c->c_next == NULL);

          PR_ASSERT(c->c_prev == NULL);

          PR_ASSERT(c->c_extension == NULL);
@@ -183,6 +277,7 @@ 

          slapi_log_err(SLAPI_LOG_CONNS, "connection_table_get_connection", "Max open connections reached\n");

      }

  

+     /* We could move this to before the c alloc as there is no point in remaining here. */

      PR_Unlock(ct->table_mutex);

  

      return c;
@@ -257,6 +352,10 @@ 

   * of connections. This function removes a connection (by index) from that

   * list. This list is used to find the connections that should be used in the

   * poll call.

+  *

+  * We only remove from the active list when the connection is ready to be reused,

+  * in other words, this is the "connection free" function. This is where we re-add

+  * the connection slot to the freelist for re-use.

   */

  int

  connection_table_move_connection_out_of_active_list(Connection_Table *ct, Connection *c)
@@ -315,8 +414,20 @@ 

  

      connection_release_nolock(c);

  

+     /* Clean the pointer. */

      connection_cleanup(c);

  

+     /* Finally, place the connection back into the freelist for use */

+     PR_ASSERT(c->c_refcnt == 0);

+     PR_ASSERT(ct->conn_free_offset != 0);

+     PR_ASSERT(ct->c_freelist[ct->conn_free_offset] == NULL);

+     ct->c_freelist[ct->conn_free_offset] = c;

+     ct->conn_free_offset = (ct->conn_free_offset + 1) % ct->size;

+     if (ct->conn_free_offset == 0) {

+         /* Never use slot 0 */

+         ct->conn_free_offset += 1;

+     }

+ 

      PR_Unlock(ct->table_mutex);

  

      slapi_log_err(SLAPI_LOG_CONNS, "connection_table_move_connection_out_of_active_list",

file modified
+4
@@ -82,6 +82,10 @@ 

      int size;

      /* An array of connections, file descriptors, and a mapping between them. */

      Connection *c;

+     /* An array of free connections awaiting allocation. */

+     Connection **c_freelist;

+     size_t conn_next_offset;

+     size_t conn_free_offset;

      struct POLL_STRUCT *fd;

      int n_tcps; /* standard socket start index in fd */

      int n_tcpe; /* standard socket last ( +1 ) index in fd */

Bug Description: To find an available slot, the connection table
previously iterated over the table attempting to find a free connection.
Under high congestion this yields poor performance, as we may need to walk
O(n) slots to find the "one free", and the algorithm allowed the table to
be walked twice, making it potentially an O(2n) worst case. To make this
worse, the walk attempted to "trylock" - better than before (which
really locked!), but the trylock still issues atomics that are costly.

Fix Description: Implement a freelist - at startup all connections are
free, and as they are allocated they are removed from the list. As they
are disconnected they are re-added. This makes the lookup of a connection
O(1), removes spurious atomic and locking behaviour, and helps to minimise
time under the conntable lock. In some test cases this is shown to
improve server throughput by at minimum 6%.
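
To make the scheme concrete, here is a minimal, self-contained sketch of
the ring-buffer freelist described above. The names (conn_table_t,
ct_alloc, ct_release, CT_SIZE) are illustrative only, not the patch's
identifiers; the real code lives in conntable.c, holds the table mutex
around both operations, and additionally never uses slot 0, since that
slot heads the active connection list.

#include <assert.h>
#include <stddef.h>

#define CT_SIZE 8 /* must equal the number of connection slots */

typedef struct conn conn_t;

typedef struct {
    conn_t *freelist[CT_SIZE]; /* ring of pointers into the conn table */
    size_t next_offset;        /* next slot to allocate from */
    size_t free_offset;        /* next slot to return a conn into */
} conn_table_t;

/* Allocate in O(1). Returns NULL when the table is exhausted, detected
 * by finding NULL at next_offset rather than by comparing the offsets
 * (equal offsets is also the all-free starting state). */
static conn_t *
ct_alloc(conn_table_t *ct)
{
    conn_t *c = ct->freelist[ct->next_offset];
    if (c != NULL) {
        ct->freelist[ct->next_offset] = NULL;
        ct->next_offset = (ct->next_offset + 1) % CT_SIZE;
    }
    return c;
}

/* Release in O(1). Because the ring is as large as the table, free_offset
 * can never advance onto an occupied slot; the assert checks this. */
static void
ct_release(conn_table_t *ct, conn_t *c)
{
    assert(ct->freelist[ct->free_offset] == NULL);
    ct->freelist[ct->free_offset] = c;
    ct->free_offset = (ct->free_offset + 1) % CT_SIZE;
}

At startup every ring slot is filled with a pointer to its corresponding
connection and both offsets start equal (the patch starts them at 1 to
skip slot 0).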

https://pagure.io/389-ds-base/issue/50786

Author: William Brown william@blackhats.net.au

Review by: ???

--

Tests were performed between two machines. The target, which was running ns-slapd, is a 4 core virtual machine with 4GB of ram on 40GB of nvme. The source of load is a 2 core virtual machine with 2GB of ram on nvme. They are connected via a single virtual switch capable of 10GBit of throughput, with no routers between them. The server has threadnumber 4 to match cores, and a conntable size of 8192. Both machines are adjusted to support a ulimit of 999999 open file descriptors.

/opt/dirsrv/bin/ldclt -h lt2.dev.blackhats.net.au -p 389 -n 128 -N 10 -f '(uid=user_XXXXXXX)' -e bindeach -e esearch,random -r1 -R6000 -I 32 -e randomattrlist=cn:uid:ou

Original

ldclt[19500]: Average rate:  579.18/thr  (7413.50/sec), total:  74135
ldclt[19500]: Average rate:  548.67/thr  (7023.00/sec), total:  70230
ldclt[19500]: Average rate:  573.40/thr  (7339.50/sec), total:  73395
ldclt[19500]: Average rate:  578.99/thr  (7411.10/sec), total:  74111
ldclt[19500]: Average rate:  576.16/thr  (7374.80/sec), total:  73748
ldclt[19500]: Average rate:  541.27/thr  (6928.30/sec), total:  69283
ldclt[19500]: Average rate:  601.46/thr  (7698.70/sec), total:  76987
ldclt[19500]: Average rate:  602.59/thr  (7713.10/sec), total:  77131
ldclt[19500]: Average rate:  566.91/thr  (7256.50/sec), total:  72565
ldclt[19500]: Average rate:  599.62/thr  (7675.10/sec), total:  76751
ldclt[19500]: Number of samples achieved. Bye-bye...
ldclt[19500]: All threads are dead - exit.
ldclt[19500]: Global average rate: 5768.25/thr  (7383.36/sec), total: 738336

Freelist

ldclt[19653]: Average rate:  617.07/thr  (7898.50/sec), total:  78985
ldclt[19653]: Average rate:  609.56/thr  (7802.40/sec), total:  78024
ldclt[19653]: Average rate:  653.90/thr  (8369.90/sec), total:  83699
ldclt[19653]: Average rate:  547.12/thr  (7003.20/sec), total:  70032
ldclt[19653]: Average rate:  648.23/thr  (8297.30/sec), total:  82973
ldclt[19653]: Average rate:  602.27/thr  (7709.10/sec), total:  77091
ldclt[19653]: Average rate:  636.16/thr  (8142.90/sec), total:  81429
ldclt[19653]: Average rate:  639.76/thr  (8188.90/sec), total:  81889
ldclt[19653]: Average rate:  623.45/thr  (7980.20/sec), total:  79802
ldclt[19653]: Average rate:  546.29/thr  (6992.50/sec), total:  69925
ldclt[19653]: Number of samples achieved. Bye-bye...
ldclt[19653]: All threads are dead - exit.
ldclt[19653]: Global average rate: 6123.82/thr  (7838.49/sec), total: 783849

This shows a 6% improvement in the connection throughput.
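
(For reference, the figure follows directly from the two global averages:
7838.49 / 7383.36 ≈ 1.062, i.e. roughly a 6.2% gain in operations per
second.)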

I may perform some more testing before we merge this.

Silly me, I left Rust comments here. I'll clean that up.

rebased onto f4d8386d2744a5d2696f0af6d5a1ce33870a72ba

4 years ago

rebased onto 592e03e5dfdcdb3da6cdca0dc5a227918bc1ba42

4 years ago

I won't be able to complete any more testing until early next year due to travel (but I'll still be working).

note to future william: test with various server thread and client cthread combinations.

Okay, I'm going to work on this again shortly, but still happy for people to review this change @tbordaz and @mreynolds especially :)

rebased onto e336b26865e4c11aaa0caf30cd4020269b1af5eb

4 years ago

Rebased to latest master, @tbordaz can you have a look at this? I have some more testing to do about the perf, but I'd appreciate your code review anyway.

Okay, so @tbordaz and @mreynolds I did another test, where I set the conntable to only have 8192 slots, then I used a remote machine with 128 threads to ldclt against it. The difference was:

w_ freelist
ldclt[31972]: Global average rate: 1889.02/thr  (2417.95/sec), total: 241795

w_out freelist
ldclt[32172]: Global average rate: 1667.45/thr  (2134.34/sec), total: 213434

So if that's correct, it's actually about a 13% improvement under high load :)

Anyway, at this point I'm pretty happy with that, so I'll ask you to do a review and decision on merging now!

The approach looks good to me, but I have a few questions. The change gets rid of e.g. checking "connection_is_free(..,use_lock,..)", which might be ok since we take it from the free list.
But the check in connection_is_free() also takes into account the refcount; if one op is done the connection might not yet be free.

What I would like to see is these two types of test:
- clients with asynchronous operations, where multiple ops on the same connection can overlap
- more client connections attempting to connect than slots in the connection table, to see if exhausting and recovering of the conn table or free list works

@lkrispen As always, these are good points.

Yes, we no longer need to use connection_is_free, because the only way for a connection to be in the freelist is for it to have passed through connection_table_move_connection_out_of_active_list(), which is the final step in making the connection free for re-use. Specifically, I account for this by checking on line 413 that refcnt == 0 with a PR_ASSERT in our development builds, to assert we are in the correct state.

A different framing: for connection_is_free() to return true, the connection must have been through connection_table_move_connection_out_of_active_list(), so with the freelist we no longer need to walk the table checking is_free on every slot.
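
As a sketch of that invariant (building on the illustrative ring-buffer
code earlier; ct_move_out_of_active_list and the refcnt field are
stand-ins, not the patch's code), the single return path verifies the
connection is fully quiesced before it becomes allocatable again:

/* Stand-in for the real Connection; only the field the invariant needs
 * is shown. This completes the forward-declared struct conn above. */
struct conn {
    int refcnt; /* operations still referencing this connection */
};

/* The one and only path back onto the freelist. The caller holds the
 * table mutex; by the time we get here the event system has decided no
 * further IO will occur on this connection. */
static void
ct_move_out_of_active_list(conn_table_t *ct, conn_t *c)
{
    /* ... unlink from the active list, clean up the connection ... */
    assert(c->refcnt == 0); /* nothing may still hold the connection */
    ct_release(ct, c);      /* now, and only now, it can be reused */
}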

I can also do some more testing with these extra scenarios - here's hoping I don't melt my home development server with all these threads :)

Okay, so let's look first at "more connections than slots". This really shows the connection table's ability to recycle short-lived connections (i.e. bind, single search) as fast as possible, which is what this patch aims to improve.

To do this I used ldclt with its max thread parameter (1000) against instances with a conntable max of 800.

w_out freelist
Global average rate:  867.41/thr  (8674.12/sec), total: 867412

w_ freelist
Global average rate: 1137.41/thr  (11374.13/sec), total: 1137413

Additionally, 139 of the 1000 threads on the w_out freelist version were "unable to connect at all", causing those ldclt threads to terminate prematurely. This is a great sign that we can handle much higher incoming load effectively.

Repeating the test with 512 connection table slots and 1000 threads, the following occurred.

w_out freelist
Global average rate:  875.00/thr  (8749.99/sec), total: 874999

w_ freelist
Global average rate: 1123.85/thr  (11238.51/sec), total: 1123851

Again, 155 threads failed early on the version without the freelist, but the freelist version was able to cycle the connections so quickly that it did not drop any.

To really stress this I tried with 128 conntable slots from 1000 threads:

w_out freelist
Global average rate:  855.68/thr  (8556.81/sec), total: 855681

w_ freelist
Global average rate:  850.29/thr  (8502.91/sec), total: 850291

Without the freelist, 546 connections were dropped; with the freelist, only 103 were dropped.

This shows that under high pressure, even at a ratio of 7 clients to 1 conntable slot, the freelisted connection table is able to allocate connections at such a high rate that many more clients can be served in extreme load conditions compared to the current strategy. This ability to sustain under load is really critical, and is what will make a massive difference. By comparison, the current strategy could only maintain about 3:1.
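
(Presumably these ratios come from the drop counts in the 128 slot test:
with the freelist, 1000 - 103 = 897 clients served over 128 slots is
about 7:1 sustained, while without it, 1000 - 546 = 454 served is about
3.5:1.)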

For the async operation test, I set ldclt to have -a 2 for two async searches at a time, with a connection table of 512 and 512 client threads.

/opt/dirsrv/bin/ldclt -h lt2.dev.blackhats.net.au -p 389 -a 2 -n 512 -N 10 -f '(uid=user_XXXXXXX)' -e bindeach -e esearch,random -r1 -R6000 -I 32 -e randomattrlist=cn:uid:ou

w_out freelist
Global average rate:    4.00/thr  ( 20.48/sec), total:   2048

w_ freelist
Global average rate:    4.00/thr  ( 20.48/sec), total:   2048

This was a really weird result, and maybe there is something about it I do not understand - but it does at least show that with async (longer-lived connections) we are "no worse" than before. If you have any insights on how to test this more effectively, @lkrispen, I'd be keen to hear them.

Hope that helps,

Sorry for the late review. The idea, implementation and benchmark are very nice!
I have two comments:

  • regarding the drop of connection_is_free: at the moment, if a connection that is "refcnt=0 and sd=invalid_socket and !closing" has not gone through connection_table_move_connection_out_of_active_list, it is nevertheless recycled as a free connection. With the patch, such a connection will "leak". Is this case impossible? Should we create a cleaning thread that recycles those connections?

  • the way we go through the CT in connection_table_as_entry (cn=monitor) could be changed to take benefit of the new freelist

Sorry for the late review. The idea, implementation and benchmark are very nice!

Thanks for the review :)

I have two comments:

Regarding the drop of connection_is_free: at the moment, if a connection that is "refcnt=0 and sd=invalid_socket and !closing" has not gone through connection_table_move_connection_out_of_active_list, it is nevertheless recycled as a free connection. With the patch, such a connection will "leak". Is this case impossible? Should we create a cleaning thread that recycles those connections?

No? When we move out of the active list, that's when a connection is "ready to be reused", and that's how it worked before - the conditions we looked for in move-out-of-active were the same ones we used to allocate on the table walk.

So now instead, we just use connection move out of active to define the connection is ready.

So I do not think a leak is possible here,

I think, more likely, the best process is actually to clean this up, because it's really confusing how it's designed and unclear what states connections are in; that would probably help us in code reviews and modifications.

the way we go through the CT in connection_table_as_entry (cn=monitor) could be changed to take benefit of the new freelist

No, because cn=monitor uses the linked list that exists between connections already, so it isn't doing a conntable scan.

@firstyear, I agree that the connection design is a bit fuzzy; this is why I had concerns regarding a potential leak. Would you please add a comment that connection_table_move_connection_out_of_active_list is the only way to add a connection back onto the freelist.

For cn=monitor, we go through the complete connection table, and if the connection is not free we dump the connection info. So if we have a large connection table but few active connections, the loop is a waste of time. Couldn't we go through the freelist [conn_free..conn_next[ ?

Yep, I'll add that comment.

For cn=monitor, there is a linked list of active connections that exists between all the conn structs. It's already walked, so there is no benefit to using the freelist - there is an active list already.

https://pagure.io/389-ds-base/blob/master/f/ldap/servers/slapd/conntable.c#_331

There is also an iterator for it:

https://pagure.io/389-ds-base/blob/master/f/ldap/servers/slapd/conntable.c#_206

But yes, reviewing: https://pagure.io/389-ds-base/blob/master/f/ldap/servers/slapd/conntable.c#_404

I think you are right, this isn't used in cn=monitor.

That said, the freelist doesn't help here - there is an active list we should use instead. But I also think cn=monitor is going to slow us down no matter what, as we take the ct lock.
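
For illustration, a minimal sketch of what such an active-list walk could
look like (ct_for_each_active is a hypothetical helper, not the existing
iterator linked above; it assumes, per the comments in the patch, that
slot 0 of the table anchors the doubly linked active list via c_next, and
that the walk happens under the table lock):

/* Hypothetical helper: visit only the active connections rather than
 * scanning every slot of the table. */
static void
ct_for_each_active(Connection_Table *ct, void (*visit)(Connection *))
{
    PR_Lock(ct->table_mutex);
    /* ct->c[0] is never handed out; it heads the active list. */
    for (Connection *c = ct->c[0].c_next; c != NULL; c = c->c_next) {
        visit(c);
    }
    PR_Unlock(ct->table_mutex);
}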

rebased onto e53b064924000303f1d1981f0bdd2d9fe7a8bce9

4 years ago

@firstyear, do you mind adding the same comment in the description of connection_table_move_connection_out_of_active_list itself? The patch looks good to me. ACK

Regarding cn=monitor not being optimal, I think we should open another ticket, as it is not related to the new freelist.

@tbordaz https://pagure.io/389-ds-base/issue/50906

Comment added, merging now. Thanks again for your detailed review as always!

rebased onto d98699a

4 years ago

Pull-Request has been merged by firstyear

4 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This pull request has been cloned to Github as issue and is available here:
- https://github.com/389ds/389-ds-base/issues/3843

If you want to continue to work on the PR, please navigate to the github issue,
download the patch from the attachments and file a new pull request.

Thank you for your understanding. We apologize for any inconvenience.

Pull-Request has been closed by spichugi

3 years ago