Issue #49330: Performance issues with RHDS 10 - NDN cache investigation. - 389-ds-base

https://pagure.io/389-ds-base/issue/49280 related issue

Metadata Update from @firstyear:
- Custom field type adjusted to defect

6 years ago

This is how I generated the DB for the test. Please note this WILL create a 2.8GB ldif in your directory where you call it.

Metadata Update from @firstyear:
- Issue priority set to: critical
- Issue set to the milestone: 1.3.5.14 (was: 0.0 NEEDS_TRIAGE)
- Issue tagged with: Complex, Investigate, Performance

6 years ago

Metadata Update from @firstyear:
- Custom field component adjusted to None
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to review
- Custom field version adjusted to None

6 years ago

On the workload from gen_db, during a load test this patch provides a 47x improvement to performance. Searches now have a consistent average time of 5 seconds rather that a wild distrubition from 0.001 to 88 seconds.

I will follow up with another investigation to discover the contention related to the "last 5 seconds" as a single thread search on this work load results in a search of 0.001 second.

Improved the performance of this patch.

Resolve compiler warns.

Great improvement and implementation, really few comments about it.

First a general concern. To be honest all feedback I heard from current NDN are quite positive with significant performance improvement when enabled/tuned. Even if distribution and lru was perfectible it was not reported by customers. So IMHO the main issue was the contention around the global lock.
The new design splits the NDN cache, each worker thread having its own ndn cache. The advantage is to fix this contention but on larger memory footprint (same raw dn/ndn can be cached in several caches) and unpredictable benefit as a request may not benefit from the amount of normalization done by a previous one. An other advantage is to be NUMA friendly but the server itself remains NUMA not friendly, so it is a long term advantage that is not a key element IMO.
I preferred keeping a single cache shared between all threads, splitting the big lock with a lock per hash slot. Now I have not a strong opinion on this as performance are really great with multiple NDN caches and I am not sure they would be better splitting the big lock.

More specific on the patch, it is really nice code congratulations. Few comments

in ndn_cache_lookup:
lookup is done with strncmp, why not strcmp.
For example if cached value is 'cn=A, cn=B, cn=C' --> 'cn=a,cn=b,cn=c'
lookup of 'cn=A, cn=B' is it going to return the cached value ?
in ndn_cache_add
in case of collision, the new_value is added at the end of the child list. Why ?
wouldn't it be faster to store it in first place ?
I would love some comments on head/tail struct definitions of the LRU explaining how it is working ;)

I think Thierry already raised some valid concerns. For me the idea to have a cache per thread seems a great idea to avoid the locking problems. The potential problems are that each cache is independent and it will take some time until it is filled. The other problem is that the memory requirements for large databases and many threads could become excessive, having an option of a shared cache (with a fine grained locking as suggested by Thierry) would be nice, but can be added later - so in general I vote for the suggested design.

I have not fully evaluated the code, i think Thierry's comment on strcmp are correct, but could not help to think of further optimizations.
Right now the cache lookup for a dn is all or nothing, but all dns contain common parts, so if
"cn=xx,ou=yy,ou=dd,dc=com1,dc=com2,dc=com3,dc=com4" is not found in the cache it will be normalized, although very likely "ou=yy,ou=dd,dc=com1,dc=com2,dc=com3,dc=com4" is in the cache or should be in the cache. we could make dn normalization kind of recursive, breaking at the state B4TYPE

in ndn_cache_lookup:
lookup is done with strncmp, why not strcmp.
For example if cached value is 'cn=A, cn=B, cn=C' --> 'cn=a,cn=b,cn=c'
lookup of 'cn=A, cn=B' is it going to return the cached value ?

Because we don't know the length of the incoming dn. It's "paranoia" to check with strncmp, Saying this, in theory we should be able to trust the dn as we already processed it partially. Is there a reason not to use strncmp here I think is a better question to you?

in ndn_cache_add
in case of collision, the new_value is added at the end of the child list. Why ?
wouldn't it be faster to store it in first place ?

I would love some comments on head/tail struct definitions of the LRU explaining how it is working ;)

These two questions go together I think. So let me explain the internal structure of how this works.

siphash13 provides a uint64_t hash of the incoming data. This means that in theory, the hash is anything between 0 -> 0xFFFFFFFFFFFFFFFF. The result is distributed evenly over the set of numbers, so we can assume that given some infinite set of inputs, we'll have even distribution of all values across 0 -> 0xFFFFFFFFFFFFFFF. In mathematics, given a set of integers, when you modulo them by a factor of the input, the result retains the same distribution properties. IE, if you modulo the numbers say 0 - 32 and they are evenly distributed, when do 0 -> 32 % 16, you will still see even distribution (but duplicates now).

So we use this property - we scale the number of slots by powers of 2 so that we can do hash % slots and retain even distribution. IE:

(0 -> 0xFFFFFFFFFFFFFFFF) % slots ....

This means we get even distribution: but it creates a scenario where we are more likely to cause a collision as a result.

Anyway, so lets imagine our small hashtable:

*t_cache
-------------------------
|  0  |  1  |  2  |  3  |
-------------------------

So any incoming DN will be put through:

(0 -> 0xFFFFFFFFFFFFFFFF) % 4 = slot

So lets say we add the values a, b, c (in that order)

*t_cache
head: A
tail: C
-------------------------
|  0  |  1  |  2  |  3  |
-------------------------
-------------------------
|  A  |  B  |  C  |     |
|n:   |n: A |n: B |     |
|p: B |p: C |p:   |     |
-------------------------

Because C was accessed (rather inserted) last, it's at the "tail". This is the most recently used element. Because A was inserted first, it's the least used. So lets do a look up on A:

*t_cache
head: B
tail: A
-------------------------
|  0  |  1  |  2  |  3  |
-------------------------
-------------------------
|  A  |  B  |  C  |     |
|n: C |n:   |n: B |     |
|p:   |p: C |p: A |     |
-------------------------

Because we did a lookup on A, we cut it out from the list and moved it to the tail. This has pushed B to the head.

Now lets say we need to do a removal to fit "D". We would remove elements from the head until we have space. Here we just need to trim B.

*t_cache
head: C
tail: D
-------------------------
|  0  |  1  |  2  |  3  |
-------------------------
-------------------------
|  A  |     |  C  |  D  |
|n: C |     |n: B |n: A |
|p: D |     |p: A |p:   |
-------------------------

So we have removed B, and inserted D to the tail. Again, because A was more recently read than C, C is now at the head. This process continues until the heat death of the universe, or the server stops, what ever comes first (we write very stable code here ;)

Now, lets discuss the "child".

Because of:

(0 -> 0xFFFFFFFFFFFFFFFF) % 4 = slot

The smaller the table, the more likely a hash collision is to occur (and in theory even at a full table size they could still occur ...).

So when you have a small table like this one I originally did not have the ability to have multiple values per bucket, and just let the evictions take place. I did some tests and found in this case, the LL behaviours were faster than repeated evictions.

So lets say in our table we add another value, and it conflicts with the hash of C:

*t_cache
head: C
tail: E
-------------------------
|  0  |  1  |  2  |  3  |
-------------------------
-------------------------
|  A  |     |  C  |  D  |
|n: C |     |n: B |n: A |
|p: D |     |p: A |p: E |
-------------------------
            |  E  |
            |n: D |
            |p:   |
            -------

Now when we do a look up of "E" we'll collide on bucket 2, and then descend down til we exhaust, or find our element. If we were to remove C, we would just promote E to be the head of that slot.

Again, I did test both with and without this - with was much faster, and relies on how even our hash distribution is and that generally with small table sizes we have small capacity, so we evict some values and keep these chains short.

Right now the cache lookup for a dn is all or nothing, but all dns contain common parts, so if
"cn=xx,ou=yy,ou=dd,dc=com1,dc=com2,dc=com3,dc=com4" is not found in the cache it will be
normalized, although very likely "ou=yy,ou=dd,dc=com1,dc=com2,dc=com3,dc=com4" is in the
cache or should be in the cache. we could make dn normalization kind of recursive, breaking at
the state B4TYPE

There are two things about this comment. First, I totally agree. You want a prefix/radix tree rather than a hash - and I agree. It happens that I'm writing one in my free time at the moment but it's not ready yet. We needed a solution today, and this simple hash is it. In the future I would love to make this a prefix tree, because even though we search based on the str content, we already have to read the whole string (and hash it) so the lookup down a prefix path may be faster. Wait and see what my benchmarks show soon. Benefits of this are no collisions, no hash, expands dynamically (so online resize is possible), and we can do partial matches like you suggest here.

Second thing though, is that similar to idlist - while I would love to spend the next week finishing the perfect tree for this, we have a user who needs this today. So I think we can go ahead with what we have now, and in the future I promise that I'll revisit this. After all, I'm a datastructures nerd, so I'm sure I'll remember to come back here and perfect it :)

Hope that helps,

PS: I might transplant this comment into the dn.c file to explain it.

I think Thierry already raised some valid concerns. For me the idea to have a cache per thread seems a great idea to avoid the locking problems. The potential problems are that each cache is independent and it will take some time until it is filled. The other problem is that the memory requirements for large databases and many threads could become excessive, having an option of a shared cache (with a fine grained locking as suggested by Thierry) would be nice, but can be added later - so in general I vote for the suggested design.

I missed answering these points:

The time to fill - each thread is doing separate operations, so really, they only need the NDN's local to them-self - You may have an op where thread A and B both normalise the same DN, but the price of the normalisation is lower than the cost of the mutex, so even though they both have to do the work, they still end up overall faster - This is why we gained 10% on single thread performance despite this. We also tend to see DS run for a long time, rather that frequent restarts, so this initial cost is low in contrast to the whole system.

The memory requirements. In theory, yes, this could get large. But there are something things that help here - it's not all doom and gloom.

First, we generally use an NDN a few times within an operation, so so long as the NDN is cached for at least a few operations before eviction, we still come out ahead. Contrast this to NDN being disabled and normalising each access. tl;dr, we need enough ndn cache space to fit all the ndns that are in a single operation, not all the ndns for the whole server :)

Second, the memory requirements are not as large as one might think. Even to cache 32768 NDN's at 168 bytes per NDN (which is a huge NDN, normally they are actually in the region of 40ish chars long so 100 bytes with all the hash table overheads per NDN). Even still, this only requires 5MB per thread. Even on my laptop, the concept of giving up 120MB (5MB, 24 threads) for this, is not a huge cost. If your dataset is in the millions you are looking at about 160ish MB per thread, but at that size, I think you have put a lot of ram in your machines.

Finally, tie that back to our first point, and even if you set the NDN cache to be "smaller", you'll still come out ahead because of the lack of locking, and because of the operations re-use.

IF we were to come back to this design to make it shared again I would propose a two level cache. We have a small per-thread cache, and we have the shared cache with multiple lock regions as suggested. This is very similar to a CPU with the L1/L2 design, and I think if we were to go down this path, would be the best course of action. However, this would add complexity in that we would need to copy the NDNs from L2 into L1 for safety. We would also have a higher cost on a cache miss - because we go back to locking rather that just re-normalising. I think the benefit to this model is a memory saving at the expense that a cache miss has a higher cost. This per-thread design uses more memory to remove the cost of locking during a miss.

    in ndn_cache_lookup:
    lookup is done with strncmp, why not strcmp.
    For example if cached value is 'cn=A, cn=B, cn=C' --> 'cn=a,cn=b,cn=c'
    lookup of 'cn=A, cn=B' is it going to return the cached value ?

 Because we don't know the length of the incoming dn. It's "paranoia" to check with strncmp,     
 Saying  this, in theory we should be able to trust the dn as we already processed it partially. Is
 there a reason not to use strncmp here I think is a better question to you?

strcmp will stop at the size of the smaller string, if you pass the len of on string to strncmp you don't gain anything, but could get incorrect results since you want to compare if both strings ar equal not if one is a substr of the other.
In Thierry's example, if you want to compare "cn=A, cn=B" to "cn=A, cn=B, cn=C" and pass the length of ( "cn=A, cn=B") to strncmp the strings are equal although they are not.
Now, this situation is very unlikely because of the dns themselves and because both strings would have to hash to the same slot, but it is not impssible and I don't see a need to use strncmp

Edited 6 years ago by lkrispen

But they can hash to the same thing: See the reason we have the child buckets. Once we look up with the hash we need to check that we really do have the right element from the slot.

Strncmp also stops on the length of the smaller string too, but it just won't exceed the length of the dn we have been provided.

Saying this, I think here given we have a dn len, we could use strcmp here - it was more "good behaviour" than anything.

If you want, I can change that to strcmp, but we doo need to keep that check :)

Strncmp also stops on the length of the smaller string too, but it just won't exceed the length of the dn we have been provided.

but this is the problem. If you stop at the kength provided and the strings match so far you get a false positive

Ummm ... really? I'm pretty sure that if you hit size_t n, that you get a non-zero return, because they aren't fully matched.

It looks like this test shows an interesting result:

#include <string.h>
#include <stdio.h>

int
main(int argc, char **argv) {
    char *a = "test a";
    char *b = "test a super";
    printf("%d", strcmp(a, b));
    return 0;
}

Prints 0.

For strncmp:

#include <string.h>
#include <stdio.h>

int
main(int argc, char **argv) {
    char *a = "test a";
    char *b = "test b";
    printf("%d", strbcmp(a, b, 4));
    return 0;
}

This also yields 0.

So actually, we need to check LENGTH and CONTENT to be sure!

I'll update the patch to take account of this.

PS: Thanks C, you are a garbage fire language ;)

Regarding add new_value at the end of the child list
First thanks for your very detailed explanations. It correspond to my understanding but I still miss your conclusion ;)

When a collision occurs, the fix adds the new_value at the end of the collision list. Justification is that we may need remove previous entries and promote the new added one. But in the code, when we add the new_value, we are sure we do not need to remove previous entries because it was already done few lines above (call of ndn_thread_cache_value_destroy). My understanding is that LRU (next/prev) is ordered list but child is just a set of collision (unordered list) so adding it at the beginning is equivalent to add it at the end. Am I missing something ?

static void
ndn_cache_add(char *dn, size_t dn_len, char *ndn, size_t ndn_len)
{
....
    /*
     * Okay, check if we have space, else we need to trim nodes from
     * the LRU
     */
    while (t_cache->head && (t_cache->size + new_value->size) > t_cache->max_size) {
        struct ndn_cache_value *trim_node = t_cache->head;
        ndn_thread_cache_value_destroy(t_cache, trim_node);
    }

    /*
     * Add it!
     */
    if (t_cache->table[insert_slot] == NULL) {
        t_cache->table[insert_slot] = new_value;
    } else {
        /*
         * Hash collision! We need to replace the bucket then ....
         */
        struct ndn_cache_value *parent_node = t_cache->table[insert_slot];
        while (parent_node->child != NULL) {
            parent_node = parent_node->child;
        }
        parent_node->child = new_value;
    }
....

Actuall I'm wrong. I didn't recompile inside my tests. strcmp DOES work correctly. I'll change the patch to use this then, because both @tbordaz and @lkrispen are right here that it affects the outcome.

@tbordaz Ithink you get it. When we have LRU, you have an ordered list like:

A -> B -> C -> D ....

But in a bucket the results are unordered so you may have to check up to O(N) nodes within a slot.

The is the exact same behaviour as the NSPR hashmap though, but we don't suffer the same lack-of-slots issue as they did. :)

I think that to avoid this we would need a prefix tree, but I haven't finished it yet, so this improvement will have to do for now. I'll revisit this in the future.

You may have an op where thread A and B both normalise the same DN, but the price of the normalisation is lower than the cost of the mutex

I tend to disagree on this. Normalization requires a big syntax lock that can be contention issue. In addition the cost of Normalization is significant, callback to many syntax plugins for each DN components. This is the reason of my concern regarding per thread cache because same DN can be normalized several times although it is present in some caches.

so long as the NDN is cached for at least a few operations before eviction, we still come out ahead.

Unless I misunderstand your sentence, this is a drawback of per thread cache: when running few operations, they will not benefit from the normalization done by the preceding operations unless they are processed by the same worker thread.

Regarding the L1/L2 cache, in a first approach it looks quite complex to me with a risk how to enforce coherency between them.

This being said, the tests you ran are showing very great performance gain. There is no evidence that a shared cache (with per slot lock) would give equivalent or better performance. So I would prefer to keep your implementation that is proved to be efficient and scale.

I do not know how you did your test, but I get different results (in your ex you use strbcmp which doesn't exist):

more cmp.c

  #include <string.h> 
  #include <stdio.h>

int
main(int argc, char **argv) {
char *a = "test a";
char *b = "test a super";
printf("strcmp: %d\n", strcmp(a, b));
printf("strncmp: %d\n", strncmp(a, b, strlen(a)));
return 0;
}

gives:
strcmp: -32
strncmp: 0

and that is what I expected

Edited 6 years ago by lkrispen

But in a bucket the results are unordered so you may have to check up to O(N) nodes within a slot.

I do not get it. When ndn_cache_add is called we know that NDN is not present in that slot. I addition the iteration is not checking the DN, it loops so that it finds the last node to add the new value.

    struct ndn_cache_value *parent_node = t_cache->table[insert_slot];
    while (parent_node->child != NULL) {
        parent_node = parent_node->child;
    }
    parent_node->child = new_value;

@lkrispen I forgot to recompile ;_; when I fixed it, I got your result. I'll change this to strcmp.

@tbordaz Ahhhh I see what you are saying. You are suggesting to put it at the head of the slot rather than the tail. I can do this. You are correct this would be quicker.

Corrected patch with both comments :)

The patch looks good, you have my ACK , likely @lkrispen will also comment it.

Just a minor point.
You may want to update your large comment (starting with "Hashing function using Bernstein...") so that it reflects that when there is a collision while adding a new_value, the new value is added at the top of the child list.

Comment updated, will standby for @lkrispen 's last comments or ack :)

there is one comment: in the update of the lru you want to put the node you found to the tail, but you do not care about head. If the node is the head you move it to the tail and then head==tail

Which part of the patch is this @lkrispen ? If you access the node you move to the tail. we only ever directly manipulate head if head == NULL, which means that either an error occured, you have have tail == NULL as well because the set is empty.

We only ever need to update tail, because all manipulations of head are from destroying evicted nodes - note

 473 +static void
 474 +ndn_thread_cache_value_destroy(struct ndn_cache *t_cache, struct ndn_cache_value *v) {
 475 +    /* Update stats */
 476 +    t_cache->size = t_cache->size - v->size;
 477 +    t_cache->count--;
 478 +    t_cache->evicts++;
 479 +
 480 +    if (v == t_cache->head) {
 481 +        t_cache->head = v->prev;
 482 +    }
 483 +    if (v == t_cache->tail) {
 484 +        t_cache->tail = v->next;
 485 +    }

It's on eviction that we update the head value, we don't need to manipulate it at any other time.

I am referring to the part following:

 758 +  if (strcmp(dn, node->dn) == 0) {
 759 +                /* We found it! */
 760 +                /* Update LRU */

you handle tail, but not head

Ahh indeed. We update the other pointers correctly, but if we are head, we don't update it. Great spoting!

Fix for the head update that @lkrispen noticed.

I've applied these successive changes to the 1.3.5 and 1.3.6 patches btw. I just haven't submitted them for review because they are effectively the same

what about the case where we only have one node, then head == tail == node. but then you set head to node->prev (which is NULL)

Indeed, this would mean that this node could be exempted from checks. One day I'll learn to implement algorithms ;)

Here is a proposal to change the lru update

Apart from managing the LRU I the patch looks good

Thanks @lkrispen : I was on an airplane yesterday so didn't write this. I'll include your patch and give this a test. Thanks heaps for your careful review, I really appreciate it.

This is a tweak on your suggestion @lkrispen

This is a tweak on your suggestion @lkrispen

but you made it unclear again, at least for me, I first thought it is wrong. If the node == head,
we already have set head to node->prev and head->next = NULL and since the node is/was head we know node->next == NULL.

so the assignement
node->prev->next = node->next;
is alrady done for the "is head case" and only required for th ein the middle case

This is the code you refer to:

                    if (t_cache->head == node) {
                        /* We are the head, update head to our predecessor */
                        t_cache->head = node->prev;
                        /* Remember, the head has no next. */
                        t_cache->head->next = NULL;
                    } else {
                        /* Right, we aren't the head, so we have a next node. */
                        node->next->prev = node->prev;
                    }

Because we change head to node->prev, when we then call head->next = NULL, it's equivalent to node->prev->next = NULL.

It's in the non-head case that we need to change our next nodes previous to our previous. IE:

A -> B -> C

This would change C->prev to A (since B->prev == A). This is only done in the middle case.

I think the code here is correct still

commit 2748238
To ssh://git@pagure.io/389-ds-base.git
2196614..708938e master -> master

commit 99d39c4
To ssh://git@pagure.io/389-ds-base.git
76feb34..99d39c4 389-ds-base-1.3.6 -> 389-ds-base-1.3.6

Metadata Update from @firstyear:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

6 years ago