#51072 Improve autotuning defaults (hosts and container)
Closed: wontfix 3 years ago by mreynolds. Opened 3 years ago by firstyear.

Issue Description

During a team discussion, we identified that there is room to improve our autotuning defaults. When these were introduced many years ago, they were set conservatively, based on the tuning guide and on memory issues with freeipa at the time.

However, we have learnt that the CPU autotuning is too aggressive, potentially decreasing throughput due to overhead from context switching and lock contention, and that our memory tuning is not aggressive enough, at only 10% of the system memory. Additionally, in containers we have access to explicit memory limits and reservations, so we can be even more assertive in our selection there.

We should improve these defaults.
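
For discussion, here is a minimal sketch (in Python, not the actual 389-ds autotuning code) of how container-aware defaults could be derived: read the cgroup v2 memory limit when one is set, fall back to physical RAM otherwise, and budget a larger fraction than the historical 10%. The paths, the 25% fraction, and the setting names are assumptions for illustration only.

    # Hypothetical sketch of container-aware autotuning inputs; not the
    # real 389-ds implementation. The 25% fraction is an assumption.
    import os

    def detect_memory_bytes():
        # In a container, cgroup v2 exposes the memory limit here;
        # "max" means no limit was set.
        try:
            with open("/sys/fs/cgroup/memory.max") as f:
                raw = f.read().strip()
                if raw != "max":
                    return int(raw)
        except OSError:
            pass
        # Fall back to physical RAM on hosts/VMs or cgroup v1 systems.
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

    def autotune_defaults(mem_fraction=0.25):
        return {
            "worker-threads": os.cpu_count() or 1,            # one per hw thread
            "cache-budget-bytes": int(detect_memory_bytes() * mem_fraction),
        }

    print(autotune_defaults())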


Metadata Update from @firstyear:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None

3 years ago

Metadata Update from @mreynolds:
- Issue priority set to: major
- Issue set to the milestone: 1.4.2

3 years ago

@mreynolds Did you want this backported to 1.4.2/1.4.3 as well?

Yes please

Will do shortly then, thanks mate!

   08eed33d6..98fe0d678  389-ds-base-1.4.2 -> 389-ds-base-1.4.2
   66cd2a906..f2804282b  389-ds-base-1.4.3 -> 389-ds-base-1.4.3

Metadata Update from @firstyear:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

3 years ago

@firstyear, I would like to reopen this ticket because I would like you to evaluate a minimal value for nsslapd-threadnumber.

A run (with this patch) on my laptop showed 8 hw-threads. The patch configures 8 workers, which gives the best response time and good throughput (searchrate), but I think it is unrealistic. On such a server, if we have more than 8 long-duration operations (unindexed search, MOD, gssapi bind, ...) then basically the server will hang.

Here are the results I got (10 and 50 clients; each line shows throughput and response time):

8 threads:
    10: 22444.007        0.444
    50: 27917.825        1.789

16 threads:
    10:  7477.535        1.335 (varying)
    50: 27147.838        1.840

24 threads:
    10:  8503.164        1.170
    50: 26634.191        1.876

32 threads:
    10:  8449.538        1.179
    50: 26132.673        1.911

Don't you think we should set nsslapd-threadnumber to the number of hw-threads, but with a minimal value, e.g. 24?

Metadata Update from @tbordaz:
- Issue status updated to: Open (was: Closed)

3 years ago

There was a minimum before, and it was 24, and it made things slower ... so there is something else going on here.

gssapi now has the locks that you added for safety, but I think they could be adding contention that's causing stalls. In general, gssapi will always make things slower than using a proper encryption layer like TLS. At the time I even commented about this: https://pagure.io/389-ds-base/pull-request/49973#comment-66070

Unindexed searches will always be slow and resource intensive, so I would expect bad performance there, and this is why Mark added a patch to block unindexed searches (but I think it may not work 100% correctly, as most filters get transformed to include a !tombstone, which is indexed, and would trip up that check). That's why most filters show notes=U and not notes=A when the user component is 100% unindexed, so most people misinterpret that message. Plus our filter architecture doesn't let us have pre-knowledge of what is or isn't indexed, because indexing is a backend config, not a schema-level element.

We know our write performance is poor, and more threads won't change that reality - and in fact, moving to lmdb with only a single write thread, if we don't split write ops to a dedicated thread, we are going to see our performance only get worse, because then we really will have every thread serialising over one write lock. And even with BDB, more threads only increase our chances of deadlocks and lock contention/deadlock breaking.

And then you get libglobs, which isn't transactional and has locks and atomics buried all the way through it, which will cause stalls on systems too.

In your suggestion, if we made it 24 threads instead of 8, we would just be letting 24 threads hang faster than 8. I think your result shows that we have significantly deeper and wider architectural flaws that have built up over time that need addressing through major changes and thought.

The only reason to have "more threads than hardware" is if there is blocking on IO or other tasks that are async relative to the primary CPU. And as I understand our codebase, the kernel and others, I do not believe we are susceptible to this.

So I do not think this should be re-opened - if anything it shows that we need to put our minds elsewhere. We need to rethink our front end, how we route and dispatch operations to workers, how we perform locks, putting poorly behaved libraries like sasl into their own threads, and how we structure our config. We need our server to begin being transactional across all layers of the stack, rather than attempting concurrent safety-as-needed like we have today. The only way to get good parallel performance is to remove all contention on shared resources.

Metadata Update from @firstyear:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

3 years ago

There was a minimum before, and it was 24, and it made things slower ... so there is something else going on here.

I agree the 8-worker setting outperforms 24 workers, but I think it is not realistic to deliver, out of the box, a directory that will become unresponsive if there are more than 8 long-duration operations.

gssapi now has the locks that you added for safety, but I think they could be adding contention that's causing stalls. In general, gssapi will always make things slower than using a proper encryption layer like TLS. At the time I even commented about this: https://pagure.io/389-ds-base/pull-request/49973#comment-66070

This lock only impacts outgoing binds (replica agreements, chaining, ...) and serializes their gssapi authentication. The run above uses incoming simple binds and keeps the connections open (no new binds).

Unindexed searches will always be slow and resource intensive, so I would expect bad performance there, and this is why Mark added a patch to block unindexed searches (but I think it may not work 100% correctly, as most filters get transformed to include a !tombstone, which is indexed, and would trip up that check). That's why most filters show notes=U and not notes=A when the user component is 100% unindexed, so most people misinterpret that message. Plus our filter architecture doesn't let us have pre-knowledge of what is or isn't indexed, because indexing is a backend config, not a schema-level element.
We know our write performance is poor, and more threads won't change that reality - and in fact, moving to lmdb with only a single write thread, if we don't split write ops to a dedicated thread, we are going to see our performance only get worse, because then we really will have every thread serialising over one write lock. And even with BDB, more threads only increase our chances of deadlocks and lock contention/deadlock breaking.

Considering a single backend instance, our writes are already serialized by the backend lock. It is not clear to me whether lmdb will impact (gain or lose) our write throughput or not, but I think it can protect us from db-deadlock/retry.
I agree that our write performance is not that good; this is also why I think 8 workers is a risk.

And then you get libglobs, which isn't transactional and has locks and atomics buried all the way through it, which will cause stalls on systems too.
In your suggestion, if we made it 24 threads instead of 8, we would just be letting 24 threads hang faster than 8. I think your result shows that we have significantly deeper and wider architectural flaws that have built up over time that need addressing through major changes and thought.
The only reason to have "more threads than hardware" is if there is blocking on IO or other tasks that are async relative to the primary CPU. And as I understand our codebase, the kernel and others, I do not believe we are susceptible to this.

My concern is not performance but the responsiveness of the server; in short, not having operations pending for too long in the work queue. Do you think we will get similar responsiveness with 8 as with 24?

So I do not think this should be re-opened - if anything it shows that we need to put our minds elsewhere. We need to rethink our front end, how we route and dispatch operations to workers, how we perform locks, putting poorly behaved libraries like sasl into their own threads, and how we structure our config. We need our server to begin being transactional across all layers of the stack, rather than attempting concurrent safety-as-needed like we have today. The only way to get good parallel performance is to remove all contention on shared resources.

I agree with Thierry on this. I think we might have prematurely pushed this change. Yes it's faster under perfect conditions (no expensive or unindexed searches), but it could make bad situations even worse, and we see these bad conditions all the time with customers (worker thread exhaustion).

We should do more testing with unindexed searches. So on a 4 core machine, run the same ldclt test, but do 4 separate ldapsearches that do a full unindexed search and see what happens. Then run the same test with 8 threads.
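
For reference, a rough driver for that scenario might look like the following sketch. The host, credentials, suffix, and the deliberately unindexed filter are placeholders, and ldclt/searchrate would be run alongside it.

    # Hypothetical helper to occupy worker threads with long, fully
    # unindexed searches while ldclt/searchrate runs in parallel.
    # Connection details and the filter are placeholders.
    import subprocess

    HOST = "ldap://localhost:389"
    BIND_DN = "cn=Directory Manager"
    PASSWORD = "password"
    SUFFIX = "dc=example,dc=com"
    FILTER = "(description=*needle*)"   # assumes 'description' is unindexed

    def spawn_unindexed_searches(count=4):
        procs = []
        for _ in range(count):
            procs.append(subprocess.Popen(
                ["ldapsearch", "-x", "-H", HOST, "-D", BIND_DN, "-w", PASSWORD,
                 "-b", SUFFIX, FILTER, "dn"],
                stdout=subprocess.DEVNULL))
        return procs

    # Each long-running search ties up one worker for its whole duration.
    for p in spawn_unindexed_searches(4):
        p.wait()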

Another option that needs testing is creating worker threads on demand to handle temporary workload increases, and then removing these threads once they are no longer needed. I was thinking we could just have a backup pool of threads (5 threads?) that we can grab from when needed and add to the work queue handling - that would save us the worker thread initialization. But I am getting ahead of myself; first we need to do more testing where worker threads are being exhausted by expensive operations.

I agree the 8-worker setting outperforms 24 workers, but I think it is not realistic to deliver, out of the box, a directory that will become unresponsive if there are more than 8 long-duration operations.

And the same will happen with 24 long operations, except now the 24 operations will contend over more resources, causing all the operations to take longer and losing you throughput.

Look at the numbers:

https://pagure.io/389-ds-base/pull-request/51073#comment-118899

The client with 1 thread vs 24 is within 5% of the performance, but at the top end with 10 clients it's more than double. This is because overuse of threads causes all operations to take longer, and locks are not a linear time cost.

This lock only impacts outgoing binds (replica agreements, chaining, ...) and serializes their gssapi authentication. The run above uses incoming simple binds and keeps the connections open (no new binds).

And a lot of operations in freeipa are short, single operations, and all use gssapi, so they'll all serialise around it. It's basically why I switch off when freeipa asks for more performance, because they aren't actually willing to listen about what needs to be done and what defects exist.

Considering a single backend instance, our writes are already serialized by the backend lock. It is not clear to me whether lmdb will impact (gain or lose) our write throughput or not, but I think it can protect us from db-deadlock/retry.
I agree that our write performance is not that good; this is also why I think 8 workers is a risk.

Because regardless of whether it's 8 or 24, we don't split write operations to their own threads, meaning writers can block readers. We are talking about thousands of clients vs 1/10th of that number of threads. Bumping that number from 8 to 24 isn't going to stop anything when you have hundreds of write operations. If anything you'll just make it worse, because again, you now have overuse of CPU resources, so every op takes longer, and we will bog down worse.

The fix is to make writes happen in separate threads from reads. And we have to do this for lmdb anyway.
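
As a toy illustration of that split (a sketch only, not 389-ds code): reads fan out across a bounded pool while every write funnels through one dedicated writer thread, so a slow write never occupies a read worker.

    # Toy sketch of separating read and write operations into different
    # thread pools; illustrative only, not the server's actual design.
    import queue
    import threading
    from concurrent.futures import ThreadPoolExecutor

    read_pool = ThreadPoolExecutor(max_workers=8)   # read workers
    write_queue = queue.Queue()                     # all writes funnel here

    def writer_loop():
        # Single writer: writes serialize here instead of tying up read
        # workers while they wait on the backend lock.
        while True:
            op = write_queue.get()
            if op is None:
                break
            op()                    # apply the modification
            write_queue.task_done()

    threading.Thread(target=writer_loop, daemon=True).start()

    def dispatch(op, is_write):
        if is_write:
            write_queue.put(op)     # queued for the dedicated writer
        else:
            read_pool.submit(op)    # reads run concurrently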

My concern is not performance but the responsiveness of the server; in short, not having operations pending for too long in the work queue. Do you think we will get similar responsiveness with 8 as with 24?

The numbers shown by vashirov show that with 8 threads we can put through more than double the number of searches within a time frame, which means each search is taking half as much time, which means responsiveness has been shown to improve with this change.

Remember, these tests are not just a test of "max number of operations throughput": the throughput is measured in a fixed time window, so for more operations to have completed, each operation must be taking less time.

So yes, 8 threads means we get better responsiveness and better throughput. And regardless of whether we have 8 threads, 24 or even 100, over-allocation of CPU resources will always hurt us, and our server's architecture today (write and read ops on the same thread pool, mutexes littered everywhere without care or thought for their thread interactions, and no proper transactional architecture) means that clients can trivially slow down the entire server regardless of how many threads we have.

We should do more testing with unindexed searches. So on a 4 core machine, run the same ldclt test, but do 4 separate ldapsearches that do a full unindexed search and see what happens. Then run the same test with 8 threads.

8 threads, 4 threads on unindexed? What will that prove? Those are hugely IO intensive operations, they shouldn't be occurring, and they shouldn't be allowed (you even made a patch to prevent unindexed ops). We should not optimise for a bad case. We should prevent the bad case from being able to occur. If anything, if they are still happening we need to re-examine the unindexed operation patch, because the internal filter transform to exclude tombstones adds an indexed component, so the filter shows as partially unindexed and is allowed to proceed.

Another option that needs testing is creating worker threads on demand to handle temporary workload increases, and then removing these threads once they are no longer needed. I was thinking we could just have a backup pool of threads (5 threads?) that we can grab from when needed and add to the work queue handling - that would save us the worker thread initialization. But I am getting ahead of myself; first we need to do more testing where worker threads are being exhausted by expensive operations.

We absolutely shouldn't do this, because again we come back to overuse of CPU resources as they contend on mutexes and more, and it creates a situation where every thread not only slows down, but slows down in a way that's not linear. 2 threads on 2 cpus may get you 100 ops, but 4 threads on 2 cpus may get you 80, and all those ops take longer to deliver. Plus managing a thread pool and dynamically creating and removing those threads is also a cost, and it's a complexity that detracts from the real problems. And the real problem is that we have extremely poor server management of threaded resources like mutexes, because our architecture is not properly transactional.

No quick fixes or magic bullets here; we need to roll up our sleeves and do some hard work to clean up our concurrency story, else we're never going to get out of this. We can't keep throwing mutexes everywhere; we have to make this properly concurrently readable, and we have to re-architect and rethink large parts of the code base, from plugins, to libglobs, to how we dispatch operations to threads, and more.

We should do more testing with unindexed searches. So on a 4 core machine, run the same ldclt test, but do 4 separate ldapsearches that do a full unindexed search and see what happens. Then run the same test with 8 threads.

8 threads, 4 threads on unindexed? What will that prove? Those are hugely IO intensive operations, they shouldn't be occurring, and they shouldn't be allowed (you even made a patch to prevent unindexed ops).

But this is the reality: our customers constantly run into this. We cannot force our customers to be perfect. I hear what you're saying, but saying "well they shouldn't be doing unindexed searches in the first place" is a ridiculous response. All I want is more testing of the scenarios our customers run into, as I'm worried this could cause more harm than good. I don't know how many DS customers SUSE has, but ours do things they are not supposed to. It happens all the time, and we at Red Hat have to support/fix this on a daily basis. If this change triggers more customer cases, then it is a bad "fix".

All I want is more testing before we release this change to the public. Not sure why doing more testing, so we are more confident in this change, is a bad thing. No one is saying this patch is definitely bad, but we need to test it more. We ran a few tests and saw great results, but it was not a real-world type of test. So I feel it's perfectly reasonable to test this more before I put it into any release (upstream or downstream).

Then I look forward to seeing the tests with unindexed searches, to see how they look.

I just want to make sure that it's not easier to create a DoS when using fewer worker threads. It's not just about unindexed searches, but expensive searches, like the searches that IPA, schema-compat, and slapi-nis do. This kind of reminds me of how the cache autotuning currently works. We don't fully optimize the caches because we can't assume how the system resources are being used. You can say the same about the thread number: we can't fully optimize it because we don't know how the server is actually being used.

I wonder if it would make sense to have a "level of autotuning", like conservative, normal, and aggressive. If "aggressive", we set the thread number to the number of cores and set caches to use 75% of available memory. Something like that. Again, I'm just throwing out ideas - no need to bash them yet. But first, let us do more testing on this patch this week, and then we'll all analyse the results and go from there...
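
To make that idea concrete, the levels might map to something like the sketch below; the fractions, floors, and the cache setting name are invented for discussion, not agreed values.

    # Hypothetical mapping of autotuning levels to thread and cache
    # choices; the numbers are illustrative only.
    import os

    TUNING_LEVELS = {
        #              (cache memory fraction, minimum worker threads)
        "conservative": (0.10, 16),
        "normal":       (0.25, 16),
        "aggressive":   (0.75, 0),   # aggressive: threads == cores
    }

    def tune(level="normal"):
        mem_fraction, min_threads = TUNING_LEVELS[level]
        cores = os.cpu_count() or 1
        return {
            "nsslapd-threadnumber": max(min_threads, cores),
            "cache-memory-fraction": mem_fraction,
        }

    print(tune("aggressive"))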

Okay, I got some numbers that would indicate it's better to have some extra threads. I ran this test on a 4 CPU machine, using searchrate and modrate with 4 client threads in all the tests.

4 Worker Threads

searchrate alone (4 client threads):

  • 6700 - 8100 searches/second

4 searchrate clients and 4 clients issuing fully unindexed searches:

  • 20 - 26 searches/second (99% performance drop)

4 searchrate clients and 4 modrate threads:

  • 1000 - 2000 searches/second (75% performance drop)

8 Worker Threads

searchrate alone (4 client threads):

  • 5000 - 7100 searches/second

4 searchrate clients and 4 clients issuing fully unindexed searches:

  • 4800 - 4900 searches/second (28% performance drop)

4 searchrate clients and 4 modrate threads:

  • 5000 - 7100 searches/second (no performance drop)

So I propose we set a minimum thread number of 16.
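
Expressed as a sketch (not the actual patch), that proposal amounts to flooring the autotuned value:

    # Sketch of the proposed floor; not the actual dc7bf4a change.
    import os

    MIN_WORKER_THREADS = 16

    def autotune_threadnumber():
        # One worker per hardware thread, but never fewer than the floor,
        # so a handful of long-running operations cannot exhaust the pool
        # on small machines.
        return max(MIN_WORKER_THREADS, os.cpu_count() or MIN_WORKER_THREADS)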

Metadata Update from @mreynolds:
- Assignee reset
- Issue status updated to: Open (was: Closed)
- Issue tagged with: Performance

3 years ago

Metadata Update from @mreynolds:
- Issue assigned to mreynolds

3 years ago

Commit dc7bf4a relates to this ticket

Set a minimum thread number:

Commit dc7bf4a relates to this ticket

a87fc40..796ca36 389-ds-base-1.4.3 -> 389-ds-base-1.4.3

f5d5fd8..0b9d9e9 389-ds-base-1.4.2 -> 389-ds-base-1.4.2

Metadata Update from @mreynolds:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

3 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/4125

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

3 years ago

