#49021 Autotune threads at startup
Closed: wontfix. Opened 7 years ago by firstyear.

We hardcode a default threadnumber of 30. This is a pretty conservative default, but it's not great.

Modern hardware is much larger than this, and people are not tuning their systems to match (see also FreeIPA). We should be doing this automatically.

We should make two changes.

First, we should change the default thread count from 30 to 0 (or -1).

When 0 or -1 is set, we should take:

  • number of hardware threads * factor

as the thread count. That factor may change and should be tested. We have previously advised factor == 2, but with modern CPUs it may be better to raise this to 3 or 4.

This would be a pretty simple change code-wise, but it would take some time and testing on a few different hardware platforms to get it right.


Sorry about my limited knowledge of the system...

number of hardware threads = number of cores?

If 30 is conservative, what would be the most appropriate number, approximately?

Thanks.

So, on a modern Core i7 Xeon you will have, say, 12 cores but 24 threads. We would use the number of threads multiplied by a factor as the default.

So assuming factor of 4,

My laptop: dual core i7, so 4 threads * 4 = 16 DS threads
My dl260: dual socket dual core xeon, so 2 sockets, each socket has 2 cores with 4 threads, i.e. 8 threads total. 8 * 4 = 32 DS threads on my dl360
Old workplace DS servers were dual socket 12 core xeons, so 24 threads per socket, 48 threads total, * 4 = 192 DS threads.

Does that help? We would be changing the number in response to the hardware. As we move forward with Nunc Stans, this will also help simplify thread management.

Then,
number of hardware threads = number of cores * 2
or
number of hardware threads = number of sockets * number of cores * 2?

Could you tell me how to get the value in the program?

On your own system, you can run 'lscpu' to see.

In C, we would use sysconf(_SC_NPROCESSORS_ONLN); to determine the online number of cores. We would then configure slapiFrontendConfig->maxthreads from that * factor.

Code-wise, the change will be small, but it would be good to autotune out of the box, and good to make this the default. It would take some time, however, to determine the best number of threads to use by default (2, 3, 4, 5, ... per hardware thread).
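To make that concrete, here is a minimal illustrative sketch of the calculation, assuming a simple hardware-threads-times-factor rule with a floor; the names AUTOTUNE_FACTOR, MIN_THREADS and autotune_threadnumber are made up for the example, not identifiers from slapd:

{{{
#include <unistd.h>

/* Illustrative sketch: derive a default worker thread count from the number
 * of online hardware threads multiplied by a tunable factor, with a floor so
 * very small systems still get a usable pool. */
#define AUTOTUNE_FACTOR 4
#define MIN_THREADS 16

static int
autotune_threadnumber(void)
{
    long hw = sysconf(_SC_NPROCESSORS_ONLN); /* online hardware threads */
    long threads = (hw > 0 ? hw : 1) * AUTOTUNE_FACTOR;

    if (threads < MIN_THREADS) {
        threads = MIN_THREADS;
    }
    return (int)threads;
}
}}}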

Replying to [comment:4 firstyear]:

In C, we would use sysconf(_SC_NPROCESSORS_ONLN); to determine the online number of cores. We would then configure slapiFrontendConfig->maxthreads from that * factor.

Thanks!

Code-wise, the change will be small, but it would be good to autotune out of the box, and good to make this the default. It would take some time, however, to determine the best number of threads to use by default (2, 3, 4, 5, ... per hardware thread).

It makes sense. Sounds good to me. Probably we should set a minimum thread number, though? Also, what happens on a VM? When you configure it, you are asked for the CPU count. Does that count become the number of hardware threads? And is it retrieved in C in the same manner?

On a VM, the configured CPU count shows up in the same way, so we would use it the same way.

I think that as for a "minimum", the real question is how many threads we should use on a single core system, a dual core, a quad core, and so on. We may be able to have some special cases for single and dual core, i.e. 16 threads is the minimum, and then on larger systems it starts to scale up.

I thought you meant the default thread number of 30 was "conservative" even for a single core system. That gave me the impression that if the calculated value is less than 30, you would consider it too conservative.

We could decide it later once we get some performance numbers, of course.

Thanks for the confirmation about the VM. If we can use the same code across various systems, I'm happy with it.

Per the weekly meeting, we need more discussion on the autotuning overall.

Setting the ticket to FUTURE for now.

So here is an initial implementation.

The algorithm for picking threads does the following:

{{{
Hardware threads -> DS threads.
1 -> 16
2 -> 16
4 -> 24
8 -> 32
16 -> 48
32 -> 64
64 -> 96
128 -> 192
256 -> 384
512 -> 512
1024 -> 512
2048 -> 512
}}}

This has a cap of 512 threads at the moment for really large systems. We increase the threads rapidly early on, but then taper off as the number of hardware threads increases.

Automatic tuning is selected by setting nsslapd-threadnumber to -1, which is the default in libglobs.c/slap.h. The value can be "reset" by a mod/delete, and will revert to automatic tuning.

When you search the config, the number of threads is shown even when it comes from autoconfiguration:

{{{
dn: cn=config
nsslapd-threadnumber: 24
}}}

Setting the threadnumber value in dse.ldif overrides the automatic tuning.

This also includes a first stage of conservatively increasing the cache sizes of the database. We change:

{{{
dbcachesize 10MB -> 32MB
entrycachesize 10MB -> 32MB
dncachesize 10MB -> 16MB
}}}

In benchmarking on my laptop (I plan to run more on some larger machines soon) with 100,000 objects:

{{{
No change:
ldclt[12956]: Global average rate: 2654.23/thr (796.27/sec), total: 79627
Auto thread and memory change:
ldclt[15030]: Global average rate: 2703.20/thr (810.96/sec), total: 81096
Auto thread with 128Mb of cache:
ldclt[13244]: Global average rate: 2736.03/thr (820.81/sec), total: 82081
}}}

I think we'd better open another ticket for the part "We increase the default cache memory sizes to 32MB of db and entry cache, and 16MB of dn cache", since it is not related to "Autotune threads"?

If you search nsslapd-threadnumber, it returns the calculated value, right? And is it written to dse.ldif?

If so, does this eval happen just once, with the calculated value set to nsslapd-threadnumber and used from then on?
{{{
threadnum = util_get_hardware_threads();
}}}

Maybe, even if nsslapd-threadnumber is not -1, you could calculate the value at startup and issue a warning if the current value is different from what you are expecting? (That would cover the upgrade/migration case?)

This enhancement would affect Doc. Don't we need a wiki page for the community until we have the official Doc updates?

Replying to [comment:12 nhosoi]:

I think we'd better open another ticket for the part "We increase the default cache memory sizes to 32MB of db and entry cache, and 16MB of dn cache", since it is not related to "Autotune threads"?

Okay, will do.

If you search nsslapd-threadnumber, it returns the calculated value, right?

Yes!

And is it written to dse.ldif?

No! Which is a good thing. If you have a VM and add more cores, when you restart it recalculates on start up. Neat huh? :)

If so, does this eval happen just once, with the calculated value set to nsslapd-threadnumber and used from then on?
{{{
threadnum = util_get_hardware_threads();
}}}

It occurs once at start up; the value is then set in cfg->threadnumber but NOT written to dse.ldif.

Maybe, even if nsslapd-threadnumber is not -1, you could calculate the value at startup and issue a warning if the current value is different from what you are expecting? (That would cover the upgrade/migration case?)

Would we want that though? If the value is "set", that indicates the admin wants the value to be something different, and we shouldn't complain. Previously threadnumber was not in dse.ldif, so upgrade is already covered, and if someone paid attention and set the value, then they don't want it changed either.

So I think upgrade and migration is already covered.

This enhancement would affect Doc. Don't we need a wiki page for the community until we have the official Doc updates?

Hmmm that's a good point. I'll write a document too.

OK, if the calculated value is not written to dse.ldif, my requests can be dropped. Thanks.

commit c27605b
Writing objects: 100% (77/77), 10.81 KiB | 0 bytes/s, done.
Total 77 (delta 65), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
86bffc8..b1f434e master -> master

Metadata Update from @nhosoi:
- Issue assigned to firstyear
- Issue set to the milestone: 1.3.6.0

7 years ago

Metadata Update from @spichugi:
- Issue close_status updated to: None (was: Fixed)

6 years ago

Metadata Update from @spichugi:
- Custom field reviewstatus adjusted to review (was: ack)

6 years ago

Hi team,
please review the test plan and the code.

There are a few failures that probably reveal new bugs.
Or the test cases may be wrong (I may be expecting the wrong behaviour, for example).
Anyway, please share your thoughts.

@spichugi some recent fixes were just applied to cache autosizing that should be included in your test:

  • nsslapd-cache-autosize can now be set over ldap. You don't need to set it by editing the dse.ldif anymore, but it still requires a restart to take effect.

  • When nsslapd-cache-autosize is set, any attempt to modify nsslapd-dbcachesize should be rejected with an error 53 (unwilling to perform). I just realized that we should reject updates to nsslapd-cachememsize as well (I'll have to send another patch out for that via https://pagure.io/389-ds-base/issue/49257).

So, except for nsslapd-cachememsize, 1.3.6/master has these changes now and we should test them.
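To make the expected behaviour explicit for the tests, here is a hypothetical sketch of the check; the helper name, signature, and error message are made up for illustration and are not the actual config callback in libglobs.c:

{{{
#include <stdio.h>

#define LDAP_SUCCESS              0
#define LDAP_UNWILLING_TO_PERFORM 53

/* Hypothetical sketch: while cache autosizing is enabled, refuse a manual
 * nsslapd-dbcachesize change with result 53 (unwilling to perform). */
static int
set_dbcachesize(int cache_autosize_enabled, long requested, long *dbcachesize,
                char *errbuf, size_t errlen)
{
    if (cache_autosize_enabled) {
        snprintf(errbuf, errlen,
                 "nsslapd-dbcachesize is managed by nsslapd-cache-autosize; "
                 "disable autosizing before setting it manually");
        return LDAP_UNWILLING_TO_PERFORM;
    }
    *dbcachesize = requested;
    return LDAP_SUCCESS;
}
}}}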

Thanks!

@mreynolds sure. I have a draft already, but I can't test it right now, because the master branch build on F25 fails due to the known issue with the old selinux-policy package ("memory.usage_in_bytes", errno=13), and I don't have a quick F26 setup at the moment due to some internal issues. So I will test it tomorrow and upload the updated test cases (or there will be a package for RHEL 7.4 with the new fixes).

Besides that, I have questions about the behaviour. Some of them were raised by errors revealed by my tests in the patch above. You can also check them if you run the suite from the patch on 389-ds-base-1.3.6.1-13.el7.x86_64.

  • First, http://www.port389.org/docs/389ds/design/autotuning.html says that we need to set nsslapd-cache-autosize: 0 and nsslapd-dbcachesize: 0, and after that nsslapd-dbcachesize will be autotuned (the same with nsslapd-cachememsize). But how will we set nsslapd-dbcachesize if the modification is rejected with error 53 (unwilling to perform)? Maybe I am missing some part of the design; if so, please correct me.

  • Second, according to the docs and the source code logic, we shouldn't set nsslapd-threadnumber to a number higher than 512. Shouldn't we add such a restriction to the modify operation? Currently we can set it to 513 or higher.

  • If we set nsslapd-cache-autosize-split (through the dse.ldif for now) to 0, then after a restart we have nsslapd-dbcachesize == 0. You can find the reproducer in the test_cache_autosize_basic_sane case.

  • Currently, we can set nsslapd-cache-autosize: -2 (or any negative number) via dse.ldif and it will be accepted and the server will start, but I think this issue is fixed in your recent patch.

Thank you!

At a quick read, I'm happy with all of this except the thread test: we shouldn't be testing this at all; we should just check threads >= 24, because we can't guarantee much else. Also, we can't know whether the number of threads might be odd. Imagine a 3 core VM? What about 6? 18? These are real CPU configs. I've seen 36 core machines too.

Instead, we can only know that autotuning set "some sane value" :)

As well, we don't know if Python will see the HARDWARE cores, whereas DS will look at the hardware, and soon at getrlimit for nproc and at cgroup limits too. So hardware cores != the cores/procs DS can use. Another trap :)
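As an illustration of the cgroup point, something like the following could clamp the sysconf count by a cgroup v1 CPU quota; this is only a sketch assuming the v1 cpu controller paths, not what DS does today:

{{{
#include <stdio.h>
#include <unistd.h>

/* Sketch: clamp the online hardware thread count by the cgroup v1 CPU quota,
 * if one is set, to approximate how many CPUs the process may actually use. */
static long
usable_cpus(void)
{
    long hw = sysconf(_SC_NPROCESSORS_ONLN);
    long quota = -1, period = -1;
    FILE *fq = fopen("/sys/fs/cgroup/cpu/cpu.cfs_quota_us", "r");
    FILE *fp = fopen("/sys/fs/cgroup/cpu/cpu.cfs_period_us", "r");

    if (fq && fp && fscanf(fq, "%ld", &quota) == 1 &&
        fscanf(fp, "%ld", &period) == 1 && quota > 0 && period > 0) {
        long limit = quota / period; /* whole CPUs allowed by the quota */
        if (limit > 0 && limit < hw) {
            hw = limit;
        }
    }
    if (fq) { fclose(fq); }
    if (fp) { fclose(fp); }
    return hw;
}
}}}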

Reply to @spichugi

  • First: so long as one of them is 0, it will autotune that one only. So if you set nsslapd-cache-autosize to 0, it resets to "10" underneath, but then if you set nsslapd-dbcachesize to 0 (even if autosize is != 0), it will be "reset". Does that make sense?
  • Second: While we cap threads in autotune because we want it to "work" and be safe, if an admin wants to set 999999 threads, that's on them ;)
  • The autosize split to 0 is a really bad one. Seriously, we should fix that immediately. Great find!
  • I can't confirm if @mreynolds' patch fixes negative numbers, but I trust it does. Regardless, let's test it.

Hope that helps!

Reply to @firstyear

  • First: so long as one of them is 0, it will autotune that one only. So if you set nsslapd-cache-autosize to 0, it resets to "10" underneath, but then if you set nsslapd-dbcachesize to 0 (even if autosize is != 0), it will be "reset". Does that make sense?

Do I understand right that we can't use Mark's suggestion from https://pagure.io/389-ds-base/issue/49021#comment-441110 because we still need to set nsslapd-dbcachesize to 0 at some point?

I've applied the comments to the new patch.
It has some failures now, though, that relate to the fact that we can't edit 'nsslapd-cachememsize'.

In http://www.port389.org/docs/389ds/design/autotuning.html, under the "Manual tuning in detail" section, we can find the information that we can autotune the values again (not only at instance installation time).
It can be done by setting nsslapd-cachememsize and nsslapd-cache-autosize to 0 (and restarting).

{{{
if (cachememsize == 0 && nsslapd-cache-autosize == 0) || nsslapd-cache-autosize > 0:
    cachememsize = auto entry cachesize value, and write to dse.ldif
}}}

Now this operation works for nsslapd-dbcachesize only.

{{{
if (dbcachesize == 0 && nsslapd-cache-autosize == 0) || nsslapd-cache-autosize > 0:
    dbcachesize = auto db cachesize value, and write to dse.ldif
}}}

You can verify all of this with my patch.

If my understanding is correct, and based on the patches I recently pushed, you should not be trying to set dbcachesize while nsslapd-cache-autosize is set. Actually, you should verify that we do get an error 53 (UNWILLING_TO_PERFORM) if it is attempted.

Uploaded a new patch according to your comment from here and from https://pagure.io/389-ds-base/issue/49257

Though, we have an inconsistency then (you can reproduce it with my test suite):
- trying to set dbcachesize while nsslapd-cache-autosize is set - no UNWILLING_TO_PERFORM and the operation is successful;
- trying to set cachememsize while nsslapd-cache-autosize is set - UNWILLING_TO_PERFORM happens.

test_cache_autosize_percentage

This test may not work, because 10 -> 20% may still hit the cap. You would need to control the amount of memory on the test system, or find some other means to assert this...

There are cases where changing this from 10 -> 20 may result in no change to the dbcache and entry cache.

So this test may just need to be removed.

@firstyear fixed, thank you.

So now we have the next failures that we need to fix (according to @mreynolds' comment https://pagure.io/389-ds-base/issue/49257#comment-442491 ):

  • trying to set nsslapd-cachememsize to 0 while nsslapd-cache-autosize is set - UNWILLING_TO_PERFORM isn't raised now;

  • trying to set nsslapd-dbcachesize to 3333333 while nsslapd-cache-autosize is set - UNWILLING_TO_PERFORM isn't raised now;

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to ack (was: review)

6 years ago

18ccdb4..4d457b5 master -> master
commit 4d457b5
Author: Simon Pichugin spichugi@redhat.com
Date: Fri Apr 28 11:03:16 2017 +0200

6725e03..a1d749f 389-ds-base-1.3.6 -> 389-ds-base-1.3.6
commit a1d749f
Author: Simon Pichugin spichugi@redhat.com
Date: Fri Apr 28 11:03:16 2017 +0200

Regarding the function choosing the number of threads for the server (util_get_hardware_threads(void), file /ldap/servers/slapd/util.c), it is not linearly increasing with the number of hardware threads.
For example, for hw_threads = 56, threads gets the value 112, when it should be 96; it drops back to 96 at hw_threads = 64. Another example: hw_threads = 480 gives threads = 720, which is beyond the cap of 512.
The value returned by this function should increase monotonically with hw_threads; right now it goes back and forth.
A snippet of the function:
~~~~
* 1 -> 16
* 2 -> 16
* 4 -> 24
* 8 -> 32
* 16 -> 48
* 32 -> 64
* 64 -> 96
* 128 -> 192
* 256 -> 384
* 512 -> 512
* 1024 -> 512
* 2048 -> 512
*/

if (hw_threads >= 0 && hw_threads < 4) {
    threads = 16;
} else if (hw_threads >= 4 && hw_threads < 32) {
    threads = 16 + (hw_threads * 2);
} else if (hw_threads >= 32 && hw_threads < 64) {
    threads = (hw_threads * 2);
} else if (hw_threads >= 64 && hw_threads < 512) {
    /* Same as 1.5 */
    threads = (hw_threads * 2) - (hw_threads / 2);
} else {
    /* Cap at 512 for now ... */
    threads = 512;
}
~~~~

It's designed to "not be linear". So this is expected. The concern was that if we increased threads too much we may hit other limits at such a large scale. We would rather work with cases individually to work out the best situation, and in the future we may review this number.

The 480 -> 720 case certainly is a bug that I should resolve, however.

I was not talking about "linear", I was talking more about "monotonic"; sorry for the wrong choice of words :)

I don't understand your "monotonic" issue here, I'm sorry :(

I think perhaps the values can go "up or down" in some cases?

Yep. I was saying that the function value should always increase along with its argument: a larger number of hw_threads means a larger resulting value of nsslapd-threadnumber. That's the definition of a mathematical monotonic function (https://en.wikipedia.org/wiki/Monotonic_function); that's what I meant. The code in git does not satisfy that condition: threadnumber goes up and down as hw_threads increases.

Okay, I see now. Sorry it took a bit. I can correct this.
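For reference, one way the ranges could be made monotonic while still reproducing the documented table values is sketched below; this is only an illustration, not the committed fix:

{{{
/* Illustrative monotonic variant: the pieces meet at the range boundaries and
 * the result is clamped to the 512 cap, so the value never decreases as
 * hw_threads grows. It still matches the documented points (1 -> 16, 4 -> 24,
 * 8 -> 32, 16 -> 48, 32 -> 64, 64 -> 96, 128 -> 192, 256 -> 384, 512+ -> 512). */
static long
autotune_threads_monotonic(long hw_threads)
{
    long threads;

    if (hw_threads < 4) {
        threads = 16;
    } else if (hw_threads < 16) {
        threads = 16 + (hw_threads * 2); /* 4 -> 24, 8 -> 32 */
    } else if (hw_threads < 64) {
        threads = 32 + hw_threads;       /* 16 -> 48, 32 -> 64 */
    } else {
        threads = (hw_threads * 3) / 2;  /* 64 -> 96, 256 -> 384 */
    }
    if (threads > 512) {
        threads = 512;                   /* cap for very large systems */
    }
    return threads;
}
}}}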

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/2080

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix

3 years ago
