#48384 Server startup should warn about values consuming too much ram
Closed: wontfix None Opened 8 years ago by firstyear.

Based on tbordaz's formula:

nsslapd-dbcachesize: 200000000
nsslapd-cachememsize: 250000000
nsslapd-dncachememsize: 50485760

- The reasoning behind this is:

(250 + 200 + 50) *20 = 10Gb

If these values will consume more than 80% of system ram, then we should issue a warning to the error log about the potential for high memory usage.


There is a helper function dblayer_is_cachesize_sane(size_t *cachesize) in dblayer.c, which is called when entry cache size is set in entrycache_set_max_size in cache.c. It's supposed to log in the error log if it is not available.
LDAPDebug(LDAP_DEBUG_ANY,
"WARNING -- Possible CONFIGURATION ERROR -- cachesize "
"(%lu) may be configured to use more than the available "
"physical memory.\n", bytes, 0, 0);

If it is not working, dblayer_is_cachesize_sane has to be debugged.

This first patch corrects the behaviour of dblayer_is_cachesize_sane and dblayer_sys_pages.

Right now, there is a hardcoded define to CONSERVATIVE to take advantage of the new page checking routine I have written: the intent would be to remove this define flag in the future. Right now this allows a trivial change to use the older behaviour.

A following patch will use dblayer_is_cachesize_sane to check the sum of all cachesizes in sane with regards to the pages available on the system.

The main outstanding point of this patch is the behaviour of dblayer_is_cachesize_sane when the value is "not sane". Right now we adjust the value down to a sane number, and leave issane set to 0, which causes the adjusted value to not be used. Should we be changing the value of issane after the adjustment, or should we continue to let the server fail on startup instead?

The comparison logs are:

Before

[12/Jan/2016:15:32:31 +1000] - dblayer_is_cachesize_sane cachesize=10000000 / pagesize=4096
[12/Jan/2016:15:32:31 +1000] - dblayer_is_cachesize_sane pages=3050679 - procpages=268506199
[12/Jan/2016:15:32:31 +1000] - dblayer_is_cachesize_sane issane=1
// Yes this then checked if (10000000 / 4096) <= (3050679 - 268506199) ... and it passed.

After

[13/Jan/2016:12:00:56 +1000] - dblayer_sys_pages pages=3050679, vmsize=268505992,
[13/Jan/2016:12:00:56 +1000] - dblayer_sys_pages using pages for pages
[13/Jan/2016:12:00:56 +1000] - dblayer_sys_pages pages=3050679, getrlim=-1, freesize=1470632
[13/Jan/2016:12:00:56 +1000] - dblayer_sys_pages using freesize for availpages
[13/Jan/2016:12:00:56 +1000] - dblayer_sys_pages USING pages=3050679, procpages=16491, availpages=1470632
[13/Jan/2016:12:00:56 +1000] - dblayer_is_cachesize_sane cachesize=10000000 / pagesize=4096
[13/Jan/2016:12:00:56 +1000] - dblayer_is_cachesize_sane cachepages=2441 <= availpages=1470632
[13/Jan/2016:12:00:56 +1000] - dblayer_is_cachesize_sane issane=1

[13/Jan/2016:12:10:10 +1000] - dblayer_sys_pages pages=3050679, vmsize=268505492,
[13/Jan/2016:12:10:10 +1000] - dblayer_sys_pages using pages for pages
[13/Jan/2016:12:10:10 +1000] - dblayer_sys_pages pages=3050679, getrlim=4294967295, freesize=1452295
[13/Jan/2016:12:10:10 +1000] - dblayer_sys_pages using freesize for availpages
[13/Jan/2016:12:10:10 +1000] - dblayer_sys_pages USING pages=3050679, procpages=15632, availpages=1452295
[13/Jan/2016:12:10:10 +1000] - dblayer_is_cachesize_sane cachesize=3457482752 / pagesize=4096
[13/Jan/2016:12:10:10 +1000] - dblayer_is_cachesize_sane cachepages=2941264 <= availpages=1452295
[13/Jan/2016:12:10:10 +1000] dblayer_is_cachesize_sane - WARNING adjusted cachesize to 2974298112
[13/Jan/2016:12:10:10 +1000] - dblayer_is_cachesize_sane issane=0
[13/Jan/2016:12:10:10 +1000] - Error: dbcachememsize value is too large.
[13/Jan/2016:12:10:10 +1000] - Error with config attribute nsslapd-dbcachesize : Error: dbcachememsize value is too large.

Your fix looks good to me. (You may be thinking to come up with another version, but I thought it'd be worth discussing about the current patch since the code might be reused in the new implementation...)

1) This is from #comment:4.

Yes this then checked if (10000000 / 4096) <= (3050679 - 268506199) ... and it passed.
The evaluation is totally wrong, but it did not pass, didn't it? The right hand side value (3050679 - 268506199) is negative...

2) You put this comment in your proposal.
{{{
1177 / Since we adjusted the value, we should set this value to be sane?/
1178 / issane = 1 /
}}}
dblayer_is_cachesize_sane is used in 2 cases. One is for the cache size to be reset dynamically. If the new value is insane, the attempt fails and discarded. It requires the info that the value is insane. Another is for the server start up. In this case, the insane value is discarded and the adjusted size is used with this warning. So, please don't set 1 to issane when adjusted.
{{{
if (! dblayer_is_cachesize_sane(&bytes)) {
LDAPDebug(LDAP_DEBUG_ANY,
"WARNING -- Possible CONFIGURATION ERROR -- cachesize "
"(%lu) may be configured to use more than the available "
"physical memory.\n", bytes, 0, 0);
}
}}}

3) Could you please fix this typo? (I don't know how long it's been there... :)
{{{
1044 1159 /* If the requested cache size is larger than the remaining pysical memory
^^^^^^^
}}}

Replying to [comment:6 nhosoi]:

Your fix looks good to me. (You may be thinking to come up with another version, but I thought it'd be worth discussing about the current patch since the code might be reused in the new implementation...)

1) This is from #comment:4.

Yes this then checked if (10000000 / 4096) <= (3050679 - 268506199) ... and it passed.
The evaluation is totally wrong, but it did not pass, didn't it? The right hand side value (3050679 - 268506199) is negative...

Because this ends up in a size_t which is unsigned, it underflows to a large positive value I believe. Therefore, it does pass the check.

2) You put this comment in your proposal.
{{{
1177 / Since we adjusted the value, we should set this value to be sane?/
1178 / issane = 1 /
}}}
dblayer_is_cachesize_sane is used in 2 cases. One is for the cache size to be reset dynamically. If the new value is insane, the attempt fails and discarded. It requires the info that the value is insane. Another is for the server start up. In this case, the insane value is discarded and the adjusted size is used with this warning. So, please don't set 1 to issane when adjusted.

Noted: I will not adjust this value.

{{{
if (! dblayer_is_cachesize_sane(&bytes)) {
LDAPDebug(LDAP_DEBUG_ANY,
"WARNING -- Possible CONFIGURATION ERROR -- cachesize "
"(%lu) may be configured to use more than the available "
"physical memory.\n", bytes, 0, 0);
}
}}}

3) Could you please fix this typo? (I don't know how long it's been there... :)
{{{
1044 1159 /* If the requested cache size is larger than the remaining pysical memory
^^^^^^^
}}}

I will fix this.

The main question I want answered is: Do we approve of the new "conservative" calculation, or do we want it to be optional and we retain the existing calculations?

Replying to [comment:7 firstyear]:

The main question I want answered is: Do we approve of the new "conservative" calculation, or do we want it to be optional and we retain the existing calculations?

The main purpose is not to run into the memory full or killed by OOM killer. So, "conservative" calculation sounds good to me. Outside of the caches, the program needs memory for running, which is not calculated at the initializing time. We need the room for such natural growth...

This second version of the patch moves the is sane check and sys pages check to util.c. This way we can call them from elsewhere. We also change the way we get total and free memory as it turns out linux's sysinfo call doesn't do the right thing and we need to read /proc/meminfo.

The next stage to making this work:

  • Clean up the memory auto-allocation for ldbm backends where it uses "fuzzy approximations".
  • Add a check before the backends are started to view their cache configs, and validate the sum of all of them with util_is_cachesize_sane.

For now, this patch can be merged, and the next changes will be delivered in a new patch.

I have not fully reviewed the patch, but two comments:

  • changes in the compiler options should be a separate commit, not a side effect of a specific fix
  • do the util functions have to be in the public api, I think slapi-private.h would be ok

One small remark, if for any reason getrlimit fails util_getvirtualmemsize may return a non null invalid value.

2 minor requests...
1) You are very welcome to add the noise(s).
/ We should probably make noise here! /

2) As you wrote this comment and did the case here,
{{{
/ This is stupid. If you set %u to %zu to print a size_t, you get literal %zu in your logs
* So do the filthy cast instead.
/
}}}
You may want to add the casts to make coverity happier?
1738 LDAPDebug(LDAP_DEBUG_TRACE,"util_is_cachesize_sane cachesize=%u / pagesize=%u \n",*cachesize,pagesize,0);
1743 LDAPDebug(LDAP_DEBUG_TRACE,"util_is_cachesize_sane cachepages=%u <= availpages=%u\n",cachepages,availpages,0);
1755 LDAPDebug(LDAP_DEBUG_TRACE,"util_is_cachesize_sane pages=%u - procpages=%u\n",pages,procpages,0);
1758 LDAPDebug(LDAP_DEBUG_TRACE,"util_is_cachesize_sane cachepages=%u <= freepages=%u\n",cachepages,freepages,0);

Replying to [comment:10 lkrispen]:

I have not fully reviewed the patch, but two comments:

  • changes in the compiler options should be a separate commit, not a side effect of a specific fix

Sorry, that was a mistake, they weren't meant to be left in there. I'll remove them.

  • do the util functions have to be in the public api, I think slapi-private.h would be ok

I'm not 100% on what each of the .h files exposes and to what scope.

I would think that these functions have use to slapd and any plugins of slapd. But I do not think they would be part of a "public" api.

perhaps for now we make them private, and if we want to use them in a plugin in the future, we relocate their header definitions then?

One small remark, if for any reason getrlimit fails util_getvirtualmemsize may return a non null invalid value.

Hmmm. This might be something we should improve and check for with the getrlimit call. I see that the hpux code calls this too, so we should be sure to keep it compatible.

1) You are very welcome to add the noise(s).

Will do.

2) As you wrote this comment and did the case here,

gcc only flagged those few cases. I'll fix up the rest now.

Apply recomendations. Correct issues in ldbm config calls to check sanity.
0001-Ticket-48384-Fix-dblayer_is_cachesize_sane-and-dblay.3.patch

Looks good to me. Great job!! You have my ack.

I'll leave the Review menu for Ludwig or Thierry... ;)

Replying to [comment:14 nhosoi]:

Looks good to me. Great job!! You have my ack.

I'll leave the Review menu for Ludwig or Thierry... ;)

Looks good to me as well. Ack.

Leaving the Review menu for the next one.. Ludwig ;)

It looks good, but I didn't check the Solaris or hpux stuff, we should eventually determine which platforms we will continue support and remove some uneeded ifdefs

commit ee49672d92e929d82ae5a443ce34e681ee56f925

To ssh://git.fedorahosted.org/git/389/ds.git
9741f24..476af05 master -> master

This is a good start but we may need to revisit this warning again in the future. There are some other issues I found in the process of the ticket, but that will be a new trac item.

git patch file (master) -- define GIGABYTE for the 32bit build.
0001-Ticket-48384-Fix-dblayer_is_cachesize_sane-and-dblay.4.patch

One line fix to master (32bit only):
d4deb29..333f963 master -> master
commit 333f963

Metadata Update from @nhosoi:
- Issue assigned to firstyear
- Issue set to the milestone: 1.3.5 backlog

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1715

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago

Login to comment on this ticket.

Metadata