#49669 Autotuning breaks archive2db on a system with >=32GB RAM
Closed: wontfix 4 years ago Opened 4 years ago by vashirov.

Issue Description

On a system with >=32GB RAM, autotuning sets nsslapd-cachememsize to a value larger than INT_MAX:

sh-4.4# grep Mem /proc/meminfo 
MemTotal:       65401208 kB
MemFree:        43524404 kB
MemAvailable:   58409212 kB

In the logs:

[10/May/2018:20:44:47.490690337 +0000] - ERR - ldbm_config_set - Value 4630511616 for attr nsslapd-cachememsize is greater than the maximum 2147483647
[10/May/2018:20:44:47.494356162 +0000] - ERR - parse_ldbm_instance_config_entry - Error with config attribute nsslapd-cachememsize : Error: value 4630511616 for attr nsslapd-cachememsize is greater than the maximum 2147483647
[10/May/2018:20:44:47.498026129 +0000] - ERR - ldbm_instance_config_load_dse_info - Error parsing the config DSE
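
For illustration, the arithmetic behind the error above: 4630511616 bytes is 4.3125 GiB, which is more than twice the reported maximum of 2147483647 (INT_MAX on platforms with a 32-bit int). A minimal sketch of such a range check follows; the helper name fits_in_int_max is hypothetical and is not taken from the server source.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of 389-ds-base: reports whether a
 * requested size can be stored in a signed int without overflow. */
static bool fits_in_int_max(uint64_t value)
{
    return value <= (uint64_t)INT_MAX;
}

/* Convenience wrapper for the value seen in the log above. */
static bool logged_value_fits(void)
{
    const uint64_t logged = 4630511616ULL; /* value from the ERR line */
    return fits_in_int_max(logged);        /* false when int is 32 bits wide */
}
```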

During normal operation it is corrected, for example on a different system:

[10/May/2018:16:33:13.489337009 -0400] - WARN - ldbm_instance_config_cachememsize_set - delta +22407994880 of request 22408506880 reduced to 7872421888

But when ns-slapd is running in archive2db mode, we don't correct the value and a segmentation fault can occur:

#0  0x00007ffff4646d2e in __strcmp_sse2_unaligned () at /lib64/libc.so.6
#1  0x00007ffff796bcc0 in slapd_comp_path () at /usr/lib64/dirsrv/libslapd.so.0
#2  0x00007fffea9ab630 in dblayer_restore (li=0x555556084c80, src_dir=0x55555634c6c0 "/var/lib/dirsrv/slapd-standalone1/bak/backup_test", task=0x0, bename=0x0) at ldap/servers/slapd/back-ldbm/dblayer.c:6569
#3  0x00007fffea99c75a in ldbm_back_archive2ldbm (pb=0x555556390000) at ldap/servers/slapd/back-ldbm/archive.c:165
#4  0x0000555555565011 in slapd_exemode_archive2db (mcfg=0x7fffffffe880) at ldap/servers/slapd/main.c:2571
#5  0x0000555555565011 in main (argc=6, argv=0x7fffffffec58) at ldap/servers/slapd/main.c:953  

Apparently inst->inst_dir_name is NULL, because we bail out before it is set to a valid value.

Package Version and Platform

Fedora 28

Steps to reproduce

  1. Run dirsrvtests/tests/suites/basic/basic_test.py::test_basic_backup on a machine with >=32GB RAM

With the fix from 5d700cc, basic_test.py::test_basic_import_export no longer crashes the server, but it would still be good to make backup/restore more robust to avoid potential crashes.

Metadata Update from @vashirov:
- Custom field component adjusted to None
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None
- Custom field type adjusted to None
- Custom field version adjusted to None

4 years ago

Metadata Update from @mreynolds:
- Issue assigned to mreynolds

4 years ago

I have a fix for this, but I don't know if it's right. Is the bug that we are not correcting the cache size? Or should we just abort the bak2db when the inst struct has a NULL parent dir because of the invalid cache size? I need to look into this more...

Metadata Update from @mreynolds:
- Assignee reset

4 years ago

Metadata Update from @mreynolds:
- Issue assigned to mreynolds

4 years ago

I'd say it's a combination of three bugs:
1. We should correct the cache size if it's larger than available RAM in archive2db/db2archive modes.
2. We should not continue with archive2db if some of the required parameters are invalid or NULL.
3. slapd_comp_path() should be more robust and check for NULL pointer.
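
A rough sketch of what the three points above could look like in isolation. All names here (clamp_cache_size, restore_params_valid, safe_path_cmp) are hypothetical and do not come from the server source; plain strcmp stands in for the real path comparison, and the clamp is a simplification of whatever the backend actually does when it reduces a request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Point 1 (hypothetical): never hand out more cache than is available. */
static uint64_t clamp_cache_size(uint64_t requested, uint64_t available)
{
    return requested > available ? available : requested;
}

/* Point 2 (hypothetical): refuse to start a restore when a required
 * parameter is missing, instead of crashing later. */
static int restore_params_valid(const char *src_dir, const char *inst_dir_name)
{
    return src_dir != NULL && inst_dir_name != NULL;
}

/* Point 3 (hypothetical): NULL-tolerant comparison. Two NULLs compare
 * equal, and a NULL sorts before any non-NULL string. */
static int safe_path_cmp(const char *a, const char *b)
{
    if (a == NULL || b == NULL) {
        return (a != NULL) - (b != NULL);
    }
    return strcmp(a, b);
}
```

With a check like restore_params_valid in place, the NULL inst_dir_name from the backtrace would produce an error return rather than reaching the comparison at all; the NULL-tolerant comparison is then only a second line of defence.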

Okay, so the problem only occurs when the value is larger than the integral type can hold (unsigned 64-bit integer). This check is done right before any of the backend config set functions are called. We error out before we can adjust the value, because we never actually call the set function.

Autotuning would never set a cache size higher than a uint64_t can hold (it's impossible), and that's the only condition that would trigger this error. It would be very invasive to the backend code to try to resize the cache if it's larger than a uint64 in this scenario.

So [1] will be addressed if the value is a valid number (in the range of uint64), but if you manually set the cache size to 10000000000000000000000000000000, the restore/backup will fail. Otherwise it will resize, if needed, and complete.
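
To illustrate the distinction between a value that is in range for uint64 and one that is not, here is a minimal sketch using the standard strtoull() call. The function name parse_u64 is hypothetical, it is not how the server parses its config, and it assumes unsigned long long is 64 bits wide, as on common platforms.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal string into a uint64_t.
 * Returns false for NULL, non-numeric input, trailing garbage,
 * or a value too large to represent. */
static bool parse_u64(const char *text, uint64_t *out)
{
    char *end = NULL;
    unsigned long long parsed;

    /* Require a leading digit so that signs and whitespace are rejected. */
    if (text == NULL || !isdigit((unsigned char)*text)) {
        return false;
    }

    errno = 0;
    parsed = strtoull(text, &end, 10);
    if (errno == ERANGE || *end != '\0') {
        return false; /* overflow or trailing characters */
    }

    *out = (uint64_t)parsed;
    return true;
}
```

Under these assumptions, "4630511616" parses successfully and could then be resized as needed, while "10000000000000000000000000000000" is rejected because strtoull() reports ERANGE.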

PR to come soon...

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to review (was: None)

4 years ago

Metadata Update from @mreynolds:
- Issue close_status updated to: fixed
- Issue set to the milestone: 1.4.0
- Issue status updated to: Closed (was: Open)

4 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/2728

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

2 years ago
