Issue #49204: On small VM, autotune breaks the access of the suffixes - 389-ds-base

389-ds-base

#49204 On small VM, autotune breaks the access of the suffixes

Closed: wontfix 7 years ago Opened 7 years ago by tbordaz.

Issue Description

The issue is described in RH BZ 1435122.
Although suffix files (id2entry, entryrdn...) are valid and contains the expected entries, at startup the data appear to be missing.
An error is logged :

[31/Mar/2017:14:41:44.650378747 +051800] - ERR - _entryrdn_insert_key - Suffix "dc=testrelm,dc=test" not found: BDB0073 DB_NOTFOUND: No matching key/data pair found(-30988)

Note the failure is 100% reproducible on small VM but having not found RC, there is no guaranty it can also be reproducible on larger VM

Package Version and Platform

DS 1.3.6
IPA 4.5

Steps to reproduce

Install IPA on a small VM (1Gb memory)
2.
3.

Actual results

Failure at install

Expected results

Success

firstyear commented 7 years ago

I think I may have just in-advertently fixed this.

My current theory is that on a small machine, because our memory detection actually works now, we are breaking the hardlimit in check_and_set_import_cache, which causes ldif2db failures. As a result, setup-ds.pl then won't import the backend which causes id2entry to not exist (thus creating #49198)

I'm currently fixing the issue in https://pagure.io/389-ds-base/issue/48864, and perhaps you can test the patches there to see if they resolve the issue.

Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to new
- Custom field type adjusted to defect

7 years ago

Metadata Update from @firstyear:
- Issue assigned to firstyear

7 years ago

tbordaz commented 7 years ago

Unfortunately the fix for https://pagure.io/389-ds-base/issue/48864 does not fix https://pagure.io/389-ds-base/issue/49204.

There are the same starting errors

-rw-------. 1 dirsrv dirsrv  57344 Apr  5 00:51 entryrdn.db
-rw-------. 1 dirsrv dirsrv 172032 Apr  5 00:51 id2entry.db

 [05/Apr/2017:00:51:33.690701632 +051800] - ERR - _entryrdn_insert_key - Suffix "dc=testrelm,dc=test" not found: BDB0073 DB_NOTFOUND: No matching key/data pair found(-30988)
 [05/Apr/2017:00:51:33.754144025 +051800] - ERR - index_addordel_entry - database index operation failed BAD 1031, err=-30988 BDB0073 DB_NOTFOUND: No matching key/data pair found

and dump of entryrdn and id2entry shows that index and db are looking fine

entryrdn
    ID: 1; RDN: "dc=testrelm,dc=test"; NRDN: "dc=testrelm,dc=test"
C1

 id2entry
id 1
    rdn: dc=testrelm,dc=test
    objectClass: top
    objectClass: domain
    objectClass: pilotObject
    ...

lkrispen commented 7 years ago

you say that it is always reproducable on small VM,
- did you test on the QA beaker machine only or on a different small VM as well ?
- does it happen with earlier releases, is it a regression ?

firstyear commented 7 years ago

Have we got the log from setup-ds.pl too?

firstyear commented 7 years ago

Actually, what if it's that we are setting the BDB cache size too small in the autotuning setup?

tbordaz commented 7 years ago

@lkrispen : The bug was not reproducible on beaker machine (with larger VM). I am using QE dedicated VM to debug.
It is difficult to confirm it is regression but IMHO it is very likely. The failure occurs at DS startup BEFORE IPA runs an IPA upgrade procedure (./freeipa/install/updates/90-post_upgrade_plugins.update).
To verify it is a regression that would need to run new DS (with autotune) with old IPA installer.

@firstyear : setup-ds is launched without debug mode. It was successfull. setup-ds is a preliminary step of IPA install and DS was successful for ~10min before the failure is detected

Metadata Update from @tbordaz:
- Custom field reviewstatus reset (from new)

7 years ago

firstyear commented 7 years ago

I'm was walking through the shopping centre today and I wondered if there was a condition by which the machine is under enough memory pressure that we autotune to 0. That would certainly cause some issues.

I'll be checking this case, but it's only a guess, not a reall investigated solution.

firstyear commented 7 years ago

Hey mate,

I ran a build of DS in a container with a 256M memory limit. I can see it OOMing a highmemory test:

Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [29887] 1343600009 29887     9210        0      24      509             0 zsh
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [ 3807] 1343600009  3807    30291       29      60     3958             0 gdb
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [ 3826] 1343600009  3826 4563422503    47969     292    78569             0 lt-test_libsds
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: Memory cgroup out of memory: Kill process 3826 (lt-test_libsds) score 967 or sacrifice child
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: Killed process 3826 (lt-test_libsds) total-vm:18253690012kB, anon-rss:191876kB, file-rss:0kB, shmem-rss:0kB

Running setup-ds.pl, I can see the following warnings:

[10/Apr/2017:13:20:05.856337382 +1000] - WARN - dblayer_start - Likely CONFIGURATION ERROR -dbcachesize is configured to use more than the available physical memory, decreased to the largest available size (0 bytes).
[10/Apr/2017:13:20:05.901729073 +1000] - INFO - dblayer_instance_start - Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[10/Apr/2017:13:20:05.917218763 +1000] - INFO - check_and_set_import_cache - pagesize: 4096, available bytes 6070214656, process usage 65798144 
[10/Apr/2017:13:20:05.935436023 +1000] - INFO - check_and_set_import_cache - Import allocates 2371177KB import cache.
[10/Apr/2017:13:20:06.102251864 +1000] - INFO - import_main_offline - import userRoot: Beginning import job...
[10/Apr/2017:13:20:06.117203867 +1000] - INFO - import_main_offline - import userRoot: Index buffering enabled with bucket size 100
[10/Apr/2017:13:20:06.327920654 +1000] - INFO - import_producer - import userRoot: Processing file "/tmp/ldifhfIEEp.ldif"
[10/Apr/2017:13:20:06.361842151 +1000] - INFO - import_producer - import userRoot: Finished scanning file "/tmp/ldifhfIEEp.ldif" (9 entries)
[10/Apr/2017:13:20:06.846415798 +1000] - INFO - import_monitor_threads - import userRoot: Workers finished; cleaning up...
[10/Apr/2017:13:20:07.065802887 +1000] - INFO - import_monitor_threads - import userRoot: Workers cleaned up.
[10/Apr/2017:13:20:07.084330172 +1000] - INFO - import_main_offline - import userRoot: Cleaning up producer thread...
[10/Apr/2017:13:20:07.098956207 +1000] - INFO - import_main_offline - import userRoot: Indexing complete.  Post-processing...
[10/Apr/2017:13:20:07.111706878 +1000] - INFO - import_main_offline - import userRoot: Generating numsubordinates (this may take several minutes to complete)...
[10/Apr/2017:13:20:07.150434875 +1000] - INFO - import_main_offline - import userRoot: Generating numSubordinates complete.
[10/Apr/2017:13:20:07.170592164 +1000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Gathering ancestorid non-leaf IDs...
[10/Apr/2017:13:20:07.185626419 +1000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Finished gathering ancestorid non-leaf IDs.
[10/Apr/2017:13:20:07.211114186 +1000] - INFO - ldbm_ancestorid_new_idl_create_index - import userRoot: Creating ancestorid index (new idl)...
[10/Apr/2017:13:20:07.229210532 +1000] - INFO - ldbm_ancestorid_new_idl_create_index - import userRoot: Created ancestorid index (new idl).
[10/Apr/2017:13:20:07.246735160 +1000] - INFO - import_main_offline - import userRoot: Flushing caches...
[10/Apr/2017:13:20:07.265159211 +1000] - INFO - import_main_offline - import userRoot: Closing files...
[10/Apr/2017:13:20:07.510799223 +1000] - INFO - dblayer_pre_close - All database threads now stopped
[10/Apr/2017:13:20:07.524294478 +1000] - INFO - import_main_offline - import userRoot: Import complete.  Processed 9 entries in 1 seconds. (9.00 entries/sec)

+[10/Apr/2017:13:20:09.122785218 +1000] - NOTICE - ldbm_back_start - found 262144k physical memory
+[10/Apr/2017:13:20:09.133959896 +1000] - NOTICE - ldbm_back_start - found 64197k avaliable
+[10/Apr/2017:13:20:09.150539284 +1000] - NOTICE - ldbm_back_start - cache autosizing: db cache: 10485k
+[10/Apr/2017:13:20:09.169102140 +1000] - NOTICE - ldbm_back_start - cache autosizing: userRoot entry cache (1 total): 65536k
+[10/Apr/2017:13:20:09.188007215 +1000] - WARN - entrycache_set_max_size - Cachesize (47374540) may use more than the available physical memory.
+[10/Apr/2017:13:20:09.202063252 +1000] - NOTICE - ldbm_back_start - total cache size: 92476014 B; 
+[10/Apr/2017:13:20:09.213497617 +1000] - WARN - ldbm_back_start - It is highly likely your memory configuration of all backends will EXCEED your systems memory.
+[10/Apr/2017:13:20:09.228089373 +1000] - WARN - ldbm_back_start - In a future release this WILL prevent server start up. You MUST alter your configuration.
+[10/Apr/2017:13:20:09.246454204 +1000] - WARN - ldbm_back_start - Total entry cache size: 49303756 B; dbcache size: 8589934 B; available memory size: 65738342 B; 
+[10/Apr/2017:13:20:09.264939787 +1000] - WARN - ldbm_back_start - This can be corrected by altering the values of nsslapd-dbcachesize, nsslapd-cachememsize and nsslapd-dncachememsize
+
+[10/Apr/2017:13:20:09.280016381 +1000] - WARN - dblayer_start - Likely CONFIGURATION ERROR -dbcachesize is configured to use more than the available physical memory, decreased to the largest available size (8589934 bytes).
+[10/Apr/2017:13:20:09.297357115 +1000] - INFO - dblayer_start - Resizing db cache size: 2428085862 -> 8589934
+[10/Apr/2017:13:20:10.573234982 +1000] - INFO - slapd_daemon - slapd started.  Listening on All Interfaces port 389 for LDAP requests

However, it looks like the server starts, with all entries added:

The only bit I'm concerned about here is:

[10/Apr/2017:13:20:05.856337382 +1000] - WARN - dblayer_start - Likely CONFIGURATION ERROR -dbcachesize is configured to use more than the available physical memory, decreased to the largest available size (0 bytes).

This would be during ldif2db in setup-ds.pl. However, the server still starts correctly.

I'll make a patch to prevent 0 from being set on these values, but this may not be the cause of the issue.

firstyear commented 7 years ago

[10/Apr/2017:13:38:44.393226936 +1000] - INFO - check_and_set_import_cache - Import allocates 6211KB import cache.

Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to review

7 years ago

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to ack (was: review)

7 years ago

tbordaz commented 7 years ago

The patch looks good but note that it does not fully fix the problem. It keeps occurring

[11/Apr/2017:02:57:59.416666948 +051800] - ERR - _entryrdn_insert_key - Suffix "dc=testrelm,dc=test" not found: BDB0073 DB_NOTFOUND: No matching key/data pair found(-30988)
[11/Apr/2017:02:57:59.492313735 +051800] - ERR - index_addordel_entry - database index operation failed BAD 1031, err=-30988 BDB0073 DB_NOTFOUND: No matching key/data pair found

tbordaz commented 7 years ago

The root cause is possibly identified: If at startup memory pressure prevent setting of entrycache size, this entrycache remains NULL.

Running long duration tests (the problem use to occur 50%-75%) with the following patch.

This is likely NOT the definitive fix

firstyear commented 7 years ago

So this fixes the problem you have seen @tbordaz ?

I would be curious to see the value of delta in this log output. I think the idea of this change is good though, but the catch is that this is where the user sets the cachememsize via cn=config. So we check the delta of val - current size, to see if we can accomodate the difference because we already consume current_size in memory used (*mi). So really, delta is valid if it's 0, because you aren't changing the size at all, and for the sanity check to be carried out, delta must be a positive value relative to current size.

Second is that this code is only called from a change to cn=config, so I'm still not 100% it's the fix.

Perhaps the real fix is that util_is_cachesize_sane shouldn't be reducing the size of the allocation so much, and we should check it's output. IE if the request is say ... 1mb, we always allow it, and if it's greater than that we have enough room to actually do a reduction. As well, we should check that the result is never 0 as well.

firstyear commented 7 years ago

Hey @tbordaz . I've had a look at your fix and tweaked it a bit. I squashed all three together to this single patch, so I hope that it helps and fixes the issue for you.

tbordaz commented 7 years ago

Long duration tests completed successfully with https://pagure.io/389-ds-base/issue/raw/files/6f730599b09c6e098488cb1601550e7160e2918deafcc482c74323a86ab89b2c-0001-Ticket-49204-On-small-VM-autotune-breaks-the-access-.patch

I will check if the error came from entrycache being zeroed or uninitialized

tbordaz commented 7 years ago

The entrycache.size is not uninitialized, it can be 0 or some specific value (like 512K). So I think the error comes from the failure (LDAP_UNWILLING_TO_PERFORM) when parsing dse.ldif. The failure itself is triggered by memory pressure but I have not a clear understanding how the failure (unwilling) can lead to this weird error (suffix entries not readable).

Testing https://pagure.io/389-ds-base/issue/raw/files/754f2ef07452640a3b5af46a9ae95ec07537d03e62b9f4937b371b79a9753dcf-0001-Ticket-49204-Fix-lower-bounds-on-import-autosize-On-.patch

tbordaz commented 7 years ago

Testing https://pagure.io/389-ds-base/issue/raw/files/754f2ef07452640a3b5af46a9ae95ec07537d03e62b9f4937b371b79a9753dcf-0001-Ticket-49204-Fix-lower-bounds-on-import-autosize-On-.patch

Tests are inconclusive: patch was successful for the first 18 run but after DS crashed systematically for the remaining run. Unfortunately ABRT did not dump the core because of the memory pressure.

In conclusion:
- The patch looks good to fix the current ticket, because of the 18 success in a raw
- The patch may introduce an other problem or we may have an environment issue (memory pressure too high). Note that the available memory is 1Gb+1.5Gb_swap.
- I will rerun tests

firstyear commented 7 years ago

How about we take the patch as it currently is, and I try running DS in a container with say 64mb of ram available? That should provide a better env for me to reproduce in?

tbordaz commented 7 years ago

The failures were triggered by IPC being exhausted by apache. Doing regular cleanup, the tests (25first runs) were successful.
IMHO the fix is ok. ACK

firstyear commented 7 years ago

commit ae44ed4
To ssh://git@pagure.io/389-ds-base.git
8979cc6..ae44ed4 master -> master