The issue is described in RH BZ 1435122. Although the suffix files (id2entry, entryrdn, ...) are valid and contain the expected entries, at startup the data appear to be missing. An error is logged:
[31/Mar/2017:14:41:44.650378747 +051800] - ERR - _entryrdn_insert_key - Suffix "dc=testrelm,dc=test" not found: BDB0073 DB_NOTFOUND: No matching key/data pair found(-30988)
Note that the failure is 100% reproducible on a small VM, but since the root cause has not been found, there is no guarantee that it can also be reproduced on a larger VM.
DS 1.3.6 IPA 4.5
Failure at install
Success
I think I may have just inadvertently fixed this.
My current theory is that on a small machine, because our memory detection actually works now, we are breaking the hard limit in check_and_set_import_cache, which causes ldif2db failures. As a result, setup-ds.pl won't import the backend, so id2entry does not exist (thus creating #49198).
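If that theory holds, the failure mode would be a hard-limit check along these lines (a minimal sketch, not the actual check_and_set_import_cache code; the function name, signature, and behaviour here are illustrative assumptions):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative hard-limit check: if the requested import cache exceeds
 * what the machine can actually provide, reject the import outright
 * instead of shrinking the request. On a small VM, an autotuned
 * request can easily trip this and make ldif2db fail, so setup-ds.pl
 * never creates id2entry. */
int
check_import_cache(uint64_t requested_bytes, uint64_t available_bytes)
{
    if (requested_bytes > available_bytes) {
        fprintf(stderr,
                "import cache %llu bytes exceeds available %llu bytes\n",
                (unsigned long long)requested_bytes,
                (unsigned long long)available_bytes);
        return -1; /* hard failure: the import aborts */
    }
    return 0;
}
```

With this shape of check, a machine whose available memory dips below the autotuned request fails the import entirely rather than degrading gracefully.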
I'm currently fixing the issue in https://pagure.io/389-ds-base/issue/48864, and perhaps you can test the patches there to see if they resolve the issue.
Metadata Update from @firstyear: - Custom field reviewstatus adjusted to new - Custom field type adjusted to defect
Metadata Update from @firstyear: - Issue assigned to firstyear
Unfortunately the fix for https://pagure.io/389-ds-base/issue/48864 does not fix https://pagure.io/389-ds-base/issue/49204.
The same startup errors are present:
-rw-------. 1 dirsrv dirsrv  57344 Apr  5 00:51 entryrdn.db
-rw-------. 1 dirsrv dirsrv 172032 Apr  5 00:51 id2entry.db

[05/Apr/2017:00:51:33.690701632 +051800] - ERR - _entryrdn_insert_key - Suffix "dc=testrelm,dc=test" not found: BDB0073 DB_NOTFOUND: No matching key/data pair found(-30988)
[05/Apr/2017:00:51:33.754144025 +051800] - ERR - index_addordel_entry - database index operation failed BAD 1031, err=-30988 BDB0073 DB_NOTFOUND: No matching key/data pair found
and a dump of entryrdn and id2entry shows that the index and db look fine:
entryrdn:
ID: 1; RDN: "dc=testrelm,dc=test"; NRDN: "dc=testrelm,dc=test"
C1

id2entry:
id 1
rdn: dc=testrelm,dc=test
objectClass: top
objectClass: domain
objectClass: pilotObject
...
You say that it is always reproducible on a small VM:
- did you test on the QA beaker machine only, or on a different small VM as well?
- does it happen with earlier releases; is it a regression?
Have we got the log from setup-ds.pl too?
Actually, what if it's that we are setting the BDB cache size too small in the autotuning setup?
@lkrispen: The bug was not reproducible on the beaker machine (a larger VM). I am using a QE dedicated VM to debug. It is difficult to confirm it is a regression, but IMHO it is very likely. The failure occurs at DS startup BEFORE IPA runs its upgrade procedure (./freeipa/install/updates/90-post_upgrade_plugins.update). Verifying it is a regression would require running the new DS (with autotune) with the old IPA installer.
@firstyear: setup-ds was launched without debug mode. It was successful. setup-ds is a preliminary step of the IPA install, and DS ran successfully for ~10 min before the failure was detected.
Metadata Update from @tbordaz: - Custom field reviewstatus reset (from new)
I was walking through the shopping centre today and wondered if there is a condition under which the machine is under enough memory pressure that we autotune to 0. That would certainly cause some issues.
I'll be checking this case, but it's only a guess, not a really investigated solution.
Hey mate,
I ran a build of DS in a container with a 256M memory limit. I can see it OOMing a high-memory test:
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [ pid ]   uid        tgid   total_vm   rss  nr_ptes swapents oom_score_adj name
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [29887] 1343600009  29887       9210     0       24      509             0 zsh
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [ 3807] 1343600009   3807      30291    29       60     3958             0 gdb
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [ 3826] 1343600009   3826 4563422503 47969      292    78569             0 lt-test_libsds
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: Memory cgroup out of memory: Kill process 3826 (lt-test_libsds) score 967 or sacrifice child
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: Killed process 3826 (lt-test_libsds) total-vm:18253690012kB, anon-rss:191876kB, file-rss:0kB, shmem-rss:0kB
Running setup-ds.pl, I can see the following warnings:
[10/Apr/2017:13:20:05.856337382 +1000] - WARN - dblayer_start - Likely CONFIGURATION ERROR -dbcachesize is configured to use more than the available physical memory, decreased to the largest available size (0 bytes).
[10/Apr/2017:13:20:05.901729073 +1000] - INFO - dblayer_instance_start - Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[10/Apr/2017:13:20:05.917218763 +1000] - INFO - check_and_set_import_cache - pagesize: 4096, available bytes 6070214656, process usage 65798144
[10/Apr/2017:13:20:05.935436023 +1000] - INFO - check_and_set_import_cache - Import allocates 2371177KB import cache.
[10/Apr/2017:13:20:06.102251864 +1000] - INFO - import_main_offline - import userRoot: Beginning import job...
[10/Apr/2017:13:20:06.117203867 +1000] - INFO - import_main_offline - import userRoot: Index buffering enabled with bucket size 100
[10/Apr/2017:13:20:06.327920654 +1000] - INFO - import_producer - import userRoot: Processing file "/tmp/ldifhfIEEp.ldif"
[10/Apr/2017:13:20:06.361842151 +1000] - INFO - import_producer - import userRoot: Finished scanning file "/tmp/ldifhfIEEp.ldif" (9 entries)
[10/Apr/2017:13:20:06.846415798 +1000] - INFO - import_monitor_threads - import userRoot: Workers finished; cleaning up...
[10/Apr/2017:13:20:07.065802887 +1000] - INFO - import_monitor_threads - import userRoot: Workers cleaned up.
[10/Apr/2017:13:20:07.084330172 +1000] - INFO - import_main_offline - import userRoot: Cleaning up producer thread...
[10/Apr/2017:13:20:07.098956207 +1000] - INFO - import_main_offline - import userRoot: Indexing complete. Post-processing...
[10/Apr/2017:13:20:07.111706878 +1000] - INFO - import_main_offline - import userRoot: Generating numsubordinates (this may take several minutes to complete)...
[10/Apr/2017:13:20:07.150434875 +1000] - INFO - import_main_offline - import userRoot: Generating numSubordinates complete.
[10/Apr/2017:13:20:07.170592164 +1000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Gathering ancestorid non-leaf IDs...
[10/Apr/2017:13:20:07.185626419 +1000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Finished gathering ancestorid non-leaf IDs.
[10/Apr/2017:13:20:07.211114186 +1000] - INFO - ldbm_ancestorid_new_idl_create_index - import userRoot: Creating ancestorid index (new idl)...
[10/Apr/2017:13:20:07.229210532 +1000] - INFO - ldbm_ancestorid_new_idl_create_index - import userRoot: Created ancestorid index (new idl).
[10/Apr/2017:13:20:07.246735160 +1000] - INFO - import_main_offline - import userRoot: Flushing caches...
[10/Apr/2017:13:20:07.265159211 +1000] - INFO - import_main_offline - import userRoot: Closing files...
[10/Apr/2017:13:20:07.510799223 +1000] - INFO - dblayer_pre_close - All database threads now stopped
[10/Apr/2017:13:20:07.524294478 +1000] - INFO - import_main_offline - import userRoot: Import complete. Processed 9 entries in 1 seconds. (9.00 entries/sec)
+[10/Apr/2017:13:20:09.122785218 +1000] - NOTICE - ldbm_back_start - found 262144k physical memory
+[10/Apr/2017:13:20:09.133959896 +1000] - NOTICE - ldbm_back_start - found 64197k avaliable
+[10/Apr/2017:13:20:09.150539284 +1000] - NOTICE - ldbm_back_start - cache autosizing: db cache: 10485k
+[10/Apr/2017:13:20:09.169102140 +1000] - NOTICE - ldbm_back_start - cache autosizing: userRoot entry cache (1 total): 65536k
+[10/Apr/2017:13:20:09.188007215 +1000] - WARN - entrycache_set_max_size - Cachesize (47374540) may use more than the available physical memory.
+[10/Apr/2017:13:20:09.202063252 +1000] - NOTICE - ldbm_back_start - total cache size: 92476014 B;
+[10/Apr/2017:13:20:09.213497617 +1000] - WARN - ldbm_back_start - It is highly likely your memory configuration of all backends will EXCEED your systems memory.
+[10/Apr/2017:13:20:09.228089373 +1000] - WARN - ldbm_back_start - In a future release this WILL prevent server start up. You MUST alter your configuration.
+[10/Apr/2017:13:20:09.246454204 +1000] - WARN - ldbm_back_start - Total entry cache size: 49303756 B; dbcache size: 8589934 B; available memory size: 65738342 B;
+[10/Apr/2017:13:20:09.264939787 +1000] - WARN - ldbm_back_start - This can be corrected by altering the values of nsslapd-dbcachesize, nsslapd-cachememsize and nsslapd-dncachememsize
+
+[10/Apr/2017:13:20:09.280016381 +1000] - WARN - dblayer_start - Likely CONFIGURATION ERROR -dbcachesize is configured to use more than the available physical memory, decreased to the largest available size (8589934 bytes).
+[10/Apr/2017:13:20:09.297357115 +1000] - INFO - dblayer_start - Resizing db cache size: 2428085862 -> 8589934
+[10/Apr/2017:13:20:10.573234982 +1000] - INFO - slapd_daemon - slapd started. Listening on All Interfaces port 389 for LDAP requests
However, it looks like the server starts, with all entries added:
The only bit I'm concerned about here is:
[10/Apr/2017:13:20:05.856337382 +1000] - WARN - dblayer_start - Likely CONFIGURATION ERROR -dbcachesize is configured to use more than the available physical memory, decreased to the largest available size (0 bytes).
This would be during ldif2db in setup-ds.pl. However, the server still starts correctly.
I'll make a patch to prevent 0 from being set on these values, but this may not be the cause of the issue.
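A minimal sketch of such a lower-bound clamp (the constant, floor value, and function name here are illustrative assumptions, not the code from the attached patch):

```c
#include <stdint.h>

/* Never let an autotuned cache size drop to 0: clamp to a small but
 * usable floor so imports and startup can still proceed under memory
 * pressure. The 512 KB floor is an illustrative value. */
#define CACHE_FLOOR_BYTES ((uint64_t)512 * 1024)

uint64_t
clamp_cache_size(uint64_t computed_bytes)
{
    if (computed_bytes < CACHE_FLOOR_BYTES) {
        return CACHE_FLOOR_BYTES;
    }
    return computed_bytes;
}
```

The point is that "decreased to the largest available size (0 bytes)" becomes impossible: any autosizing path that computes 0 is forced up to the floor instead.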
[10/Apr/2017:13:38:44.393226936 +1000] - INFO - check_and_set_import_cache - Import allocates 6211KB import cache.
<img alt="0001-Ticket-49204-Fix-lower-bounds-on-import-autosize.patch" src="/389-ds-base/issue/raw/files/439afb815888f065e5ad50db3918444da66758a6b85338265d8db85f115f9fce-0001-Ticket-49204-Fix-lower-bounds-on-import-autosize.patch" />
Metadata Update from @firstyear: - Custom field reviewstatus adjusted to review
Metadata Update from @mreynolds: - Custom field reviewstatus adjusted to ack (was: review)
The patch looks good, but note that it does not fully fix the problem. It keeps occurring:
[11/Apr/2017:02:57:59.416666948 +051800] - ERR - _entryrdn_insert_key - Suffix "dc=testrelm,dc=test" not found: BDB0073 DB_NOTFOUND: No matching key/data pair found(-30988)
[11/Apr/2017:02:57:59.492313735 +051800] - ERR - index_addordel_entry - database index operation failed BAD 1031, err=-30988 BDB0073 DB_NOTFOUND: No matching key/data pair found
The root cause is possibly identified: if at startup memory pressure prevents the setting of the entry cache size, the entry cache remains NULL.
Running long-duration tests (the problem used to occur in 50%-75% of runs) with the following patch.
<img alt="0001-Ticket-49204-On-small-VM-autotune-breaks-the-access-.patch" src="/389-ds-base/issue/raw/files/6f730599b09c6e098488cb1601550e7160e2918deafcc482c74323a86ab89b2c-0001-Ticket-49204-On-small-VM-autotune-breaks-the-access-.patch" />
This is likely NOT the definitive fix
So this fixes the problem you have seen @tbordaz ?
I would be curious to see the value of delta in this log output. I think the idea of this change is good, but the catch is that this is the path where the user sets cachememsize via cn=config. So we check the delta of val - current size to see if we can accommodate the difference, because we already consume current_size in memory (*mi). So really, delta is valid if it's 0, because you aren't changing the size at all, and for the sanity check to be carried out, delta must be a positive value relative to the current size.
Second, this code is only called from a change to cn=config, so I'm still not 100% sure it's the fix.
Perhaps the real fix is that util_is_cachesize_sane shouldn't be reducing the size of the allocation so much, and we should check its output. I.e. if the request is, say, 1 MB, we always allow it, and if it's greater than that, we have enough room to actually do a reduction. We should also check that the result is never 0.
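The behaviour suggested above could look roughly like this (a sketch under those assumptions only; the real util_is_cachesize_sane signature and the 1 MB threshold are illustrative):

```c
#include <stdint.h>

/* Requests at or below this size are always allowed as-is. */
#define SMALL_REQUEST_BYTES ((uint64_t)1024 * 1024) /* illustrative 1 MB */

/* Sanity-check a requested cache size against available memory.
 * Small requests pass through untouched; oversized requests are
 * reduced to what is available, but never all the way down to 0. */
uint64_t
sane_cachesize(uint64_t requested, uint64_t available)
{
    if (requested <= SMALL_REQUEST_BYTES) {
        return requested; /* tiny request: always allow it */
    }
    if (requested > available) {
        uint64_t reduced = available;
        if (reduced < SMALL_REQUEST_BYTES) {
            reduced = SMALL_REQUEST_BYTES; /* never return 0 */
        }
        return reduced;
    }
    return requested;
}
```

Under this policy, the pathological "decreased to the largest available size (0 bytes)" case collapses to the 1 MB floor instead of disabling the cache entirely.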
Hey @tbordaz. I've had a look at your fix and tweaked it a bit. I squashed all three together into this single patch, so I hope it helps and fixes the issue for you.
<img alt="0001-Ticket-49204-Fix-lower-bounds-on-import-autosize-On-.patch" src="/389-ds-base/issue/raw/files/754f2ef07452640a3b5af46a9ae95ec07537d03e62b9f4937b371b79a9753dcf-0001-Ticket-49204-Fix-lower-bounds-on-import-autosize-On-.patch" />
Long duration tests completed successfully with https://pagure.io/389-ds-base/issue/raw/files/6f730599b09c6e098488cb1601550e7160e2918deafcc482c74323a86ab89b2c-0001-Ticket-49204-On-small-VM-autotune-breaks-the-access-.patch
I will check whether the error comes from the entry cache being zeroed or uninitialized.
The entrycache.size is not uninitialized; it can be 0 or some specific value (like 512K). So I think the error comes from the failure (LDAP_UNWILLING_TO_PERFORM) when parsing dse.ldif. The failure itself is triggered by memory pressure, but I do not have a clear understanding of how the failure (unwilling) can lead to this weird error (suffix entries not readable).
Testing https://pagure.io/389-ds-base/issue/raw/files/754f2ef07452640a3b5af46a9ae95ec07537d03e62b9f4937b371b79a9753dcf-0001-Ticket-49204-Fix-lower-bounds-on-import-autosize-On-.patch
Tests are inconclusive: the patch was successful for the first 18 runs, but after that DS crashed systematically for the remaining runs. Unfortunately ABRT did not dump the core because of the memory pressure.
In conclusion:
- The patch looks good to fix the current ticket, given the 18 successes in a row.
- The patch may introduce another problem, or we may have an environment issue (memory pressure too high). Note that the available memory is 1 GB + 1.5 GB swap.
- I will rerun the tests.
How about we take the patch as it currently is, and I try running DS in a container with, say, 64 MB of RAM available? That should provide a better environment for me to reproduce in.
The failures were triggered by IPC resources being exhausted by Apache. With regular cleanup, the tests (first 25 runs) were successful. IMHO the fix is ok. ACK
commit ae44ed4

To ssh://git@pagure.io/389-ds-base.git
   8979cc6..ae44ed4  master -> master
Metadata Update from @firstyear: - Issue close_status updated to: fixed - Issue status updated to: Closed (was: Open)
Fix conflicting types in header file:
ae44ed4..553d7dd master -> master
<img alt="0001-Issue-49204-Fix-32bit-arch-build-failures.patch" src="/389-ds-base/issue/raw/files/ff04aeb6f4055e59e60e58925e67f68b9f1dd9cbfef7a32fecb917e5407d1d92-0001-Issue-49204-Fix-32bit-arch-build-failures.patch" />
Metadata Update from @mreynolds: - Custom field reviewstatus adjusted to review (was: ack)
Metadata Update from @firstyear: - Custom field reviewstatus adjusted to ack (was: review)
Thanks William!
553d7dd..86931be master -> master
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/2263
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: fixed)