The DS can fail to run recovery, even when it is needed, if any directory containing non-DB files is located at the same level of the backend directories.
This is actually quite easy to make happen, since by default db2ldif for example, will create a directory called ldif.
For example:
```
$ ls -l /var/lib/dirsrv/slapd-test9/
total 49728
drwxr-xr-x. 2 nobody nobody    4096 Oct 21 19:59 changelogdb
-rw-------. 1 nobody nobody  249368 Oct 21 19:59 __db.001
-rw-------. 1 nobody nobody 2945240 Oct 21 19:59 __db.002
-rw-------. 1 nobody nobody 1267488 Oct 21 19:59 __db.003
-rw-------. 1 nobody nobody      51 Oct 17 06:12 DBVERSION
-rw-------. 1 nobody nobody      52 Oct 21 23:07 guardian
drwxrwx---. 2 nobody nobody    4096 Oct 17 03:03 ldif   # <----- THIS CAN BREAK IT
-rw-------. 1 nobody nobody 1045760 Oct 21 19:59 log.0000000002
drwx------. 2 nobody nobody    4096 Oct 21 23:07 userRoot
```
Tested on 1.3.6 on CentOS 7, with a non-DB directory (the `ldif` directory above) sitting alongside `userRoot`:

1. `kill -9` the slapd process.
2. Restart the DS.

Actual result: Recovery is not run.
Expected result: Recovery is run.
There is a bug in `count_dbfiles_in_dir`. If `recurse` is false, it resets the counter to 0. However, it always passes `recurse` as false when it calls itself, so that it only descends one directory deep.
The effect is that if the last directory it consults contains no `*.db` files, the final count will be 0, no matter how many `*.db` files it had encountered and counted up to that point.
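To make the failure mode concrete, here is a minimal Python simulation of the buggy logic (the real function is C inside the back-ldbm layer; the names and structure here are a loose sketch, not the actual 389-ds-base source):

```python
import os
import tempfile

def count_dbfiles_in_dir(directory, count, recurse):
    # Loose Python sketch of the buggy C logic (not the actual
    # 389-ds-base source). The quirk: when recurse is false the
    # counter is reset to 0 -- and recursive calls always pass
    # recurse=False, so each subdirectory visit clobbers the total.
    if not recurse:
        count = 0
    for entry in sorted(os.listdir(directory)):  # sorted for a deterministic demo
        path = os.path.join(directory, entry)
        if os.path.isdir(path):
            if recurse:
                count = count_dbfiles_in_dir(path, count, False)
        elif entry.endswith(".db"):
            count += 1
    return count

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "a_userRoot"))
    open(os.path.join(root, "a_userRoot", "id2entry.db"), "w").close()
    os.mkdir(os.path.join(root, "z_empty"))  # no *.db files, visited last
    print(count_dbfiles_in_dir(root, 0, True))  # prints 0 despite one .db file
```

If the empty directory happens to sort first instead of last, the same call returns 1, which is why the bug only bites for some directory layouts.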
[ds_recovery.patch](/389-ds-base/issue/raw/files/d7a9777b3f983dd9fda1a8db4819d4ab20d2a82f0b1ff5661bc2bb4af3726bb2-ds_recovery.patch)
The patch just removes this quirky behavior. The function is called in exactly one place in the DS, and that call site already does the right thing and sets the value to zero before passing in the parameter.
It seems better to document the proper way to call this function than to keep the strange behavior of sometimes resetting the parameter and sometimes not.
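In sketch form, the fix amounts to deleting the reset, so the counter only ever accumulates and the caller is responsible for initializing it (again a hypothetical Python model of the patched logic, not the actual C):

```python
import os
import tempfile

def count_dbfiles_in_dir_fixed(directory, count, recurse):
    # Same sketch with the reset removed: the function only ever adds
    # to count, and the single call site is responsible for passing in
    # a counter initialized to zero.
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if os.path.isdir(path):
            if recurse:
                count = count_dbfiles_in_dir_fixed(path, count, False)
        elif entry.endswith(".db"):
            count += 1
    return count

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "a_userRoot"))
    open(os.path.join(root, "a_userRoot", "id2entry.db"), "w").close()
    os.mkdir(os.path.join(root, "z_empty"))  # a trailing empty directory no longer matters
    print(count_dbfiles_in_dir_fixed(root, 0, True))  # prints 1
```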
Thanks for the patch, Telackey! This seems like a really simple fix. It would be great if there were some tests for this, but I'm happy to write those up ourselves if you would rather.
@mreynolds What do you think of this? I think you know the BE code better than I do.
Metadata Update from @firstyear: - Custom field origin adjusted to None - Custom field reviewstatus adjusted to None
Yes, I'm afraid I don't have the opportunity to write up any automated tests. For manual testing, I just did as indicated above:
Before:
After:

1. `kill -9` the server again.
2. Look to see if it runs recovery on DS startup (it should).
This can actually be pretty serious: if the process exits unexpectedly it can leak locks. These should be cleaned up by recovery, but if they aren't you can get strange deadlocks (e.g., https://pagure.io/389-ds-base/issue/49824).
I also saw it recently on a server where a hardware failure led to some corruption. The DS started without running recovery but still tried to write incoming replicated changes to the DB, and, as you can imagine, a series of very unfortunate events followed.
All good. I'll write up a test for this this morning then to reproduce it.
Okay, so I've written up a test. It looks like it creates the conditions required (guardian file present, directory before userroot, kill -9), but I'm not able to reproduce the failure; I'm wondering if there is something I have missed in the setup or the reproduction environment.
I've looked in gdb, and I see the following steps taken in the `read_metadata` function:
```
(gdb) next
3924            PRInt32 byte_count = 0;
(gdb)
3927            int number = 0;
(gdb)
3928            bdb_config *conf = (bdb_config *)li->li_dblayer_config;
(gdb)
3929            dblayer_private *priv = li->li_dblayer_private;
(gdb)
3935            conf->bdb_previous_cachesize = 0;
(gdb)
3936            conf->bdb_previous_ncache = 0;
(gdb)
3937            conf->bdb_previous_lock_config = 0;
(gdb)
3939            PR_snprintf(filename, sizeof(filename), "%s/guardian", conf->bdb_home_directory);
(gdb)
3941            memset(&prfinfo, '\0', sizeof(PRFileInfo64));
(gdb)
3942            (void)PR_GetFileInfo64(filename, &prfinfo);
(gdb)
3944            prfd = PR_Open(filename, PR_RDONLY, priv->dblayer_file_mode);
(gdb)
3945            if (NULL == prfd || 0 == prfinfo.size) {
(gdb)
3947            int count = 0;
(gdb)
3948            for (dirp = conf->bdb_data_directories; dirp && *dirp; dirp++) {
(gdb)
3949                count_dbfiles_in_dir(*dirp, &count, 1 /* recurse */);
(gdb)
3950                if (count > 0) {
(gdb)
3951                    conf->bdb_recovery_required = 1;
(gdb)
3952                    return 0;
(gdb)
3915        {
```
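Paraphrased in Python, the decision visible in that trace is roughly: if the guardian file is missing or empty (unclean shutdown), recovery is required as soon as any `*.db` files are counted under the data directories. The function and parameter names below are illustrative, not the C source, and the C code's pointer-style counter becomes a return value here:

```python
import os

def recovery_required(home_dir, data_directories, count_dbfiles_in_dir):
    # Guardian file present and non-empty => clean shutdown, no recovery.
    guardian = os.path.join(home_dir, "guardian")
    try:
        clean = os.path.getsize(guardian) > 0
    except OSError:
        clean = False
    if clean:
        return False
    # Unclean shutdown: require recovery if any *.db files are found.
    # As in the trace, the counter is initialized once, outside the loop.
    count = 0
    for d in data_directories:
        count = count_dbfiles_in_dir(d, count, True)
        if count > 0:
            return True
    return False
```

With the buggy counter described above, a trailing empty directory can leave `count` at 0 after the call, so this loop can skip recovery even though `*.db` files exist.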
Keep in mind, I am testing on git master (so 1.4.2), not 1.3.6.
My test can be found: https://pagure.io/389-ds-base/pull-request/50666
Metadata Update from @mreynolds: - Issue set to the milestone: 1.4.2
For some reason I didn't see an e-mail notice, so I just saw your reply.
Can you add some debug output to `count_dbfiles_in_dir`? The key to making it reproduce is that the last directory it examines needs to be one without `*.db` files in it.
Without additional output, it is hard to tell whether that is what's happening. At least in my case this was fairly easy to trigger, but I'm not sure what the actual ordering characteristics are for `count_dbfiles_in_dir`, so it is probably easiest to have it print what it is examining and its running count at each iteration.
@telackey Yep, I'll try that, it's a good idea. I'll review this again soon.
Right, so I understand why I can't reproduce this now. `PR_OpenDir`/`PR_ReadDir` are wrappers over the OS's opendir/readdir, and opendir makes "no guarantees about the order in which files are checked", which in practice really means "it's sorted by reverse inode order in the dirent". Because I can't write a test that can 100% guarantee inode ordering, or how the dirent is constructed on different filesystems, I think it's not worth attempting.
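The ordering problem is easy to see from Python too: `os.scandir` surfaces raw readdir order, so only the set of entries is stable, not their sequence (a small illustration, nothing 389-ds specific):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    for name in ("userRoot", "ldif", "changelogdb"):
        os.mkdir(os.path.join(root, name))
    # The order of this list is filesystem-dependent and unspecified,
    # which is exactly why a portable test cannot force the "empty
    # directory visited last" condition.
    entries = [e.name for e in os.scandir(root)]
    print(sorted(entries))  # -> ['changelogdb', 'ldif', 'userRoot']
```

Sorting is the only portable way to compare the result; asserting on the raw iteration order would make the test flaky across filesystems.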
I can however "see" the bug on a cursory read of the code. I'll apply your patch and send it as a PR today in this case.
https://pagure.io/389-ds-base/pull-request/50750
Metadata Update from @firstyear: - Issue close_status updated to: fixed - Issue status updated to: Closed (was: Open)
Thanks for the report and patch @telackey, and again sorry it took so long to resolve.
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/3719
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: fixed)