#50664 DS Can Fail to Run Recovery if Directories with No .db Files Exist
Closed: fixed 2 months ago by firstyear. Opened 4 months ago by telackey.

Issue Description

The DS can fail to run recovery, even when it is needed, if any directory containing no *.db files is located at the same level as the backend directories.

This is actually quite easy to trigger, since db2ldif, for example, creates a directory called ldif by default.

For example:

$ ls -l /var/lib/dirsrv/slapd-test9/
total 49728
drwxr-xr-x. 2 nobody nobody     4096 Oct 21 19:59 changelogdb
-rw-------. 1 nobody nobody  249368 Oct 21 19:59 __db.001
-rw-------. 1 nobody nobody 2945240 Oct 21 19:59 __db.002
-rw-------. 1 nobody nobody 1267488 Oct 21 19:59 __db.003
-rw-------. 1 nobody nobody       51 Oct 17 06:12 DBVERSION
-rw-------. 1 nobody nobody       52 Oct 21 23:07 guardian
drwxrwx---. 2 nobody nobody     4096 Oct 17 03:03 ldif # <-----  THIS CAN BREAK IT
-rw-------. 1 nobody nobody 1045760 Oct 21 19:59 log.0000000002
drwx------. 2 nobody nobody     4096 Oct 21 23:07 userRoot

Package Version and Platform

1.3.6 on CentOS 7

Steps to reproduce

  1. Set up the DS.
  2. Create some empty folders alongside userRoot.
  3. kill -9 the DS process.
  4. Restart the DS process, noticing that even though recovery should be run, it isn't.

Actual results

Recovery is not run.

Expected results

Recovery is run.

Why

There is a bug in count_dbfiles_in_dir. If recurse is false, it resets the counter to 0. However, when it calls itself it toggles recurse to false so that it only descends one directory deep - which means every recursive call into a subdirectory resets the counter.

The effect is that if the last directory it consults is empty of *.db files, the final count will be 0, no matter how many *.db files it had encountered and counted up to that point.


ds_recovery.patch

The patch just removes this quirky behavior. The function is called in exactly one place in the DS, and that call site already does the right thing and sets the value to zero before passing in the param.

It seems better to document the proper way to call this function than to keep the strange behavior of sometimes resetting the param and sometimes not.

Thanks for the patch Telackey! This seems like a really simple fix. It would be great if there were some tests for this, but I'm happy if you would rather we write those up.

@mreynolds What do you think of this? I think you know the BE code better than I do.

Metadata Update from @firstyear:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None

4 months ago

Yes, I'm afraid I don't have the opportunity to write up any automated tests. For manual testing, I just did as indicated above:

Before:

  1. Make some empty folders.
  2. kill -9 the server
  3. Look to see if it runs recovery on DS startup (it won't).

After:

  1. kill -9 the server again.
  2. Look to see if it runs recovery on DS startup (it should).

This can actually be pretty serious, since if the process exits unexpectedly it can leak locks. These should be cleaned up by recovery, but if they aren't, you can get strange deadlocks (e.g., https://pagure.io/389-ds-base/issue/49824).

I also saw it recently on a server where a hardware failure led to some corruption. The DS started without running recovery but tried to write incoming replicated changes to the DB anyway, and a series of very unfortunate events followed, as you can imagine.

All good. I'll write up a test for this this morning then to reproduce it.

Okay, so I've written a test up - it looks like it meets the required conditions (guardian file present, directory before userRoot, kill -9), but I'm not able to reproduce the issue - I'm wondering if there is something I have missed in the setup or the reproduction environment.

I've looked in gdb, and I'm seeing the following steps taken in the read_metadata function:

(gdb) next
3924        PRInt32 byte_count = 0;
(gdb)
3927        int number = 0;
(gdb)
3928        bdb_config *conf = (bdb_config *)li->li_dblayer_config;
(gdb)
3929        dblayer_private *priv = li->li_dblayer_private;
(gdb)
3935        conf->bdb_previous_cachesize = 0;
(gdb)
3936        conf->bdb_previous_ncache = 0;
(gdb)
3937        conf->bdb_previous_lock_config = 0;
(gdb)
3939        PR_snprintf(filename, sizeof(filename), "%s/guardian", conf->bdb_home_directory);
(gdb)
3941        memset(&prfinfo, '\0', sizeof(PRFileInfo64));
(gdb)
3942        (void)PR_GetFileInfo64(filename, &prfinfo);
(gdb)
3944        prfd = PR_Open(filename, PR_RDONLY, priv->dblayer_file_mode);
(gdb)
3945        if (NULL == prfd || 0 == prfinfo.size) {
(gdb)
3947            int count = 0;
(gdb)
3948            for (dirp = conf->bdb_data_directories; dirp && *dirp; dirp++) {
(gdb)
3949                count_dbfiles_in_dir(*dirp, &count, 1 /* recurse */);
(gdb)
3950                if (count > 0) {
(gdb)
3951                    conf->bdb_recovery_required = 1;
(gdb)
3952                    return 0;
(gdb)
3915    {

Keep in mind, I am testing on git master (so 1.4.2), not 1.3.6.

My test can be found: https://pagure.io/389-ds-base/pull-request/50666

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4.2

4 months ago

For some reason I didn't see an e-mail notice, so I just saw your reply.

Can you add some debug output to count_dbfiles_in_dir? The key to making it reproduce is that the last directory it examines needs to be one without *.db files in it.

Without additional output, it is hard to see whether that is what's happening. At least in my case this was fairly easy to trigger, but I'm not sure what the actual ordering characteristics of count_dbfiles_in_dir are, so it is probably easiest to have it print what it is examining and its running count at each iteration.

@telackey Yep, I'll try that, it's a good idea. I'll review this again soon.

Right, so I understand why I can't reproduce this now. PR_OpenDir/PR_ReadDir is a wrapper around the OS's opendir, and opendir makes "no guarantees about the order in which files are checked" - which in practice means something like "sorted by reverse inode order in the dirent". Because I can't write a test that can 100% guarantee inode ordering, or how the dirent is constructed on different filesystems, I think it's not worth attempting.

I can however "see" the bug on a cursory read of the code. I'll apply your patch and send it as a PR today in this case.

Metadata Update from @firstyear:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

2 months ago

Thanks for the report and patch @telackey, and again sorry it took so long to resolve.


Metadata
Attachments: 1 (attached 4 months ago)