#50599 Remove DB env files prior to startup
Closed: fixed 5 months ago by mreynolds. Opened 9 months ago by mreynolds.

Issue Description

We occasionally see problems where the server fails to start, and it appears to be an issue with the __db files. Removing these files prior to startup fixes the problem, so maybe we should always remove them when starting the server and allow libdb to recreate them.


Metadata Update from @mreynolds:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None
- Issue set to the milestone: FUTURE

8 months ago

This might be popping up again, I think we should revisit the priority on this one...

See: https://bugzilla.redhat.com/show_bug.cgi?id=1787921

Metadata Update from @mreynolds:
- Issue set to the milestone: 0.0 NEEDS_TRIAGE (was: FUTURE)

5 months ago

@lkrispen What do you think here?

I'd want to know what the __db files do and their function, just removing them scares me a bit :|

The __db files contain the database environment for BDB (dbcache, locks, ...) and are mmapped into the slapd process.
They are files in the filesystem because in theory they could be shared by several processes. I think we have excluded almost all combinations of slapd processes running in parallel; if I am right, slapd in normal mode and offline ldif export are the only ones left (but we would need to check). In that parallel mode it would not be nice to erase the files.
Also, the check whether the process is allowed to continue is done in the startup itself, by checking whether there is a locked pid file, so just removing the files before startup is also a bad idea.

Another reason not to remove them is that startup could become a bit slower: if a large dbcache has to be rebuilt, bdb will create the file, mmap it, and then write to each page to ensure it is really allocated.
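To make the startup cost concrete, here is a rough sketch of the "create the file, mmap it, write to each page" pattern. It is not libdb code; the function name create_and_touch_region() and the one-byte-per-page write are assumptions made only for illustration.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, NOT taken from libdb: create `path` with `size`
 * bytes, map it shared, and write one byte per page so that every page
 * is really allocated. Returns 0 on success, -1 on error. */
int
create_and_touch_region(const char *path, size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    char *base;
    size_t off;
    int fd;

    if (page <= 0 || size == 0) {
        return -1;
    }
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED) {
        return -1;
    }
    /* One write per page: the work grows linearly with the region size. */
    for (off = 0; off < size; off += (size_t)page) {
        base[off] = 0;
    }
    munmap(base, size);
    return 0;
}
```

The loop is the part that would be repeated on every start if the files were always removed, which is why a large dbcache makes the difference noticeable.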

If we want to move in the direction of removing the __db files before startup, we should clarify/ensure the following:
- which slapd modes are allowed to run in parallel?
- can we detect this before actually starting the slapd process?
- do we really want to allow different slapd modes to run in parallel?

If we always have only one process running, we could also get rid of the __db files completely. bdb lets us use nsslapd-private-mem; the db environment is then allocated only on the heap, without any backing file in the filesystem.
But we would lose diagnostic options, e.g. investigating deadlocks by running "db_stat -C A -h ....", since that would require reproducing the problem with the __db files in use.

Using nsslapd-private-mem doesn't seem like a good option. In fact there doesn't seem to be a good option at all, but we keep seeing customer bugs being opened because these mmapped files are somehow getting corrupted, and it's not clear that they need to be removed in order for the server to start up after a crash.

The problem is that it's hard to detect inside our code when we should remove these files. In the bug mentioned in a comment above the server is crashing and there is no way to catch it in our code. So there is no way to remove the files or even log a message that removing the files might resolve the startup issue.

If we could get a reproducer for these types of issues then maybe we could find a way to detect it and then do the cleanup, although it might not be possible in all cases.

I'm just trying to prevent further support cases from being opened over something I was hoping we could automatically prevent. Maybe for now we just need more encompassing knowledge base articles saying if the server fails to start then removing these files might help...

If you do the same checks as add_new_slapd_process() does to see whether a process is already running, I think you can safely remove the files if there is no process. Of course there could be some race conditions when starting two slapds in parallel.

Maybe we could do the file removal if we see both: disorderly shutdown, and the add_new_slapd_process() check?
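As a rough sketch of that combined condition: the real add_new_slapd_process() logic is not reproduced here, so process_is_alive() below is only a stand-in that probes a known pid with signal 0, and both function names are made up for illustration.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/* Stand-in for the "is another slapd running?" check. This is an
 * assumption for illustration, not what add_new_slapd_process() does. */
bool
process_is_alive(pid_t pid)
{
    if (pid <= 0) {
        return false; /* no pid recorded */
    }
    if (kill(pid, 0) == 0) {
        return true; /* signal 0 only tests for existence */
    }
    return errno == EPERM; /* exists, but owned by another user */
}

/* Hypothetical guard: remove the env files only after a disorderly
 * shutdown AND when no other slapd process appears to be running. */
bool
should_remove_env_files(bool disorderly_shutdown, pid_t other_slapd_pid)
{
    return disorderly_shutdown && !process_is_alive(other_slapd_pid);
}
```

This does not close the race mentioned above: a second slapd could still start between the check and the removal.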

Ah, you want to do it inside slapd; I thought you wanted to do it before, in start-dirsrv or dsadm.

We need to check that we do not try to read the env before checking for a disorderly shutdown. Also, the check for another process is done in main, so by the time we reach the check for a disorderly shutdown we could have another process running.

Let's do some more investigation on possible paths

Looking at bdb_start() we do detect the disorderly shutdown before we call the crashing function:

return_value = (pEnv->bdb_DB_ENV->open)(

I also noticed there are other conditions where we call:

return_value = thisenv->remove(thisenv, region_dir, DB_FORCE);

What exactly is this removing? Would this be the same call we would make if we wanted to remove the __db files?

It will remove the __db files, but they will also be removed if you call dbenv->open with the DB_RECOVER_FATAL flag set.

All the modes and flags are not very clear, to say the least

It still fails to remove them internally and we still crash inside libdb open(). So we have to explicitly delete the three __db* files. I have a PR coming, and it has passed ASAN tests.

Metadata Update from @mreynolds:
- Issue assigned to mreynolds

5 months ago

https://pagure.io/389-ds-base/pull-request/50811

Here is an initial fix. With this fix I don't think we need to do any checks like what we have in add_new_slapd_process().

This does pass ASAN tests and the database can be queried and updated after the recovery.

Please review...

Commit fa1f69a relates to this ticket

22d53d0..f09adba 389-ds-base-1.4.2 -> 389-ds-base-1.4.2

696c0f3..9f3f5d5 389-ds-base-1.4.1 -> 389-ds-base-1.4.1

e415129..73928c4 389-ds-base-1.3.10 -> 389-ds-base-1.3.10

Metadata Update from @mreynolds:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

5 months ago

Metadata Update from @mreynolds:
- Issue status updated to: Open (was: Closed)

5 months ago

Commit cf849cc relates to this ticket

2a1a0d7..b747be0 389-ds-base-1.4.2 -> 389-ds-base-1.4.2

07c4799..44d9b73 389-ds-base-1.4.1 -> 389-ds-base-1.4.1

73928c4..89422c6 389-ds-base-1.3.10 -> 389-ds-base-1.3.10

Metadata Update from @mreynolds:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

5 months ago

Metadata Update from @vashirov:
- Issue set to the milestone: None (was: 0.0 NEEDS_TRIAGE)

4 months ago
