#49298 dse.ldif is truncated on reboot in edge cases
Closed: fixed 2 years ago Opened 2 years ago by xani.

Issue Description

In edge cases (sorry I do not have real way of repeating it, it happens on some of our VMs but not other) dse.ldif is truncated to 0 on reboot. It happens only on reboot and not when just stopping service via systemctl.
Now here is the interesting part: it does not happen on ext3. I can take machine with that problem, mount tiny ext3 partition in 389 config dir and problem is fixed.

Adding /bin/sync to systemd's ExecStopPost also fixes that problem

I believe it is related to http://www.spinics.net/lists/xfs/msg27497.html
There is no fsync call on directory after renaming files.

from man fsync:

Calling fsync() does not necessarily ensure that the entry in the
directory containing the file has also reached disk. For that an
explicit fsync() on a file descriptor for the directory is also
needed.

Package Version and Platform

Debian 9 (1.3.5.17-2) with stable kernel (4.9+80)
XFS

Steps to reproduce

  1. Install 389 on XFS root
  2. reboot
  3. Be unlucky

Hmmm, perhaps the issue isn't a sync issue, it's that DS isn't stopping in time before the systemd timeout kills it which cuold cause the issue.

Would you mind setting:

[Service]
TimeoutStopSec=600

In your .service file? You may adjust the number up as needed. I'm not sure if there is a better solution today than this.

Metadata Update from @firstyear:
- Custom field type adjusted to defect

2 years ago

I've already checked that. And it shuts down in like a second anyway it's a small database

I also added ps aux in ExecStopPost and verified that daemon is actually down.

Perhaps what we need is to call sync during server shutdown instead of writing it into the systemd unit file, because this could affect non-systemd systems also.

Metadata Update from @firstyear:
- Issue assigned to firstyear

2 years ago

the issue of 0 byte dse.ldif has been reported several times (eg #49131) and we tried to do the best on restart to recover from an alternative. but this sounds promising to avoid the scenario.

Well it is in the doc of fsync on how exactly avoid it, just fsync the directory after doing the rename. sync() would be fine too but wouldn't that force system-wide sync which might not be desirable if user have something else running on the machine ?

Curiously, ext3/4 have mount option to do fsync automatically after rename:

   dirsync
          All directory updates within the filesystem should be done synchronously.  This affects the following
          system calls: creat, link, unlink, symlink, mkdir, rmdir, mknod and rename.

and it seems to be on by default, I couldn't repeat the issue on ext3 filesystem

0001-Ticket-49298-force-sync-on-shutdown.patch
This should do the trick then.

Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to review

2 years ago

Metadata Update from @lkrispen:
- Custom field reviewstatus adjusted to ack (was: review)

2 years ago

That would certainly fix that particular issue, but I have my doubts what will happen if system was shut down uncleanly during. It should fsync after close of file (from man 2 close)

   A  successful  close does not guarantee that the data has been successfully saved to disk, as the kernel uses the buffer cache to defer writes.  Typically, filesystems do not flush buffers when a
   file is closed.  If you need to be sure that the data is physically stored on the underlying disk, use fsync(2).

and then on directory itself (man 2 fsync)

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk.  For that an explicit fsync() on a file descriptor for the directory is also
   needed.

Some filesystems like ext3/4 "do that for user" when rename() is called but that is not the case in every filesystem

We don't need to call fsync on the file, we need to call it on the dir. I think that we need to do this in slapi_rename_destructive.

I'll ammend the patch then

0003-Ticket-49298-force-sync-on-shutdown.patch

This might be a better patch then. It calls fsync() on the configdir after rename.

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.3.7.0

2 years ago

commit 7efc2d2
To ssh://git@pagure.io/389-ds-base.git
ddf1c6d..7efc2d2 master -> master

Metadata Update from @firstyear:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Getting compiler warnings with this fix :-(

[root@localhost BUILD]# make install > /dev/null
../389-ds-base/ldap/servers/slapd/dse.c: In function ‘dse_write_file_nolock’:
../389-ds-base/ldap/servers/slapd/dse.c:1055:5: warning: implicit declaration of function ‘fsync’ [-Wimplicit-function-declaration]
     fsync(fp_configdir);
     ^~~~~
../389-ds-base/ldap/servers/slapd/dse.c:1056:5: warning: implicit declaration of function ‘close’ [-Wimplicit-function-declaration]
     close(fp_configdir);
     ^~~~~

Metadata Update from @mreynolds:
- Issue status updated to: Open (was: Closed)

2 years ago

Yep I just saw it myself :( I'm writing a follow up.

commit 3abcd68
To ssh://git@pagure.io/389-ds-base.git
06c0a80..3abcd68 master -> master

One line rule

Metadata Update from @firstyear:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

2 years ago

95b39e2..76feb34 389-ds-base-1.3.6 -> 389-ds-base-1.3.6

Metadata Update from @mreynolds:
- Custom field component adjusted to None
- Custom field origin adjusted to None
- Custom field version adjusted to None
- Issue set to the milestone: 1.3.6.0 (was: 1.3.7.0)

2 years ago

Reopen to harden missing dse case to prevent segfaults.

Metadata Update from @firstyear:
- Issue status updated to: Open (was: Closed)

2 years ago

Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to review (was: ack)

2 years ago

create_test.py will generate : id : for you. Or you can generate with:

python -c 'import uuid; print(uuid.uuid4())'

Could you please fix the docstrings? First, the '1. Create an instance' is already part of the 'setup'.
Second, we need to have all steps that are meaningful (like for 'stop the instance' too, etc.). And you test different things, so the first line should be different.

def test_restore_config(topo):
    """Check that if a dse.ldif is removed, that the server still can start

    :id: d7e61131-a34e-444d-b065-d537320958fd
    :setup: Standalone instance
    :steps:
        1. Stop the instance
        2. Delete 'dse.ldif'
        3. Start the instance
    :expectedresults:
        1. Steps 1 and 2 should be successful
        2. Server will succeed to start with restored cfg.
    """

def test_removed_config(topo):
    """Check that if a dse.ldif and backup are removed, that the server
    exits better than "segfault".

    :id: b52baa92-4590-4b31-a060-b6761471062e
    :setup: Standalone instance
    :steps:
        1. Stop the instance
        2. Delete 'dse.ldif', 'dse.ldif.bak', 'dse.ldif.startOK'
        3. Start the instance
    :expectedresults:
        1. Steps 1 and 2 should be successful
        2. Server will fail to start, but will not crash.
    """

0001-Ticket-49298-Correct-error-codes-with-config-restore.patch

Updated with your comments, the test parts make more sense now thanks :)

Ack from me. Pushing so this can make it into the next rhel 7.5 build

6e794a8..75e55e2 master -> master

d67e8ae..3cd1ab4 389-ds-base-1.3.7 -> 389-ds-base-1.3.7

628a927..8dd1c60 389-ds-base-1.3.6 -> 389-ds-base-1.3.6

COmpiler warning

        } else {
            slapi_log_err(SLAPI_LOG_INFO, "dse_check_file",
                          "The config %s has zero length. Attempting restore ... \n", filename, rc);
            rc = PR_Delete(filename);
        }

it should be something like:

 "The config %s has zero length. Attempting restore ... (reason: %d) \n", filename, rc);

I'll fix that up with one line :)

commit d40d919
To ssh://git@pagure.io/389-ds-base.git
75e55e2..d40d919 master -> master

commit 0a436b7
To ssh://git@pagure.io/389-ds-base.git
8dd1c60..0a436b7 389-ds-base-1.3.6 -> 389-ds-base-1.3.6

commit 928e5a4
To ssh://git@pagure.io/389-ds-base.git
3cd1ab4..928e5a4 389-ds-base-1.3.7 -> 389-ds-base-1.3.7

Metadata Update from @firstyear:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

2 years ago

BZ opened with doctext.

Test case has an issue in some cases with the perl installer, that it can't remove the instance. Instead of rm dse.ldif we shoule move it. Reopen and I'll fix it asap,

Metadata Update from @firstyear:
- Issue status updated to: Open (was: Closed)

2 years ago

Metadata Update from @vashirov:
- Custom field reviewstatus adjusted to ack (was: review)

2 years ago

commit 545b627
To ssh://git@pagure.io/389-ds-base.git
4c4d3c1..545b627 master -> master

Metadata Update from @firstyear:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Login to comment on this ticket.

Metadata