In edge cases (sorry I do not have real way of repeating it, it happens on some of our VMs but not other) dse.ldif is truncated to 0 on reboot. It happens only on reboot and not when just stopping service via systemctl. Now here is the interesting part: it does not happen on ext3. I can take machine with that problem, mount tiny ext3 partition in 389 config dir and problem is fixed.
Adding /bin/sync to systemd's ExecStopPost also fixes that problem
/bin/sync
ExecStopPost
I believe it is related to http://www.spinics.net/lists/xfs/msg27497.html There is no fsync call on directory after renaming files.
from man fsync:
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
Debian 9 (1.3.5.17-2) with stable kernel (4.9+80) XFS
Hmmm, perhaps the issue isn't a sync issue, it's that DS isn't stopping in time before the systemd timeout kills it which cuold cause the issue.
Would you mind setting:
[Service] TimeoutStopSec=600
In your .service file? You may adjust the number up as needed. I'm not sure if there is a better solution today than this.
Metadata Update from @firstyear: - Custom field type adjusted to defect
I've already checked that. And it shuts down in like a second anyway it's a small database
I also added ps aux in ExecStopPost and verified that daemon is actually down.
ps aux
Perhaps what we need is to call sync during server shutdown instead of writing it into the systemd unit file, because this could affect non-systemd systems also.
Metadata Update from @firstyear: - Issue assigned to firstyear
the issue of 0 byte dse.ldif has been reported several times (eg #49131) and we tried to do the best on restart to recover from an alternative. but this sounds promising to avoid the scenario.
Well it is in the doc of fsync on how exactly avoid it, just fsync the directory after doing the rename. sync() would be fine too but wouldn't that force system-wide sync which might not be desirable if user have something else running on the machine ?
Curiously, ext3/4 have mount option to do fsync automatically after rename:
dirsync All directory updates within the filesystem should be done synchronously. This affects the following system calls: creat, link, unlink, symlink, mkdir, rmdir, mknod and rename.
and it seems to be on by default, I couldn't repeat the issue on ext3 filesystem
<img alt="0001-Ticket-49298-force-sync-on-shutdown.patch" src="/389-ds-base/issue/raw/eab7271478d708b1919b3375668d27fccd9e9682ba119a2e55ec9ee4ffe1c079-0001-Ticket-49298-force-sync-on-shutdown.patch" /> This should do the trick then.
Metadata Update from @firstyear: - Custom field reviewstatus adjusted to review
Metadata Update from @lkrispen: - Custom field reviewstatus adjusted to ack (was: review)
That would certainly fix that particular issue, but I have my doubts what will happen if system was shut down uncleanly during. It should fsync after close of file (from man 2 close)
man 2 close
A successful close does not guarantee that the data has been successfully saved to disk, as the kernel uses the buffer cache to defer writes. Typically, filesystems do not flush buffers when a file is closed. If you need to be sure that the data is physically stored on the underlying disk, use fsync(2).
and then on directory itself (man 2 fsync)
Some filesystems like ext3/4 "do that for user" when rename() is called but that is not the case in every filesystem
We don't need to call fsync on the file, we need to call it on the dir. I think that we need to do this in slapi_rename_destructive.
I'll ammend the patch then
<img alt="0003-Ticket-49298-force-sync-on-shutdown.patch" src="/389-ds-base/issue/raw/2a68902f4df9f19ef673fb54ae32b4b8ce1ea69d3adc58902ac018c4c49352ba-0003-Ticket-49298-force-sync-on-shutdown.patch" />
This might be a better patch then. It calls fsync() on the configdir after rename.
Metadata Update from @mreynolds: - Issue set to the milestone: 1.3.7.0
ack
commit 7efc2d2 To ssh://git@pagure.io/389-ds-base.git ddf1c6d..7efc2d2 master -> master
Metadata Update from @firstyear: - Issue close_status updated to: fixed - Issue status updated to: Closed (was: Open)
Getting compiler warnings with this fix :-(
[root@localhost BUILD]# make install > /dev/null ../389-ds-base/ldap/servers/slapd/dse.c: In function ‘dse_write_file_nolock’: ../389-ds-base/ldap/servers/slapd/dse.c:1055:5: warning: implicit declaration of function ‘fsync’ [-Wimplicit-function-declaration] fsync(fp_configdir); ^~~~~ ../389-ds-base/ldap/servers/slapd/dse.c:1056:5: warning: implicit declaration of function ‘close’ [-Wimplicit-function-declaration] close(fp_configdir); ^~~~~
Metadata Update from @mreynolds: - Issue status updated to: Open (was: Closed)
Yep I just saw it myself :( I'm writing a follow up.
commit 3abcd68 To ssh://git@pagure.io/389-ds-base.git 06c0a80..3abcd68 master -> master
One line rule
95b39e2..76feb34 389-ds-base-1.3.6 -> 389-ds-base-1.3.6
Metadata Update from @mreynolds: - Custom field component adjusted to None - Custom field origin adjusted to None - Custom field version adjusted to None - Issue set to the milestone: 1.3.6.0 (was: 1.3.7.0)
Reopen to harden missing dse case to prevent segfaults.
Metadata Update from @firstyear: - Issue status updated to: Open (was: Closed)
<img alt="0001-Ticket-49298-Correct-error-codes-with-config-restore.patch" src="/389-ds-base/issue/raw/files/f4fc55687a3f12600d06eacc12b731fcbc7e750b4a77ec101736fa85045f575f-0001-Ticket-49298-Correct-error-codes-with-config-restore.patch" />
Metadata Update from @firstyear: - Custom field reviewstatus adjusted to review (was: ack)
create_test.py will generate : id : for you. Or you can generate with:
python -c 'import uuid; print(uuid.uuid4())'
Could you please fix the docstrings? First, the '1. Create an instance' is already part of the 'setup'. Second, we need to have all steps that are meaningful (like for 'stop the instance' too, etc.). And you test different things, so the first line should be different.
def test_restore_config(topo): """Check that if a dse.ldif is removed, that the server still can start :id: d7e61131-a34e-444d-b065-d537320958fd :setup: Standalone instance :steps: 1. Stop the instance 2. Delete 'dse.ldif' 3. Start the instance :expectedresults: 1. Steps 1 and 2 should be successful 2. Server will succeed to start with restored cfg. """ def test_removed_config(topo): """Check that if a dse.ldif and backup are removed, that the server exits better than "segfault". :id: b52baa92-4590-4b31-a060-b6761471062e :setup: Standalone instance :steps: 1. Stop the instance 2. Delete 'dse.ldif', 'dse.ldif.bak', 'dse.ldif.startOK' 3. Start the instance :expectedresults: 1. Steps 1 and 2 should be successful 2. Server will fail to start, but will not crash. """
<img alt="0001-Ticket-49298-Correct-error-codes-with-config-restore.patch" src="/389-ds-base/issue/raw/files/a311ad333ee45f979238811cf89b23216989dd639523219b22a55f752c541ed5-0001-Ticket-49298-Correct-error-codes-with-config-restore.patch" />
Updated with your comments, the test parts make more sense now thanks :)
Ack from me. Pushing so this can make it into the next rhel 7.5 build
6e794a8..75e55e2 master -> master
d67e8ae..3cd1ab4 389-ds-base-1.3.7 -> 389-ds-base-1.3.7
628a927..8dd1c60 389-ds-base-1.3.6 -> 389-ds-base-1.3.6
COmpiler warning
} else { slapi_log_err(SLAPI_LOG_INFO, "dse_check_file", "The config %s has zero length. Attempting restore ... \n", filename, rc); rc = PR_Delete(filename); }
it should be something like:
"The config %s has zero length. Attempting restore ... (reason: %d) \n", filename, rc);
I'll fix that up with one line :)
commit d40d919 To ssh://git@pagure.io/389-ds-base.git 75e55e2..d40d919 master -> master
commit 0a436b7 To ssh://git@pagure.io/389-ds-base.git 8dd1c60..0a436b7 389-ds-base-1.3.6 -> 389-ds-base-1.3.6
commit 928e5a4 To ssh://git@pagure.io/389-ds-base.git 3cd1ab4..928e5a4 389-ds-base-1.3.7 -> 389-ds-base-1.3.7
BZ opened with doctext.
Test case has an issue in some cases with the perl installer, that it can't remove the instance. Instead of rm dse.ldif we shoule move it. Reopen and I'll fix it asap,
<img alt="0001-Ticket-49298-issue-with-test-case-and-remove-ds.pl.patch" src="/389-ds-base/issue/raw/files/3248c8b3b3b37b4a7bc476f59643e3e05a410f993d672a96411602c1eef88983-0001-Ticket-49298-issue-with-test-case-and-remove-ds.pl.patch" />
@vashirov
Metadata Update from @vashirov: - Custom field reviewstatus adjusted to ack (was: review)
commit 545b627 To ssh://git@pagure.io/389-ds-base.git 4c4d3c1..545b627 master -> master
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/2357
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: fixed)
Login to comment on this ticket.