#527 ns-slapd segfaults if it cannot rename the logs
Closed None Opened 7 years ago by nkinder.

https://bugzilla.redhat.com/show_bug.cgi?id=878111 (Red Hat Enterprise Linux 6)

Description of problem: 389-ds segfaults if it cannot rename the logs (due to
wrong permission on the parent directory).

/etc/init.d/dirsrv start PKI-IPA
Starting dirsrv:
ns-slapd: Failed to rename errors log file, Netscape Portable Runtime error
-5966 (Access Denied.). Exiting...
ns-slapd: Failed to rename errors log file, Netscape Portable Runtime error
-5966 (Access Denied.). Exiting...
rsyslogd-2177: imuxsock begins to drop messages from pid 3554 due to
rate-limiting
kernel: ns-slapd[3554]: segfault at 7fffe5205adc ip 0000003aa02725b4 sp
00007fffe5205ac0 error 6 in libslapd.so.0.0.0[3aa0200000+f0000]

Version-Release number of selected component (if applicable):
389-ds-base-1.2.10.2-20.el6_3.x86_64

How reproducible: Always.

Steps to Reproduce:
1. chown root.root /var/log/dirsrv-<instance>
2. try to start slapd.

Actual results: slapd segfaults

Expected results: No segfaults

Additional info:
As I understood, slapd fails & segfault only if it requires log rotation,
otherwise it starts normally (provided the files under
/var/log/dirsrv-<instance> is writable).

I can not reproduce this on any version of 389. Here is what I am doing:

[1] Create a DS instance to run as user "dirman"
[2] Turn on verbose logging to fill up error log (above 1 mb)
[3] Stop DS
[4] Change the max log size to 1 mb - this will force rotation at startup
[5] chown root:root /var/log/dirsrv/slapd-localhost
[6] chown root:root /var/log/dirsrv/slapd-localhost/*
[7] start ds: service dirsrv start
[8] No crash or segfault
[9] Error log is properly rotated, and ownership is restored to "dirman" for all the files under /var/log/dirsrv/slapd-localhost

Reproduced issue. We are getting into an infinite loop which leads to a stack overflow.

When we go to write to the error we see that it needs to be rotated, but since the permission are wrong we fail to rename the file. So we try and log a message that we can not rename the log, but trying to log this new message triggers the log rotation again, which fails, and we just keep looping and overflow the stack.

The fix is to log a message, without doing the rotation check.

git merge ticket527
Updating e3aac66..4e9aab8
Fast-forward
ldap/servers/slapd/log.c | 63 ++++++++++++++++++++++++++++++++++++++-------
1 files changed, 53 insertions(+), 10 deletions(-)

git push origin master
Counting objects: 11, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 1.14 KiB, done.
Total 6 (delta 4), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
e3aac66..4e9aab8 master -> master

git checkout 389-ds-base-1.2.11
Switched to branch '389-ds-base-1.2.11'

git cherry-pick -x master
Finished one cherry-pick.
[389-ds-base-1.2.11 85261ef] Ticket 527 - ns-slapd segfaults if it cannot rename the logs
1 files changed, 53 insertions(+), 10 deletions(-)

git push origin 389-ds-base-1.2.11
Counting objects: 11, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 1.18 KiB, done.
Total 6 (delta 4), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
39b0938..85261ef 389-ds-base-1.2.11 -> 389-ds-base-1.2.11

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.2.11.17

2 years ago

I have several brand new installations that are experiencing this behavior. Using 389-ds-base 1.3.9.1.10.el7. Is there any corrective action associated with this behavior?

I have several brand new installations that are experiencing this behavior. Using 389-ds-base 1.3.9.1.10.el7. Is there any corrective action associated with this behavior?

Are you actually seeing this error: "Failed to rename errors log file" and then the server crashes?

This bug was fixed 6 years ago, and no one has complained about it since. If it's crashing at startup please follow this doc on how to get as stack trace so we can see what is going on:

http://www.port389.org/docs/389ds/FAQ/faq.html#debug-startup-crashes

Thanks!

Metadata Update from @mreynolds:
- Issue close_status updated to: None (was: Fixed)
- Issue set to the milestone: None (was: 1.2.11.17)

2 months ago

I will check logs for that specific error. They're definitely dying at log rotation with "access denied" error message. The only way I can start the server is to delete the error log first, otherwise it just hangs at "starting up".

status-dirsrv shows a number of these errors:

Oct 12 00:00:00 patchqaldap02s.eng.c4e.ops ns-slapd[22846]: Failed to open errors log file /var/log/dirsrv/slapd-patchqaldap02s/errors: error 13 (Permission denied); Exiting...
Oct 12 00:00:00 patchqaldap02s.eng.c4e.ops ns-slapd[22846]: Failed to reopen errors log file, Netscape Portable Runtime error -5966 (Access Denied.)

Thanks for the link. I'll follow it accordingly.

I will check logs for that specific error. They're definitely dying at log rotation with "access denied" error message. The only way I can start the server is to delete the error log first, otherwise it just hangs at "starting up".
status-dirsrv shows a number of these errors:
Oct 12 00:00:00 patchqaldap02s.eng.c4e.ops ns-slapd[22846]: Failed to open errors log file /var/log/dirsrv/slapd-patchqaldap02s/errors: error 13 (Permission denied); Exiting...
Oct 12 00:00:00 patchqaldap02s.eng.c4e.ops ns-slapd[22846]: Failed to reopen errors log file, Netscape Portable Runtime error -5966 (Access Denied.)

A few more questions:

What are the permissions of the logs directory and all the logs? Can you grab a quick "ls -la"?
Is the server running as "dirsrv"?
What platform are you running on?
If you disable selinux does it work?
Does it fail to start after every single restart?

Thanks!

I will check logs for that specific error. They're definitely dying at log rotation with "access denied" error message. The only way I can start the server is to delete the error log first, otherwise it just hangs at "starting up".
status-dirsrv shows a number of these errors:
Oct 12 00:00:00 patchqaldap02s.eng.c4e.ops ns-slapd[22846]: Failed to open errors log file /var/log/dirsrv/slapd-patchqaldap02s/errors: error 13 (Permission denied); Exiting...
Oct 12 00:00:00 patchqaldap02s.eng.c4e.ops ns-slapd[22846]: Failed to reopen errors log file, Netscape Portable Runtime error -5966 (Access Denied.)

A few more questions:
What are the permissions of the logs directory and all the logs? Can you grab a quick "ls -la"?
Is the server running as "dirsrv"?
What platform are you running on?
If you disable selinux does it work?
Does it fail to start after every single restart?
Thanks!

  • ls -la is at the bottom. Changing permissions did not allow the directory server to start, just hang at "starting"
  • Yes, default user and group
  • RHEL7 - 3.10.0-1062.1.1.el7.x86_64
  • I have not yet tried that but we have reviewed the context. Every server, production and lab, area all SELinux and it doesn't appear that it fails every time on every server.
  • It looks like the dozen lab servers we deployed two weeks ago fail every time it attempts a log rotation and the only way to start it up is to remove the error log. Changing permissions will not help. Our production servers don't seem as consistent. but they were deployed with 1.3.8.7 then upgraded to 1.3.9.1. It's possible one or two pairs in production were deployed with 1.3.9.1 but I can't confirm.

ls -la /var/log|grep dirsrv

drwxr-x---. 4 root root 46 Oct 2 14:04 dirsrv

[root@devldap02s]# ls -la /var/log/dirsrv/
total 8
drwxr-x---. 4 dirsrv dirsrv 46 Oct 2 14:04 .
drwxr-x---. 15 root log 4096 Oct 13 03:43 ..
drwx------. 2 dirsrv dirsrv 31 Sep 23 14:10 admin-serv
drwxr-xr-x. 2 dirsrv dirsrv 4096 Oct 14 12:02 slapd-devldap02s

[root@devldap02s]# ls -la /var/log/dirsrv/slapd-devldap02s/
total 1568
drwxr-xr-x. 2 dirsrv dirsrv 4096 Oct 14 12:02 .
drwxr-x---. 4 dirsrv dirsrv 46 Oct 2 14:04 ..
-rw-------. 1 dirsrv dirsrv 1534406 Oct 14 14:14 access
-rw-------. 1 dirsrv dirsrv 63 Oct 14 12:02 access.rotationinfo
-rw-------. 1 dirsrv dirsrv 0 Oct 14 12:02 audit
-rw-------. 1 dirsrv dirsrv 63 Oct 14 12:02 audit.rotationinfo
-rw-------. 1 dirsrv dirsrv 8438 Oct 14 12:02 errors
-rw-------. 1 dirsrv dirsrv 63 Oct 14 12:02 errors.rotationinfo

Okay, can you get the errors log from the failed startup (before it has to be deleted)? I need to see what it's saying because everything looks fine so I have no idea how it could be inconsistently failing.

You can also try removing errors and errors.rotationinfo. Perhaps they are out of sync. They will get rewritten at a successful startup

I did save the log files on one of the servers and there's nothing in the error pertaining to the crash, just this:
[10/Oct/2019:00:00:00.052636174 -0400] - ERR - NSMMReplicationPlugin - repl5_inc_run - agmt="cn=OECAgreement" (genldap01s:389): Event update_window_opened should not occur in state wait_for_changes

It's true that it will start after deleting only the error log, I'd just been deleting everything in case it happens with the other logs aswell.

I will work on getting the core dump and will see if I can find any additional info from the error logs.

I came in today and had three of 12 servers down in the lab and one of 16 down in production. I'm having some trouble getting the core dump configured. The instructions might be a bit outdated. I'll press on and see what I can do with it. This time, I deleted only the errors.rotationinfo file and it came up on restart. There are no log entries at all at the time of the incident but that's probably because it can't open the file.

[root@genldap02s slapd-genldap02s]# ls -l
total 15672
-rw-------. 1 dirsrv dirsrv 16000447 Oct 12 12:40 access
-rw-------. 1 dirsrv dirsrv 63 Oct 11 12:40 access.rotationinfo
-rw-------. 1 dirsrv dirsrv 0 Oct 11 12:40 audit
-rw-------. 1 dirsrv dirsrv 63 Oct 11 12:40 audit.rotationinfo
-rw-rw-rw-. 1 dirsrv dirsrv 28595 Oct 18 00:00 errors
-rw-------. 1 dirsrv dirsrv 63 Oct 11 12:40 errors.rotationinfo

[root@genldap02s slapd-genldap02s]# cat errors.rotationinfo
LOGINFO:Log file created at: 11/Oct/2019:12:40:26 (1570812026)

Login to comment on this ticket.

Metadata