#315 ns-slapd exits/crashes if /var fills up
Closed: Fixed None by orion. Opened 6 years ago by orion.

I'm not sure if anything can be done here, but currently ns-slapd will crash/exit if /var fills up. The database is mainly read-only, so it seems the only thing being written would be the access log. I would rather the server stay active and lose access log entries than have the server go down.


Initial testing:

[1] I moved the access log to /home/log (not /var)
[2] Consumed all the disk space
[3] No errors, and slapd works as expected except nothing was being logged.
[4] As soon as I cleaned up some space, logging resumed.
[5] DS still accepted and responded to search queries with /home/log full and with free space.

I do see messages in /var/log/messages:

Mar 15 17:18:20 localhost setroubleshoot: SELinux is preventing /usr/sbin/ns-slapd from getattr access on the file /home/log/access. For complete SELinux messages. run sealert -l 0d3fa507-4d2f-47e5-b7b4-dda4ab93c113

Conclusion:

The crash is not because the DS cannot write to a log file, but because /var is full. /var is very important to all applications on the system: variable data, locks, and libraries are all stored and used in /var. DS also stores the db, txn logs, etc. on /var. Once it's full, any running application is apt to fail/crash, especially the DS.

I need to think of a better solution to prevent DS from crashing. At the very least, log an error and gracefully shut down the server so the database doesn't get corrupted. Or disabling access/audit logging is an option, as well as deleting rotated logs.
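To log an error before /var is completely full, the server first has to notice that space is running out. A minimal sketch of such a check using POSIX statvfs() follows. The names free_bytes() and below_threshold() are hypothetical and are not taken from the DS source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical helper: bytes available to unprivileged processes on the
 * filesystem that holds `path`, or -1 if the path cannot be queried. */
int64_t free_bytes(const char *path)
{
    struct statvfs st;

    if (path == NULL || statvfs(path, &st) != 0) {
        return -1;
    }
    /* f_bavail is counted in units of f_frsize; widen before multiplying
     * so the result does not overflow on 32-bit platforms. */
    return (int64_t)st.f_bavail * (int64_t)st.f_frsize;
}

/* Hypothetical helper: 1 if free space is below `threshold` bytes,
 * 0 if it is not, -1 if the filesystem could not be queried. */
int below_threshold(const char *path, int64_t threshold)
{
    int64_t avail;

    assert(threshold >= 0);
    avail = free_bytes(path);
    if (avail < 0) {
        return -1;
    }
    return avail < threshold;
}
```

A periodic task could call below_threshold() for each directory the server writes to and act on the first one that reports 1.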

Continuing investigation...

Mark

PS - if this is an ongoing issue there are many workarounds, like moving the logs to a different partition, refining the log rotation policy, etc.

Mark - Thanks for looking into this. I'm very well aware that I am a bad sysadmin for letting /var fill up and need to take my punishment. But ldap is so critical for almost everything that it would be great if it could keep working in some capacity.

Orion,

I'm working on a solution, but I want to run it by the team. Basically the current plan is to try everything we can to free up DS disk space: stop verbose error logging, then try to delete rotated logs, and finally stop access/audit logging. If all that fails and we are still about to exceed the disk space, it will shut down slapd gracefully. I am adding a configurable grace period before the shutdown that will allow an admin to do some manual cleanup. If the cleanup is successful within the grace period, it will not shut slapd down, and it will re-enable the access/audit logging.
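The escalation order described above can be sketched as a small state machine that a periodic check would drive. This is only an illustration of the plan, not the committed code. All names (disk_action, disk_state, next_action) are hypothetical, and resetting the state whenever space recovers is a simplifying assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef enum {
    ACT_NONE,              /* enough space, nothing to do */
    ACT_STOP_VERBOSE_LOG,  /* step 1: stop verbose error logging */
    ACT_DELETE_ROTATED,    /* step 2: delete rotated logs */
    ACT_STOP_ACCESS_AUDIT, /* step 3: stop access/audit logging */
    ACT_WAIT_GRACE,        /* all steps tried, grace period running */
    ACT_SHUTDOWN,          /* grace period expired, shut down gracefully */
    ACT_RESTORE_LOGGING    /* space recovered, re-enable access/audit logs */
} disk_action;

typedef struct {
    int steps_done;      /* cleanup steps already taken (0..3) */
    int in_grace;        /* non-zero once the grace period has started */
    time_t grace_start;  /* when the grace period started */
} disk_state;

/* Decide the next action from the current free space (in bytes). */
disk_action next_action(disk_state *s, int64_t avail, int64_t threshold,
                        time_t now, double grace_secs)
{
    assert(s != NULL);

    if (avail >= threshold) {
        /* Assumption: only restore logging if step 3 had disabled it. */
        int logging_was_off = (s->steps_done >= 3);
        s->steps_done = 0;
        s->in_grace = 0;
        return logging_was_off ? ACT_RESTORE_LOGGING : ACT_NONE;
    }

    switch (s->steps_done) {
    case 0: s->steps_done = 1; return ACT_STOP_VERBOSE_LOG;
    case 1: s->steps_done = 2; return ACT_DELETE_ROTATED;
    case 2: s->steps_done = 3; return ACT_STOP_ACCESS_AUDIT;
    default: break;
    }

    if (!s->in_grace) {
        s->in_grace = 1;
        s->grace_start = now;
        return ACT_WAIT_GRACE;
    }
    return difftime(now, s->grace_start) >= grace_secs ? ACT_SHUTDOWN
                                                       : ACT_WAIT_GRACE;
}
```

Keeping the decision separate from the side effects (stopping logs, deleting files, shutting down) makes the ordering easy to test without actually filling a disk.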

Like I said, I need to run this approach by the team before I get too deep in the coding.

Mark

Pushing the milestone to 1.3 because of the QE impact this feature will have...

Your approach looks good. Considering the big picture, we'd like to encapsulate the database backend related code in back-ldbm (libback-ldbm.so) and avoid giving the frontend (libslapd.so) any internal knowledge of the database, including the directory info. Ideally, we should be able to replace the backend with another type of backend (e.g., sql)...

Would it be possible to move the real monitoring code to the backend? And if the frontend needs the info, could it access the APIs via function pointers? (If that's done, other backends, e.g., dse, should have a corresponding implementation, which could be just stubs, though...)
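As a rough illustration of the function-pointer idea above, the frontend could see only a table of callbacks and ask each backend which directories to watch. This is a hedged sketch and not the existing plugin API. The backend_ops struct, collect_dirs(), and the "/example/..." paths are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation table: the only thing the frontend knows
 * about a backend. */
typedef struct backend_ops {
    const char *name;
    /* Writes up to `max` directory names into dirs[]; returns the count. */
    int (*get_monitored_dirs)(const char **dirs, int max);
} backend_ops;

/* Database backend: reports its (placeholder) db and txn log directories. */
static int ldbm_dirs(const char **dirs, int max)
{
    static const char *known[] = { "/example/db", "/example/txnlogs" };
    int n = 0;

    while (n < 2 && n < max) {
        dirs[n] = known[n];
        n++;
    }
    return n;
}

/* Stub for a backend with nothing on disk to monitor. */
static int dse_dirs(const char **dirs, int max)
{
    (void)dirs;
    (void)max;
    return 0;
}

const backend_ops ldbm_ops = { "ldbm", ldbm_dirs };
const backend_ops dse_ops  = { "dse",  dse_dirs };

/* Frontend side: gather directories without knowing backend internals. */
int collect_dirs(const backend_ops *const *backends, int nbackends,
                 const char **out, int max)
{
    int total = 0;

    assert(backends != NULL && out != NULL);
    for (int i = 0; i < nbackends && total < max; i++) {
        if (backends[i] != NULL && backends[i]->get_monitored_dirs != NULL) {
            total += backends[i]->get_monitored_dirs(out + total, max - total);
        }
    }
    return total;
}
```

Swapping in a different backend type would then only require another backend_ops instance, with a stub wherever the backend has no directories of its own.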

Hi Noriko,

Are you referring to: slapi_back_get_info()

This is already used in the frontend (backend.c), and the plugins. I would think it is out of the scope of my fix to change how the frontend and plugin API behaves. No?

Mark

Replying to [comment:8 mreynolds]:

Mark, you are right. You've already implemented it the way I asked. Sorry, I misread your code at first glance... :p

In disk_mon_add_dir(), the list owns the memory for the directory argument, so shouldn't it do
{{{
slapi_ch_array_add(list, slapi_ch_strdup(dir));
}}}
instead? Then you could just do disk_mon_add_dir(list, "/var");
And instead of "/var" use LOCALSTATEDIR or maybe something like this:
{{{
#ifdef LOCALSTATEDIR
    disk_mon_add_dir(list, LOCALSTATEDIR);
#else
    disk_mon_add_dir(list, "/var");
#endif
}}}
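The ownership point above can be shown with a standalone sketch in plain C. It does not use slapi_ch_array_add() or slapi_ch_strdup(). The dir_list type and its functions are hypothetical stand-ins, written only to show why copying inside the add function lets callers pass string literals or temporary buffers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the server's string array: a NULL-terminated
 * array that owns every string stored in it. */
typedef struct {
    char **items;
    size_t count;
} dir_list;

/* Adds a private copy of `dir`; the caller keeps ownership of its argument.
 * Returns 0 on success, -1 if memory could not be allocated. */
int dir_list_add(dir_list *list, const char *dir)
{
    size_t len;
    char *copy;
    char **grown;

    assert(list != NULL && dir != NULL);

    len = strlen(dir) + 1;
    copy = malloc(len);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, dir, len);

    /* Room for the new entry plus the terminating NULL. */
    grown = realloc(list->items, (list->count + 2) * sizeof *grown);
    if (grown == NULL) {
        free(copy);
        return -1;
    }
    grown[list->count++] = copy;
    grown[list->count] = NULL;
    list->items = grown;
    return 0;
}

/* Frees every owned string and the array itself. */
void dir_list_free(dir_list *list)
{
    assert(list != NULL);
    for (size_t i = 0; i < list->count; i++) {
        free(list->items[i]);
    }
    free(list->items);
    list->items = NULL;
    list->count = 0;
}
```

Because the list makes its own copy, a call such as dir_list_add(&list, "/var") is safe, and the list can free everything it holds without caring where the strings came from.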

In disk_mon_check_diskspace(char *dirs, long threshold, PRInt64 disk_space), threshold should be PRInt64, since long is 32-bit on 32-bit platforms.

I also created a new ext function, ldbm_config_db_logdirectory_get_ext(), because ldbm_config_db_logdirectory_get() was returning a copy of the directory, which would have been copied again by disk_mon_add_dir().

Thanks for the review Rich!

[mareynol@localhost servers]$ git merge ticket315
Updating a4e4edc..65f473c
Fast-forward
ldap/servers/slapd/back-ldbm/dblayer.c | 18 +
ldap/servers/slapd/back-ldbm/ldbm_config.c | 12 +-
ldap/servers/slapd/back-ldbm/proto-back-ldbm.h | 1 +
ldap/servers/slapd/daemon.c | 511 +++++++++++++++++++++++-
ldap/servers/slapd/libglobs.c | 257 +++++++++++--
ldap/servers/slapd/log.c | 65 +++-
ldap/servers/slapd/proto-slap.h | 16 +-
ldap/servers/slapd/slap.h | 12 +
ldap/servers/slapd/slapi-plugin.h | 4 +-
9 files changed, 849 insertions(+), 47 deletions(-)

[mareynol@localhost servers]$ git push origin master
Counting objects: 29, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (15/15), 8.26 KiB, done.
Total 15 (delta 13), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
a4e4edc..65f473c master -> master

[mareynol@localhost slapd]$ git merge disk
Updating 65f473c..c039fd4
Fast-forward
ldap/servers/slapd/libglobs.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

[mareynol@localhost slapd]$ git push
Counting objects: 11, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 646 bytes, done.
Total 6 (delta 4), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
65f473c..c039fd4 master -> master

Added initial screened field value.

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.2.11

2 years ago
