#3617 sssd did not start due to bad /var/run/sssd.pid
Opened a year ago by harri. Modified 8 months ago

At boot time sssd was not started, apparently due to an old sssd.pid file:

# ps -ef | grep sssd
root       1253    430  0 09:00 pts/4    00:00:00 grep --color=auto sssd
# cat /var/run/sssd.pid 
62
# ls -l /proc/62/exe
lrwxrwxrwx 1 root root 0 Jan 15 08:33 /proc/62/exe -> /usr/bin/dbus-daemon

Do you think it would be possible to improve the check whether /var/run/sssd.pid is valid?

Platform is CentOS 7.4, sssd 1.15.2-50.

Thanks in advance
Harri


I do not think it is due to an old /var/run/sssd.pid, because /run is a tmpfs on el7.

sh# mount -l | grep run
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,seclabel,size=388216k,mode=700)

sh# readlink -f /var/run/
/run

I would recommend following one of these pages (the 2nd one is preferred :-)

It's /var/run/sssd.pid, i.e. on / on my systems.

But /var/run is a symbolic link to /run on rhel7 by default and /run is a tmpfs by default

sh# find /var/ -maxdepth 1 -type l
/var/run
/var/lock
/var/mail

sh# readlink -f /var/run/
/run

Please follow the links in my previous comment, or at least provide more info; otherwise we can't help you.

On my hosts /var/run is a directory on /. ext4.

I doubt that this discussion is reasonable. Even if /var/run is a tmpfs on some CentOS hosts, you cannot be sure about this on other Linux distros.

> On my hosts /var/run is a directory on /. ext4.
> I doubt that this discussion is reasonable.

I agree that there are communication issues, but we would like to get as much info as possible to confirm or reproduce the bug. I'm sorry, but we do not have access to your machine for troubleshooting (and we do not want such access :-).

Your assumption about /var/run/sssd.pid might be correct, but the question is why the file exists after a reboot. sssd should be stopped properly by systemd, and thus the pid file should not exist.

Please either follow this link: https://docs.pagure.org/SSSD.sssd/users/reporting_bugs.html or at least provide as much info as possible.

The ideal state would be to provide exact steps to reproduce the problem.

I do not know where the sssd.pid file came from, but I would assume this was an unclean shutdown of the previous session. By chance the 62 was in use by another service on the next session.

There is no easy, reliable way to reproduce the problem. It should be possible to fake it, though: Add some temporary code to copy the service.pid file of an unrelated service (e.g. lvmetad) to /var/run/sssd.pid before the startup script for sssd is run. lvmetad is part of lvm2 and it is run very early at boot time, AFAIR.

The sssd startup script should recognize that the pid it has read from sssd.pid points to a different service, delete /var/run/sssd.pid and then start sssd as usual.

Just a suggestion, of course.
Regards
Harri

> I do not know where the sssd.pid file came from, but I would assume this was an unclean shutdown of the previous session. By chance the 62 was in use by another service on the next session.

Could you check journald or /var/log/messages?
Maybe you can see a crash there or something similar.
If there is a crash, do you have a coredump?

> There is no easy, reliable way to reproduce the problem.

That's good. I thought you hit this quite frequently due to another bug in sssd;
therefore I asked for more info.

If you do not hit another bug in sssd, then this seems to be a duplicate of #3528, and there is a trivial workaround mentioned in a comment: https://pagure.io/SSSD/sssd/issue/3528#comment-470285

Thank you very much for the bug report.

@harri have you found the reason why sssd.pid was not removed?

Because we would like to close this as a duplicate of #3528.

There is an important difference to #3528: In #3617 /var/run/sssd.pid was invalid, but the pid listed in this file was in use (by dbus). Since Linux doesn't support randomized PIDs there is a pretty high chance that this comes up again on the next unclean shutdown and boot.

> There is an important difference to #3528: In #3617 /var/run/sssd.pid was invalid, but the pid listed in this file was in use (by dbus). Since Linux doesn't support randomized PIDs there is a pretty high chance that this comes up again on the next unclean shutdown and boot.

I do not think that there is a difference between #3528 and this ticket. In the case of OOM sssd would be killed; sssd would not have a chance to remove the pid file; then some other service is restarted and might get the same PID as the one stored in /var/run/sssd.pid.

The only difference is that dbus is not restarted, only reloaded (due to its deep integration with the system), and the simplest way to get a new PID for dbus is to reboot the machine.

Anyway, I would prefer to focus on something other than the sssd pid file. That is already fixed in #3528. I would appreciate it if you could find out what happened to sssd so that the pid file was not removed.

The problem did not come up on a restart of sssd, but at boot time after an unclean shutdown.

On (22/01/18 11:12), Harald Dunkel wrote:

> The problem did not come up on a restart of sssd, but at boot time after an unclean shutdown.

May I know what you mean by unclean shutdown?

Do you mean power off?

That's the part I haven't found yet. I would guess something took too much time to stop and the container was killed, without removing sssd.pid first. There was no indication of an OOM error.

> That's the part I haven't found yet. I would guess something took too much time to stop and the container was killed, without removing sssd.pid first. There was no indication of an OOM error.

Ahh, in the case of a container the situation is a little bit complicated, and that's exactly the reason why it was solved in #3528 only by changing the systemd unit file.

In theory, sssd could try to find out whether the PID stored in the sssd.pid file matches an existing process and do some cleanup if no process with that PID exists. However, such a process can exist in a different PID namespace (container), and it would be impossible to detect that situation. So we would end up with the same files being accessed by different processes in different namespaces. This is the reason why sssd does not do the cleanup itself and just logs a message and fails to start.

With the latest explanation, can we close this ticket?

About 60% of my servers run with sysvinit (Debian). I wonder if src/sysv/sssd.in is covered in #3528 as well?

On (25/01/18 06:53), Harald Dunkel wrote:

> About 60% of my servers run with sysvinit (Debian).
> I wonder if src/sysv/sssd.in is covered in #3528 as well?

IIUC it is not possible to do the cleanup in init scripts.
At least I am not aware of a way to detect OOM or pkill -9 sssd.

Correct me if I am wrong. Or, if you know how to do it, would you mind
providing a patch?

LS

The init script can surely write the pid file (supporting daemons without internal pid file handling), so I don't see any reason why it shouldn't be able to delete it. Something like this should do:

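PID_FILE=/var/run/sssd.pid
if [ -f "$PID_FILE" ]; then
    # Remove the pid file if the recorded PID belongs to a running process other than sssd
    x=$(cat "$PID_FILE")
    if [ -d "/proc/$x" ] && [ "$(readlink "/proc/$x/exe")" != /usr/sbin/sssd ]; then
        rm -f "$PID_FILE"
    fi
fi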

Can you open a PR on github with your proposed change?

Metadata Update from @jhrozek:
- Issue set to the milestone: SSSD Patches welcome

10 months ago

A quick mockup of something that reads the file and checks if it's a process that is running....

It's quick, it's hacky, it's based on my memory of C - and I'm a bit pressed for time ATM

I suspect that it can be done a lot more nicely and that there are some nice utility functions that I'm missing, but if someone points me in the right direction then we can work on this faster.

UPDATE: Uploading file since the markdown editor broke things

Also, this should most likely reside in src/util/check_and_open.c or so

Usually kill(pid, 0) is used to test the validity of a PID (see man 2 kill). I guess you're proposing to rm the pidfile if there's no process with that PID? One of the previous comments by @lslebodn explains why we don't do that, and currently I can't think of a counterargument, so I tend to agree with Lukas and I think we just shouldn't do any cleanup in the daemon.
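
For anyone who wants to try the signal-0 probe from a shell, `kill -0` is the counterpart: it delivers no signal and only reports whether the target could be signalled. A minimal sketch follows; the helper name `pid_alive` is hypothetical, and the /proc fallback assumes Linux. It says nothing about whether the process is actually sssd, or about processes in other PID namespaces.

```shell
# Hypothetical helper: succeed if a process with the given PID exists.
# kill -0 sends no signal; it only performs the existence/permission check.
pid_alive() {
    pid=$1
    # Reject empty, zero or non-numeric input so garbage read from a
    # pid file is never handed to kill.
    case $pid in
        ''|0|*[!0-9]*) return 1 ;;
    esac
    # kill -0 also fails when the process exists but belongs to another
    # user, so fall back to /proc (Linux-specific) in that case.
    kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]
}
```

For example, `pid_alive "$(cat /var/run/sssd.pid)"` would succeed for the PID 62 case reported above, even though that PID belonged to dbus-daemon, which is exactly why existence alone is not a sufficient check.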

Normally this should be the job of the init system and systemd is already equipped to do that since commit f4b808c in issue #3528

Hmm I also noticed this issue is still open, I propose to close it as wontfix.

And now I realized the previous comment might have sounded too harsh. I just don't think the daemon itself can do anything about this case because it simply has no context. The entity starting the daemon might, so the only good idea I have is to return a special error code in this case. We already sort of do that:

    /* Check if the SSSD is already running */
    ret = check_file(SSSD_PIDFILE, 0, 0, S_IFREG|0600, 0, NULL, false);
    if (ret == EOK) {
        DEBUG(SSSDBG_FATAL_FAILURE,
              "pidfile exists at %s\n", SSSD_PIDFILE);
        ERROR("SSSD is already running\n");
        return 2;
    }

We just don't document the return code anywhere. Would documenting that whoever invokes sssd should handle return code 2 as 'cannot start because PID exists' help?
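
To make the idea concrete, a caller could branch on that status roughly as follows. This is only a sketch: the wrapper name `start_sssd` is hypothetical, and it assumes that no other failure path in sssd returns 2, which this thread does not establish.

```shell
# Hypothetical wrapper: run the daemon command given as arguments and
# report what its exit status means. The meaning of status 2 is taken
# from the snippet above; it is not documented behaviour.
start_sssd() {
    "$@"
    rc=$?
    case $rc in
        0) echo "started" ;;
        2) echo "refused: pid file exists" ;;
        *) echo "failed with status $rc" ;;
    esac
    # Hand the original status back so the caller can still branch on it.
    return $rc
}
```

An init script might call it as `start_sssd /usr/sbin/sssd -D` and decide for itself, with the context the daemon lacks, whether the pid file is stale.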

All unix daemons have handled stale pidfiles - it's pretty common - why can't sssd do it?

As I stated in the other bug report I filed, it should preferably be done using a UNIX socket to verify that SSSD is actually running. I only did the proc approach since I hadn't had time to dig through the source code to see if there was a signal handler and make sure that it didn't just die on kill(pid, 0)....

Also, IMHO fixing it in the init system is not the solution - it would have to be fixed once per init system instead of just fixing it once in the daemon, which actually should have more context, esp if you do the unix socket approach.

This is not a new problem. It's been handled and using init to workaround it is not it^tm.

Most kernels avoid reusing pids - so it would have to be quite some time between the file being created, the process crashing and the final restart of the process - it is very unlikely for the pids to clash.... And namespaces don't even come into it, since it's started by the same script with the same paths.

Basically, Linux wraps around PIDs after 32k PIDs have been used - and if systemd can't ensure a stable namespace for its init scripts and pid files, then it will not be usable with any standard unix daemon.
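
The 32k figure matches the traditional default of kernel.pid_max (32768). The value is tunable and some systems raise it, so it may be worth checking on the host in question. A minimal sketch, assuming a Linux /proc; the function name `pid_max` is hypothetical:

```shell
# Hypothetical helper: print the PID limit at which the kernel wraps
# around and starts handing out low PIDs again (Linux-specific).
pid_max() {
    cat /proc/sys/kernel/pid_max
}
```

Note that after a reboot the counter starts low again regardless of this limit, which is how an early service can end up with a PID recorded before the reboot.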

So, either sssd works as expected or it's broken. Even if you can nicely work around it in one init system, it's not a good generic solution, and it also surprises me that it would be an acceptable solution.

> Basically, Linux wraps around PIDs after 32k PIDs have been used - and if systemd can't ensure a stable namespace for its init scripts and pid files, then it will not be usable with any standard unix daemon.

SSSD is a daemon which should ideally run forever. But in case of issues (OOM) the init system needs to decide whether the pid file should be removed and the daemon restarted, or whether the daemon should just be killed.

SSSD can run in containers without any issue. So it could easily happen that two containers are accidentally started with the same parameters. They would share the same resources, which could cause many problems.
SSSD itself does not have enough context to determine whether the PID in the pidfile is valid in a different PID namespace. The only safe action is to refuse to start the daemon.

> So, either sssd works as expected or it's broken. Even if you can nicely work around it in one init system, it's not a good generic solution, and it also surprises me that it would be an acceptable solution.

There is no generic solution for every init system plus all possible container use cases. The OOM case (or any other critical issue with sbin/sssd) is fixed in #3528.

1. Yes, it should run forever, but when it doesn't, the machine becomes unusable and a reboot is required.

2. Yes, it can run in containers - but containers have different filesystems, so this is not an issue. Namespaces are not an issue either, since it should use different paths; otherwise you'll still run into trouble with how sssd is implemented.

3. It doesn't need the context: if you provide the same path to every instance, then you're doing it wrong!

Also, if you don't daemonize - why create a pidfile? Is there something external that needs this? In that case the pid file always has to be correct; otherwise it can just be omitted when not daemonized.

sss_process.c does some weird things with the pid file and I don't know the structures well enough to make the changes easily - but IMHO if we set interactive then we shouldn't create a pid file. This also means that systemd won't have to care about pidfiles at all in this case.

Small patch making interactive mode not use pid files, but it doesn't handle src/tools/common/sss_process.c - which still wants a pid file

interactive.diff

And something more common for checking the pid for the -D use case.

check-pid.diff
