#3528 sssd refuses to start when pidfile is present, but the process is gone
Closed: Fixed 2 years ago Opened 2 years ago by laggyluke.

Steps to reproduce

  1. start sssd
  2. hard kill sssd: pkill -KILL -F /var/run/sssd.pid
  3. try starting sssd again

Expected result

sssd sees the pidfile, reads the pid from the file, observes there's no more process with this pid and proceeds to start as if there was no pidfile present.

Actual result

sssd sees the pidfile and immediately exits with the following message: pidfile exists at /var/run/sssd.pid


Hmm, I haven't tried to reproduce this issue, but it does sound like a bug, because the pidfile() function explicitly calls kill(pid, 0) to find out whether the process with the number in the pidfile exists.
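
The check described above (read the PID from the file, then probe it with signal 0) can be sketched roughly as follows. This is an illustration in Python, not sssd's C code: the names process_exists and pidfile_state are hypothetical, and treating an empty or unparsable pidfile as stale is an assumption of the sketch.

```python
import os


def process_exists(pid: int) -> bool:
    """Probe a PID with signal 0: the existence check runs, but no signal is delivered."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH: no process with this PID is visible to us.
        return False
    except PermissionError:
        # EPERM: the process exists but belongs to another user.
        return True
    return True


def pidfile_state(path: str) -> str:
    """Classify the pidfile at `path` as 'missing', 'stale', or 'live'."""
    try:
        with open(path) as f:
            text = f.read().strip()
    except FileNotFoundError:
        return "missing"
    # Assumption of this sketch: empty or non-numeric content counts as stale.
    try:
        pid = int(text)
    except ValueError:
        return "stale"
    # PIDs <= 0 address process groups in kill(2), so never probe them.
    if pid <= 0:
        return "stale"
    return "live" if process_exists(pid) else "stale"
```

A caller would refuse to start only on "live" and would overwrite the file on "missing" or "stale". Note that the probe only sees processes in the caller's own PID namespace.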

I was able to reproduce this one on a quick test. Didn't check any further than "yes, I can reproduce the issue", though.

I do not think it is a bug. It is just the wrong use case.
pkill -KILL -F /var/run/sssd.pid is the wrong way to stop sssd.

Anyway, with containers and shared directories, there need not be a process with the PID stored in /var/run/sssd.pid, because of PID namespaces. So if you see the message "pidfile exists at /var/run/sssd.pid", something is fishy, and sssd should not try to clean up this file itself. Such a case should be investigated, and the file removed manually once the underlying problem has been found and fixed.

Anyway, we might improve the debug/error message.

I completely agree that killing sssd like that is a wrong use-case, obviously I only used it as an example of hard failure that results in a dangling pidfile. In my specific case it was OOM-killed - server misconfiguration plus some bad luck. But it would be great if sssd could be simply restarted by systemd instead of having to reboot the whole server.

I found the bit of code that apparently does the right thing:
https://github.com/SSSD/sssd/blob/0fba03cab9580cab6898e855bcd0ca9b2e54ce67/src/util/server.c#L181

Unfortunately, my C skills are not good enough to prepare a pull request, although I might give it a try if no one else is interested.

> I completely agree that killing sssd like that is a wrong use-case, obviously I only used it as an example of hard failure that results in a dangling pidfile. In my specific case it was OOM-killed - server misconfiguration plus some bad luck. But it would be great if sssd could be simply restarted by systemd instead of having to reboot the whole server.

You needn't reboot the whole server. You need to investigate a bit and find out why the pidfile exists. In your case you just need to remove the pidfile and start sssd.

This is why a better error message would help. But as I wrote in my previous message, it need not be that straightforward in the container world.

BTW, which version of sssd do you use? Later versions of sssd use systemd with Type=notify, and the pidfile need not be used there.

I was able to reproduce this in a fresh VM running CentOS Linux release 7.3.1611 (Core) and sssd v1.15.2.

After further investigation we found out that in our setup sssd was hard killed after not being able to gracefully shut down in time on an overloaded server. But I still believe the actual reason for a hard kill is irrelevant. One way or another, we end up in a situation where the pidfile exists, but the corresponding process is no longer running. The pidfile can even be empty and it will still refuse to start.

I'm not sure I fully understand the container case that you've mentioned, but it looks like you're describing a setup that has a pid namespace isolation together with a shared filesystem - is that actually a setup you want to support?

> I was able to reproduce this in a fresh VM running CentOS Linux release 7.3.1611 (Core) and sssd v1.15.2.

This version of CentOS has sssd 1.14 by default.

> After further investigation we found out that in our setup sssd was hard killed after not being able to gracefully shut down in time on an overloaded server. But I still believe the actual reason for a hard kill is irrelevant. One way or another, we end up in a situation where the pidfile exists, but the corresponding process is no longer running. The pidfile can even be empty and it will still refuse to start.
> I'm not sure I fully understand the container case that you've mentioned, but it looks like you're describing a setup that has a pid namespace isolation together with a shared filesystem - is that actually a setup you want to support?

The following steps should be a sufficient workaround. sssd still creates the pidfile even though we moved from Type=forking to Type=notify. So the pidfile is cleaned up by sssd in most cases, but systemd does not know anything about it and cannot clean up the file in corner cases (OOM ...).

  • mkdir -p /etc/systemd/system/sssd.service.d
  • echo -e '[Service]\nPIDFile=/var/run/sssd.pid\n' > /etc/systemd/system/sssd.service.d/sssd_pidfile.conf
  • systemctl daemon-reload
  • systemctl restart sssd.service
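
For setups that script this workaround, the drop-in from the first step could be written along these lines. This is only a sketch in Python: the function name write_pidfile_dropin and its parameters are hypothetical, while the file name, directory name, and PIDFile path are taken from the steps above.

```python
from pathlib import Path


def write_pidfile_dropin(unit_dir: str, pidfile: str = "/var/run/sssd.pid") -> Path:
    """Write a systemd drop-in that tells systemd where sssd keeps its pidfile."""
    dropin_dir = Path(unit_dir) / "sssd.service.d"
    # The drop-in directory may not exist yet, so create it first.
    dropin_dir.mkdir(parents=True, exist_ok=True)
    target = dropin_dir / "sssd_pidfile.conf"
    target.write_text(f"[Service]\nPIDFile={pidfile}\n")
    return target
```

Calling write_pidfile_dropin("/etc/systemd/system") as root produces the same file as the echo command; systemctl daemon-reload and systemctl restart sssd.service are still needed afterwards.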

I am still not sure whether it is a bug or a feature :-) because the current state allows us to go back to "Type=forking" for testing purposes.

That's exactly the kind of workaround that we're going to use, thanks!

But there's one more argument for fixing this on the sssd side without relying on systemd or any other init system:

If sssd creates a pidfile in the first place, it is probably reasonable to expect it to take full ownership of the whole lifecycle around that pidfile, including cleaning up the stale one in case something goes wrong.

> If sssd creates a pidfile in the first place, it is probably reasonable to expect it to take full ownership of the whole lifecycle around that pidfile, including cleaning up the stale one in case something goes wrong.

sssd takes care of the pidfile, but it does not get a chance to do so in the case of OOM.

And as I have mentioned a few times, we cannot remove the file because of the container use case. Containers usually have a different PID namespace, so we cannot tell whether the file is a leftover from a forced shutdown (kill -9, OOM ...) or was created by a different process in a different PID namespace.

Lukas said he would tune the default systemd unit file, but nonetheless, this is a minor issue.

Metadata Update from @jhrozek:
- Issue assigned to lslebodn

2 years ago

Metadata Update from @jhrozek:
- Issue set to the milestone: SSSD 1.16.0

2 years ago

Metadata Update from @jhrozek:
- Issue priority set to: minor

2 years ago

Since we are required to release a new upstream tarball no later than Friday Oct-20, I'm moving tickets that will not be closed by that date to the next milestone, 1.16.1.

Metadata Update from @jhrozek:
- Issue set to the milestone: SSSD 1.16.1 (was: SSSD 1.16.0)

2 years ago

Metadata Update from @lslebodn:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago
