At boot time sssd was not started, apparently due to an old sssd.pid file:
# ps -ef | grep sssd
root 1253 430 0 09:00 pts/4 00:00:00 grep --color=auto sssd
# cat /var/run/sssd.pid
62
# ls -l /proc/62/exe
lrwxrwxrwx 1 root root 0 Jan 15 08:33 /proc/62/exe -> /usr/bin/dbus-daemon
Do you think it would be possible to improve the check whether /var/run/sssd.pid is valid?
Platform is CentOS 7.4, sssd 1.15.2-50.
Thanx in advance Harri
I do not think it is due to an old /var/run/sssd.pid, because /run is a tmpfs on el7:
sh# mount -l | grep run
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,seclabel,size=388216k,mode=700)
sh# readlink -f /var/run/
/run
I would recommend following one of the following pages (the 2nd one is preferred :-)
It's /var/run/sssd.pid, i.e. on / on my systems.
But /var/run is a symbolic link to /run on RHEL 7 by default, and /run is a tmpfs by default:
sh# find /var/ -maxdepth 1 -type l
/var/run
/var/lock
/var/mail
sh# readlink -f /var/run/
/run
Please follow the links in my previous comment, or at least provide more info; otherwise we can't help you.
On my hosts /var/run is a directory on / (ext4).
I doubt that this discussion is reasonable. Even if /var/run is a tmpfs on some CentOS hosts, you cannot be sure about this on other Linux distros.
I agree that there are communication issues. But we would like to get as much info as possible to confirm or reproduce the bug. I'm sorry, we do not have access to your machine for troubleshooting (and we do not want such access :-)
Your assumption about /var/run/sssd.pid might be correct, but the question is why the file exists after a reboot. sssd should be properly stopped by systemd, and thus the pid file should not exist.
Please either follow this link https://docs.pagure.org/SSSD.sssd/users/reporting_bugs.html or at least provide as much info as possible.
The ideal state would be to provide exact steps to reproduce the problem.
I do not know where the sssd.pid file came from, but I would assume there was an unclean shutdown of the previous session. By chance, PID 62 was in use by another service in the next session.
There is no easy, reliable way to reproduce the problem. It should be possible to fake it, though: Add some temporary code to copy the service.pid file of an unrelated service (e.g. lvmetad) to /var/run/sssd.pid before the startup script for sssd is run. lvmetad is part of lvm2 and it is run very early at boot time, AFAIR.
The sssd startup script should recognize that the pid it has read from sssd.pid points to a different service, delete /var/run/sssd.pid and then start sssd as usual.
Just a suggestion, of course. Regards Harri
Could you check journald or /var/log/messages? Maybe you can see a crash or something similar there. If there was a crash, do you have a coredump?
There is no easy, reliable way to reproduce the problem.
That's good. I thought you hit this quite frequently due to another bug in sssd; that is why I asked for more info.
If you did not hit another bug in sssd, then this seems to be a duplicate of #3528, and there is a trivial workaround mentioned in this comment: https://pagure.io/SSSD/sssd/issue/3528#comment-470285
Thank you very much for bug report.
@harri have you found the reason why sssd.pid was not removed?
Because we would like to close this as a duplicate of #3528.
There is an important difference to #3528: In #3617 /var/run/sssd.pid was invalid, but the pid listed in this file was in use (by dbus). Since Linux doesn't support randomized PIDs there is a pretty high chance that this comes up again on the next unclean shutdown and boot.
I do not think that there is a difference between #3528 and this ticket. In case of OOM, sssd would be killed; sssd would not have a chance to remove the pid file; then some other service is restarted and might get the same PID as the one stored in /var/run/sssd.pid.
The only difference is that dbus is not restarted, only reloaded (due to its deep integration with the system), and the simplest way to get a new PID for dbus is to reboot the machine.
Anyway, I would prefer to focus on something other than the sssd pid file. That is already fixed in #3528. I would appreciate it if you could find out what happened to sssd so that the pid file was not removed.
The problem did not come up on a restart of sssd, but at boot time after an unclean shutdown.
On (22/01/18 11:12), Harald Dunkel wrote:
May I know what you mean by unclean shutdown?
Do you mean power off?
That's the part I haven't found yet. I would guess something took too much time to stop and the container was killed without removing sssd.pid first. There was no indication of an OOM error.
Ahh, in the case of a container the situation is a little bit more complicated, and that's exactly the reason why it was solved in #3528 only by changing the systemd unit file.
In theory, sssd could try to find out whether the PID stored in the sssd.pid file matches an existing process, and do some cleanup if no process with that PID exists. However, such a process could exist in a different PID namespace (container), and it would be impossible to detect that situation. So we would end up with different processes in different namespaces accessing the same files. This is why sssd does not do the cleanup itself and just logs a message and fails to start.
With the latest explanation, can we close this ticket?
About 60% of my servers run with sysvinit (Debian). I wonder if src/sysv/sssd.in is covered in #3528 as well?
On (25/01/18 06:53), Harald Dunkel wrote:
IIUC it is not possible to do the cleanup in init scripts. At least I am not aware of a way to detect OOM or pkill -9 sssd.
Correct me if I am wrong. Or if you know how to do it, would you mind providing a patch?
LS
The init script can surely write the pid file (supporting daemons without internal pid file handling), so I don't see any reason why it shouldn't be able to delete it. Something like this should do:
PID_FILE=/var/run/sssd.pid
if [ -f "$PID_FILE" ]; then
    x=`cat "$PID_FILE"`
    # Remove the pid file if the recorded PID belongs to a different executable
    [ -d "/proc/$x" -a "x`readlink /proc/$x/exe`" != x/usr/sbin/sssd ] && rm -f "$PID_FILE"
fi
Can you open a PR on github with your proposed change?
Metadata Update from @jhrozek: - Issue set to the milestone: SSSD Patches welcome
A quick mockup of something that reads the file and checks if it's a process that is running....
It's quick, it's hacky, it's based on my memory of C - and I'm a bit pressed for time ATM
I suspect that it can be done a lot nicer and that there are some nice utility functions that I'm missing, but if someone points me in the right direction then we can work on this faster.
UPDATE: Uploading file since the markdown editor broke things
Attachment: test.diff
Also, this should most likely reside in src/util/check_and_open.c or so
Usually kill(pid, 0) is used to test the validity of a PID (see man 2 kill). I guess you're proposing to rm the pidfile if there's no process with that PID? One of the previous comments by @lslebodn explains why we don't do that and currently I can't think of a counterargument, so I tend to agree with Lukas and I think we just shouldn't do any cleanup in the daemon.
Normally this should be the job of the init system and systemd is already equipped to do that since commit f4b808c in issue #3528
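For reference, a minimal sketch of the signal-0 probe mentioned above, assuming a POSIX system. The PID is the first argument to kill() and the signal number 0 is the second; no signal is delivered, only the existence and permission checks run. The helper name pid_in_use is hypothetical and not part of SSSD.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: report whether some process currently owns `pid`. */
static bool pid_in_use(pid_t pid)
{
    if (pid <= 0) {
        /* 0 and negative values address process groups, not one process */
        return false;
    }
    if (kill(pid, 0) == 0) {
        return true;
    }
    /* EPERM means the process exists but belongs to another user;
     * ESRCH means no such process. */
    return errno == EPERM;
}
```

This only answers whether the PID is taken, not by whom. In the situation reported here it would return true for PID 62 even though that process was dbus-daemon rather than sssd.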
Hmm I also noticed this issue is still open, I propose to close it as wontfix.
And now I realized the previous comment might have sounded too harsh. I just don't think the daemon itself can do anything about this case because it simply has no context. The entity starting the daemon might, so the only good idea I have is to return a special error code in this case. We already sort of do that:
/* Check if the SSSD is already running */
ret = check_file(SSSD_PIDFILE, 0, 0, S_IFREG|0600, 0, NULL, false);
if (ret == EOK) {
    DEBUG(SSSDBG_FATAL_FAILURE,
          "pidfile exists at %s\n", SSSD_PIDFILE);
    ERROR("SSSD is already running\n");
    return 2;
}
We just don't document the return code anywhere. Would documenting that whoever invokes sssd should handle return code 2 as 'cannot start because PID exists' help?
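As an illustration of what handling that return code could mean for whoever invokes sssd, here is a minimal sketch of a launcher that runs a command and inspects its exit code. It assumes that exit code 2 is reserved for the pidfile case, which the thread itself notes is undocumented. The macro and function names are hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Exit code returned by the excerpt above when the pidfile exists.
 * The macro name is hypothetical; SSSD does not export it. */
#define SSSD_EXIT_PIDFILE_EXISTS 2

/* Run argv[0] with its arguments and return its exit code.
 * Returns -1 if the child could not be started or was killed by a signal,
 * and 127 if the program could not be executed. */
static int run_and_get_exit_code(char *const argv[])
{
    int status;
    pid_t child = fork();

    if (child < 0) {
        return -1;
    }
    if (child == 0) {
        execvp(argv[0], argv);
        _exit(127);   /* exec failed */
    }
    if (waitpid(child, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would compare the result with SSSD_EXIT_PIDFILE_EXISTS and then decide, using whatever context it has about containers and namespaces, whether the pidfile is stale and may be removed before retrying.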
All Unix daemons have handled stale pidfiles. It's pretty common, so why can't sssd do it?
As I stated in the other bug report I did, it should preferably be done using a UNIX socket to verify that SSSD is actually running. I only did the proc approach since I hadn't had time to dig through the source code to see if there was a signal handler and make sure that it didn't just die on kill(pid, 0)....
Also, IMHO fixing it in the init system is not the solution - it would have to be fixed once per init system instead of just fix it once in the daemon that actually should have more context, esp if you do the unix socket approach.
This is not a new problem. It's been handled and using init to workaround it is not it^tm.
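The UNIX-socket check suggested above could look roughly like the following sketch: try to connect to a stream socket and treat a successful connect as proof that something is serving it. A leftover socket file with no listener makes connect() fail, so a stale file is not mistaken for a running daemon. The helper name is hypothetical, and which socket path SSSD would expose for this purpose is left open here.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: true if a process accepts connections on the
 * Unix stream socket at `path`. */
static bool unix_socket_is_served(const char *path)
{
    struct sockaddr_un addr;
    bool served;
    int fd;

    if (strlen(path) >= sizeof(addr.sun_path)) {
        return false;   /* path does not fit into sockaddr_un */
    }
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return false;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    /* Fails with ENOENT if the file is missing and with ECONNREFUSED
     * if the file exists but nobody is listening. */
    served = connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0;
    close(fd);
    return served;
}
```

Like the pidfile, the socket path is a filesystem object, so two containers sharing the same path would still need to be told apart by other means.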
Most kernels avoid reusing PIDs, so there would have to be quite some time between the file being created, the process crashing and the final restart of the process. It is very unlikely for the PIDs to clash. And namespaces don't even come into it, since it's started by the same script with the same paths.
Basically, Linux wraps around PIDs after 32k PIDs have been used, and if systemd can't ensure a stable namespace for its init scripts and pid files, then it will not be usable with any standard Unix daemon.
So, either sssd works as expected or it's broken, even if you can nicely workaround it in one init system it's not a good generic solution and also, it surprises me that it would be an acceptable solution.
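The 32k figure above matches the kernel default for the Linux kernel.pid_max setting (32768), which an administrator can raise. A minimal sketch for reading the limit currently in effect, assuming /proc is mounted; read_pid_max is a hypothetical helper name.

```c
#include <stdio.h>

/* Hypothetical helper: return the kernel's current PID limit,
 * or -1 if /proc/sys/kernel/pid_max cannot be read or parsed. */
static long read_pid_max(void)
{
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
    long value = -1;

    if (f == NULL) {
        return -1;
    }
    if (fscanf(f, "%ld", &value) != 1) {
        value = -1;
    }
    fclose(f);
    return value;
}
```

PIDs are handed out upwards from the last one used until this limit is reached, and only then does allocation wrap around.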
SSSD is a daemon which should ideally run forever. But in case of issues (OOM), the init system needs to decide whether the pid file should be removed and the daemon restarted, or whether the daemon should just be killed.
SSSD can run in containers without any issue. So it could easily happen that two containers with the same parameters are accidentally executed. They would share the same resources, which could cause many problems. SSSD itself does not have enough context to determine whether the PID in the pidfile is valid in a different PID namespace. The only safe action is to refuse to start the daemon.
There is no generic solution for every init system plus all possible container use cases. The OOM case (or any other critical issue with sbin/sssd) is fixed in #3528.
2, Yes, it can run in containers, but containers have different filesystems, so this is not an issue. Also, namespaces are not an issue since it should use different paths, else you'll still run into trouble with how sssd is implemented.
3, It doesn't need the context. If you provide the same path to every instance then you're doing it wrong!
Also, if you don't daemonize, why create a pidfile? Is there something external that needs this? In this case the pid file has to always be correct, else it can just be omitted when not daemonized.
sss_process.c does some weird things with the pid file and I don't know the structures well enough to make the changes easily, but IMHO if we set interactive then we shouldn't create a pid file. This also means that systemd won't have to care about pidfiles at all in this case.
Small patch making interactive not use pid files but it doesn't handle src/tools/common/sss_process.c - which still wants a pid file
Attachment: interactive.diff
And something more common for checking the pid for the -D use case.
Attachment: check-pid.diff
Thank you for taking time to submit this request for SSSD. Unfortunately this issue was not given priority and the team lacks the capacity to work on it at this time.
Given that we are unable to fulfill this request I am closing the issue as wontfix.
If the issue still persists on a recent SSSD, you can request reconsideration of this decision by reopening this issue. Please provide additional technical details about its importance to you.
Thank you for understanding.
Metadata Update from @pbrezina: - Issue close_status updated to: wontfix - Issue status updated to: Closed (was: Open)
SSSD is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in SSSD's github repository.
This issue has been cloned to Github and is available here: - https://github.com/SSSD/sssd/issues/4638
If you want to receive further updates on the issue, please navigate to the GitHub issue and click on the subscribe button.
Thank you for understanding. We apologize for all inconvenience.