#236 buildmaster.service doesn't start on production (PID file not readable)
Closed: Fixed 6 years ago. Opened 6 years ago by kparal.

Today after fixing #235 on taskotron01.qa and rebooting, buildmaster can no longer be started:

Oct 02 07:46:14 taskotron01.qa.fedoraproject.org systemd[1]: Starting Buildmaster for taskbot...
Oct 02 07:46:22 taskotron01.qa.fedoraproject.org buildbot[1126]: Following twistd.log until startup finished..
Oct 02 07:46:22 taskotron01.qa.fedoraproject.org buildbot[1126]: The buildmaster appears to have (re)started correctly.
Oct 02 07:46:22 taskotron01.qa.fedoraproject.org systemd[1]: buildmaster.service: PID file /srv/buildmaster/master/twistd.pid not readable (yet?) after start: Permission denied
Oct 02 07:47:44 taskotron01.qa.fedoraproject.org systemd[1]: buildmaster.service: Start operation timed out. Terminating.
Oct 02 07:47:44 taskotron01.qa.fedoraproject.org systemd[1]: Failed to start Buildmaster for taskbot.
Oct 02 07:47:44 taskotron01.qa.fedoraproject.org systemd[1]: buildmaster.service: Unit entered failed state.
Oct 02 07:47:44 taskotron01.qa.fedoraproject.org systemd[1]: buildmaster.service: Failed with result 'timeout'.

Judging by the timestamps, my theory is that twistd.pid is first created with incorrect permissions and only corrected later. systemd tries to read it immediately and can't (hence "not readable" rather than "not found"). However, the same setup works just fine on taskotron-dev, so I'm not sure.
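One way to test this theory would be to sample the PID file's permissions while the service is starting. The following is only a rough sketch: the helper names `describe_pid_file` and `watch_pid_file` are made up for illustration, and it only looks at mode bits and ownership, which may not be the whole story.

```python
import os
import stat
import time


def describe_pid_file(path):
    """Return a short summary of the PID file's state, or a note if it cannot be inspected."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "stat denied"
    # stat.filemode() renders the mode like `ls -l` does, e.g. "-rw-------".
    mode = stat.filemode(st.st_mode)
    return f"{mode} uid={st.st_uid} gid={st.st_gid} size={st.st_size}"


def watch_pid_file(path, duration=15.0, interval=0.2):
    """Sample the file repeatedly; return (elapsed_seconds, summary) for each change seen."""
    changes = []
    last = None
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        summary = describe_pid_file(path)
        if summary != last:
            changes.append((round(elapsed, 1), summary))
            last = summary
        if elapsed >= duration:
            break
        time.sleep(interval)
    return changes
```

Calling `watch_pid_file("/srv/buildmaster/master/twistd.pid")` in one terminal while starting buildmaster.service in another would show whether the mode or owner changes during the first seconds after startup.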

I was able to work around the issue temporarily by editing /usr/lib/systemd/system/buildmaster.service and commenting out:

#PIDFile=/srv/buildmaster/master/twistd.pid

We need to find out why this happens and fix it (or remove the PIDFile line if nothing else works).


Metadata Update from @kparal:
- Issue priority set to: High
- Issue tagged with: infra

6 years ago

This is still affecting production. And not just buildmaster now, but also buildslaves!

systemd[1]: Starting Buildslave for taskotron...
buildslave[31803]: Following twistd.log until startup finished.
buildslave[31803]: The buildslave took more than 10 seconds to 
buildslave[31803]: confirm that it started correctly. Please 't
buildslave[31803]: line that says 'configuration update complete' to verify correct startup.
systemd[1]: buildslave@qa13.qa-1.service: Control process exited, code=exited status=1

I had to edit /usr/lib/systemd/system/buildslave@.service and make the same change there. This is getting really serious. We need to commit at least this workaround to ansible until we figure out where the problem is (and whether we really need that PIDFile line).

I commented out PIDFile lines in all our buildmaster/buildslave service files and deployed everywhere except stg client hosts (because they don't seem to be working atm anyway).
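For reference, the change itself is simple enough to express as a small text transformation. This is only an illustrative sketch, not what was actually deployed through ansible, and the function name `comment_out_pidfile` is hypothetical.

```python
def comment_out_pidfile(unit_text):
    """Return the unit file text with every active PIDFile= line commented out."""
    out = []
    for line in unit_text.splitlines(keepends=True):
        # Lines that are already commented start with "#" and are left untouched,
        # so running this twice gives the same result.
        if line.lstrip().startswith("PIDFile="):
            out.append("#" + line)
        else:
            out.append(line)
    return "".join(out)
```

Note that files under /usr/lib/systemd/system are owned by the package and get overwritten on update, so a manual edit there does not persist unless it is also kept in configuration management.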

Let's close this as resolved and reopen if we find an issue with the missing PIDFile line.

Metadata Update from @kparal:
- Issue close_status updated to: Fixed

6 years ago
