PR#3040: Restart kojid and kojira services automatically - koji


#3040 Restart kojid and kojira services automatically

Merged 2 years ago by tkopecek. Opened 2 years ago by vvvelichkov.

vvvelichkov/koji systemd_restart_always into master

Restart kojid and kojira services automatically

Vasil Velichkov • 2 years ago

f26fe54

builder/kojid.service

file modified

		`@@ -10,6 +10,8 @@`
		`--force-lock \`
		`--verbose`
		`ExecReload=/bin/kill -USR1 $MAINPID`
		`+ Restart=on-failure`
		`+ RestartSec=60s`

		`[Install]`
		`WantedBy=multi-user.target`

util/kojira.service

file modified

		`@@ -9,6 +9,8 @@`
		`--fg \`
		`--force-lock \`
		`--verbose`
		`+ Restart=on-failure`
		`+ RestartSec=60s`

		`[Install]`
		`WantedBy=multi-user.target`

vvvelichkov commented 2 years ago

Wait 60 seconds before restarting the service.

tkopecek commented 2 years ago

I'm not sure about "always" vs. "on-failure". It is not a big difference here, so maybe it could stay as it is. @mikem ?

mikem commented 2 years ago

I'm a little concerned about fast failure loops here. It should be very unusual for kojid to exit from an error (we log and ignore all exceptions in the main loop). Such cases are interesting enough that human intervention might be best.

vvvelichkov commented 2 years ago

Hi @mikem,

> I'm a little concerned about fast failure loops here.

What is your particular concern? I put a 5 second timeout so it should not be that fast.

> It should be very unusual for kojid to exit from an error (we log and ignore all exceptions in the main loop).

Here is one particular problem from a few days ago: the kojira service failed with the following error and was not restarted, and as a result createrepo tasks were not executed for a few days. The only thing I did was connect and start the service, and in my opinion this is something that systemd should do.

Sep 19 22:57:06 koji kojira[1175]: File "/usr/sbin/kojira", line 831, in <module>

Sep 19 22:57:06 koji kojira[1175]: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='koji.opencode.com', port=443): Max retries exceeded with url: /kojihub
Sep 19 22:57:06 koji systemd[1]: kojira.service: main process exited, code=exited, status=1/FAILURE
Sep 19 22:57:06 koji systemd[1]: Unit kojira.service entered failed state.
Sep 19 22:57:06 koji systemd[1]: kojira.service failed.

> Such cases are interesting enough that human intervention might be best.

In my opinion a better option is to have an automatic recovery (or at least an automatic attempt to recover).

ktdreyer commented 2 years ago

This seems like a sane thing to add because it's doing what a human would do anyway, and it's not throwing away information.

mikem commented 2 years ago

5s is still fairly fast. The services themselves wait longer in their main loops each pass. I think 60s or longer would be more than sufficient to help out with the hopefully rare cases.

Kojid and kojira have their own while True loops that catch almost all errors. Your example appears to be a RetryError, which is about the only error that the main loop will re-raise.
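The loop behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual kojid or kojira code: `RetryError` here is a hypothetical stand-in class, `main_loop` is an invented name, and the loop runs over a finite list of callables instead of `while True` so that it terminates.

```python
import logging

logger = logging.getLogger("daemon-sketch")


class RetryError(Exception):
    """Hypothetical stand-in for the one error the main loop re-raises."""


def main_loop(passes):
    """Run each callable in `passes` once and return how many were handled.

    A real daemon would loop forever; a finite iterable keeps the sketch
    testable.
    """
    handled = 0
    for run_pass in passes:
        try:
            run_pass()
        except RetryError:
            # Re-raised: the process exits, and an external supervisor
            # such as systemd decides whether to start it again.
            raise
        except Exception:
            # Every other error is logged and the loop carries on.
            logger.exception("error in main loop, continuing")
        handled += 1
    return handled
```

With this shape, an ordinary exception in one pass does not stop later passes, while a `RetryError` propagates out of the loop and ends the process.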

There is definitely value in having an external restart, I'm just being cautious. In particular, I think Tomas is right that we might want "on-failure" here. There is at least one case where kojid shuts down deliberately: the 'shutdown' task tells kojid to stop. However, if we're going to go that way, we need to be more careful in these daemons about which cases report a non-zero status.
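To make the "always" vs. "on-failure" distinction concrete, here is a simplified model of how systemd treats the two settings. It is only an approximation: `would_restart` is a hypothetical helper, it covers exit codes and terminating signals only (not timeouts or watchdog triggers), and it ignores manual stops via `systemctl stop`, which never trigger a restart.

```python
# Signals that systemd counts as a clean exit by default.
CLEAN_SIGNALS = {"SIGHUP", "SIGINT", "SIGTERM", "SIGPIPE"}


def would_restart(policy, exit_code=None, term_signal=None):
    """Approximate whether the service is restarted after it stops.

    policy      -- "always" or "on-failure" (other values are not modelled)
    exit_code   -- exit status if the process exited on its own
    term_signal -- signal name if the process was killed by a signal
    """
    if policy == "always":
        return True
    if policy == "on-failure":
        if term_signal is not None:
            return term_signal not in CLEAN_SIGNALS
        return exit_code != 0
    raise ValueError(f"policy not modelled: {policy}")
```

Under this model, the kojira failure in the log above (`status=1/FAILURE`) is restarted with either setting, while a deliberate shutdown that exits with status 0 is restarted only with "always". That is why the exit status each daemon reports starts to matter once "on-failure" is chosen.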

Edited 2 years ago by mikem

rebased onto 84cd90951c68600bbb68c43366430248cf2d6962

2 years ago

vvvelichkov commented 2 years ago

Hi @mikem,

Thanks for the feedback.

> 5s is still fairly fast. The services themselves wait longer in their main loops each pass. I think 60s or longer would be more than sufficient to help out with the hopefully rare cases.

I've changed it to 60s.
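For a sense of scale, the effect of the longer delay on a failure loop can be estimated with simple arithmetic. This is a rough upper bound only, assuming the service fails immediately on every start and that no other start rate limiting applies; `max_restarts_per_hour` is a hypothetical helper.

```python
def max_restarts_per_hour(restart_sec):
    """Upper bound on restart attempts per hour for a given RestartSec.

    Assumes the process exits immediately each time, so the delay between
    attempts is the only thing spacing them out.
    """
    return 3600 // restart_sec


five = max_restarts_per_hour(5)     # 720 attempts per hour
sixty = max_restarts_per_hour(60)   # 60 attempts per hour
```

So moving from 5s to 60s cuts the worst-case restart rate by a factor of twelve.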

> There is definitely value in having an external restart, I'm just being cautious. In particular, I think Tomas is right that we might want "on-failure" here.

I've changed this to on-failure.

tkopecek commented 2 years ago

:thumbsup:

rebased onto f26fe54

2 years ago

tkopecek commented 2 years ago

placeholder issue: #3081

vvvelichkov commented 2 years ago

Just changed the pull request title and the commit message to better describe the changes.

ktdreyer commented 2 years ago

:thumbsup:

mikem commented 2 years ago