#3040 Restart kojid and kojira services automatically
Merged 2 years ago by tkopecek. Opened 2 years ago by vvvelichkov.
vvvelichkov/koji systemd_restart_always  into  master

file modified
+2
@@ -10,6 +10,8 @@ 
         --force-lock \
         --verbose
  ExecReload=/bin/kill -USR1 $MAINPID
+ Restart=on-failure
+ RestartSec=60s
  
  [Install]
  WantedBy=multi-user.target

file modified
+2
@@ -9,6 +9,8 @@ 
         --fg \
         --force-lock \
         --verbose
+ Restart=on-failure
+ RestartSec=60s
  
  [Install]
  WantedBy=multi-user.target

Wait 60sec before restarting the service

I'm not sure about "always" vs. "on-failure". It is not a big difference here, so maybe it could stay as it is. @mikem ?
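
For reference, the difference between the two policies can be modeled roughly as follows. This is a simplified sketch in Python, not systemd's implementation: `should_restart` is a hypothetical helper, and it only looks at the exit code of a process that exited on its own, ignoring signals, timeouts, watchdog events, and any `SuccessExitStatus=` override.

```python
def should_restart(policy, exit_code):
    """Simplified model of the Restart= policies discussed in this PR.

    Assumption: the process exited by itself with `exit_code`; signals,
    timeouts and manual `systemctl stop` are deliberately not modeled.
    """
    if policy == "always":
        # Restart no matter how the process exited.
        return True
    if policy == "on-failure":
        # Restart only when the exit status is not clean.
        return exit_code != 0
    if policy == "no":
        return False
    raise ValueError(f"policy not covered by this sketch: {policy!r}")
```

Under this model the two policies only differ when the daemon exits with status 0, which is why the choice matters mainly for deliberate shutdowns.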

I'm a little concerned about fast failure loops here. It should be very unusual for kojid to exit from an error (we log and ignore all exceptions in the main loop). Such cases are interesting enough that human intervention might be best.

Hi @mikem,

I'm a little concerned about fast failure loops here.

What is your particular concern? I put a 5 second timeout so it should not be that fast.

It should be very unusual for kojid to exit from an error (we log and ignore all exceptions in the main loop).

Here is one particular problem from a few days ago: the kojira service failed with the following error and was not restarted, and as a result createrepo tasks were not executed for a few days. The only thing I did was connect and start the service, and in my opinion this is something that systemd should do.

Sep 19 22:57:06 koji kojira[1175]: File "/usr/sbin/kojira", line 831, in <module>

Sep 19 22:57:06 koji kojira[1175]: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='koji.opencode.com', port=443): Max retries exceeded with url: /kojihub
Sep 19 22:57:06 koji systemd[1]: kojira.service: main process exited, code=exited, status=1/FAILURE
Sep 19 22:57:06 koji systemd[1]: Unit kojira.service entered failed state.
Sep 19 22:57:06 koji systemd[1]: kojira.service failed.

Such cases are interesting enough that human intervention might be best.

In my opinion a better option is to have an automatic recovery (or at least an automatic attempt to recover).

This seems like a sane thing to add because it's doing what a human would do anyway, and it's not throwing away information.

5s is still fairly fast. The services themselves wait longer in their main loops each pass. I think 60s or longer would be more than sufficient to help out with the hopefully rare cases.

Kojid and kojira have their own while True loops that catch almost all errors. Your example appears to be a RetryError, which is about the only error that the main loop will re-raise.

There is definitely value in having an external restart, I'm just being cautious. In particular, I think Tomas is right that we might want "on-failure" here. There is at least one case where kojid shuts down deliberately. The 'shutdown' task tells kojid to stop. However, if we're going to go that way, we need to be more careful in these daemons about which cases report a non-zero status.
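
The point about exit statuses can be illustrated with a minimal Python sketch. It is not the kojid or kojira code: `RetryError` and `ShutdownRequested` are stand-in classes defined here, and `run_daemon` and `raiser` are hypothetical names. It only shows the idea that a deliberate stop should map to status 0 and a re-raised error to a non-zero status, so that `Restart=on-failure` restarts the service in the second case only.

```python
import sys


class RetryError(Exception):
    """Stand-in for the error the main loop re-raises (not the real class)."""


class ShutdownRequested(Exception):
    """Hypothetical marker for a 'shutdown' task asking the daemon to stop."""


def raiser(exc):
    """Return a callable that raises `exc`; used to simulate one loop pass."""
    def one_pass():
        raise exc
    return one_pass


def run_daemon(passes):
    """Run each pass in order and return the exit status to report."""
    for one_pass in passes:
        try:
            one_pass()
        except ShutdownRequested:
            # Deliberate stop: clean exit, so on-failure would not restart.
            return 0
        except RetryError:
            # Hub unreachable after retries: failure exit, so on-failure
            # would restart the service after RestartSec.
            return 1
        except Exception as exc:
            # Any other error is logged and ignored; the loop keeps going.
            print(f"ignored error: {exc!r}", file=sys.stderr)
    return 0
```

A real daemon would pass the returned value to `sys.exit()`; which errors deserve a non-zero status is exactly the follow-up question raised above.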

rebased onto 84cd90951c68600bbb68c43366430248cf2d6962

2 years ago

Hi @mikem,

Thanks for the feedback.

5s is still fairly fast. The services themselves wait longer in their main loops each pass. I think 60s or longer would be more than sufficient to help out with the hopefully rare cases.

I've changed it to 60s.

There is definitely value in having an external restart, I'm just being cautious. In particular, I think Tomas is right that we might want "on-failure" here.

I've changed this to on-failure.

rebased onto f26fe54

2 years ago

Just changed the pull request title and the commit message to better describe the changes.

Filed #3084 for follow-up

Metadata Update from @tkopecek:
- Pull-request tagged with: no_qe

2 years ago

Commit 9400ed2 fixes this pull-request

Pull-Request has been merged by tkopecek

2 years ago