#191 Please recommend setting Restart=always in normal services
Closed: Fixed None Opened 7 years ago by lennart.

Please update the Fedora packaging guidelines to recommend using Restart= in the [Service] section of normal services, so that services that terminate are automatically restarted.

Usage of Restart= should only be a recommendation, not a requirement, to leave room for individual decisions for the various packages.

We should clarify that this option should not be used for Type=oneshot services, as it makes no sense for that.

We should probably recommend Restart=on-failure, but optionally allow Restart=on-abort or Restart=always, too, depending on the service. Why pick Restart=on-failure by default? Simply because otherwise services that exit due to a user request that bypassed systemd would automatically restart, which is probably not what the user wants. (Example: the user sends SIGTERM by hand, which probably means they want the service to exit, and don't want systemd to immediately restart it.)
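As a rough illustration of the request above, a long-running service could opt in with a single line in its [Service] section. This is only a sketch: the unit name, description and binary path are hypothetical placeholders, not taken from any Fedora package.

```ini
# exampled.service (hypothetical unit for a long-running daemon)
[Unit]
Description=Example long-running daemon (placeholder)

[Service]
# Hypothetical binary, assumed to stay in the foreground.
ExecStart=/usr/sbin/exampled
# Restart after crashes and unclean exits. A clean exit, including
# one caused by a manually sent SIGTERM, leaves the service stopped.
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

A Type=oneshot unit would simply omit the Restart= line.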


Why don't you just change the systemd default behaviour?

As Tom Lane posted in fedora-devel list, please see rhbz#832029 before accepting as a hard rule.

Given this is only a recommendation and not a "hard rule", that shouldn't be a problem.

Besides, "You are not authorized to access bug #832029.", so it's hard to comment further.

What is the behaviour of systemd when a service gets into a failure loop? (Also... I looked on the systemd.units man page on 0pointer but didn't see restart= documented. Is it somewhere else or is this just a new feature that's not yet in the man page?)

Replying to [comment:5 toshio]:

What is the behaviour of systemd when a service gets into a failure loop?

All service spawns are by default rate limited, and there's a pause of 100ms before we retry. This should hence be quite safe. (And for people who want to fiddle with the parameters here, there are StartLimitInterval=, StartLimitBurst= and RestartSec= for that in each .service file.)
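For reference, the settings named here might be written out as follows. The values are assumptions chosen to mirror what this thread describes as the defaults of the time (a 100ms pause, and a limit of 5 starts per 10s); check systemd.service(5) for the defaults of your version. Newer systemd releases spell the interval StartLimitIntervalSec= and document both limit settings under [Unit], so treat the placement below as matching the systemd of this ticket's era.

```ini
[Service]
Restart=on-failure
# Pause before each automatic restart attempt.
RestartSec=100ms
# Rate limit: at most StartLimitBurst starts within StartLimitInterval.
StartLimitInterval=10s
StartLimitBurst=5
```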

(Also... I looked on the systemd.units man page on 0pointer but didn't see restart= documented. Is it somewhere else or is this just a new feature that's not yet in the man page?)

This is a setting for service units only, not all units. Hence it is documented in systemd.service(5):

http://www.freedesktop.org/software/systemd/man/systemd.service.html

Okay. Looking at the documentation and asking a few sysadmins for feedback, couple thoughts:

  • Restart=on-failure sounds like a decent recommendation (as given in the body of this request as opposed to the subject of this request)
  • People are concerned about the documented behaviour of Restart= to keep trying after the StartLimitInterval= ends. For instance, one concern was that it would inflate the logfiles and possibly mask the root cause/initial change that started the failures.

Could we have this functionality changed so our default behaviour with this recommendation (whether upstream default as james asked about or with configuration in the file) would be for the automatic restart to be attempted for the StartLimitBurst number of times and if exceeded, require a manual start by the admin? It would look something like this:

{{{
# Same idea as now
Restart=on-failure

# Interval to count the number of restarts within. This is half the
# functionality of the current StartLimitInterval.
# A larger value than 10s seems more appropriate as some services take
# longer to come up. A well behaved service should not be bouncing
# multiple times within a minute (or even ten minutes), so it seems
# reasonable to increase it.
StartLimitInterval=60s

# Number of times the service is allowed to be automatically restarted
# within a StartLimitInterval before it fails and stays off until
# explicitly restarted (this is decreased over the current default as
# slower starting services might not be able to restart 5 times in a
# minute even if they're in a failure loop).
StartLimitBurst=3
}}}

Also, it's unclear from the documentation whether the StartLimitInterval and StartLimitBurst only apply to automatic Restarts or if they also apply to explicit start/restart from systemctl commands. If they include the latter, there should be a second set of options that only apply to the automatic restart functionality (as 1 minute and 3 restarts seems more reasonable for automatic restarts but might be wide enough to catch sysadmins that are making small changes to configs and then restarting).

Also, this change should get into the release notes (possibly via the Feature process) as it's a change to the behaviour of services that sysadmins should be aware of.

Catching up on reading the thread on the mailing list http://lists.fedoraproject.org/pipermail/devel/2012-June/169334.html It seems like on-abort is a little more conservative. Not sure if we might want to recommend that as the default instead... probably need to think a little more about it in the meeting. Could you tell us a little more about what this portion of on-failure's behaviour does: "when an operation times out or when the configured watchdog timeout is triggered."

Replying to [comment:7 toshio]:

Okay. Looking at the documentation and asking a few sysadmins for feedback, couple thoughts:

  • Restart=on-failure sounds like a decent recommendation (as given in the body of this request as opposed to the subject of this request)
  • People are concerned about the documented behaviour of Restart= to keep trying after the StartLimitInterval= ends. For instance, one concern was that it would inflate the logfiles and possibly mask the root cause/initial change that started the failures.

Hmm, this is a misunderstanding. The StartLimitInterval=/StartLimitBurst= logic simply defines a maximum restart rate. If that rate is hit we don't attempt to restart the daemon anymore (until the user restarts it manually again). This is intended precisely to avoid spamming logs and burning CPU.

Example:

StartLimitInterval=10s
StartLimitBurst=20

Means: if we try to restart the service more often than 20 times per 10s, just stop doing that and don't try to start the daemon anymore.

A scheme like this is BTW very close to what sysvinit does here.

I have added another sentence to the docs now to clarify this.

Could we have this functionality changed so our default behaviour with this recommendation (whether upstream default as james asked about or with configuration in the file) would be for the automatic restart to be attempted for the StartLimitBurst number of times and if exceeded, require a manual start by the admin? It would look something like this:

The StartLimit applies to all starts, not just automatic restarts (that's why the option is called StartLimit, not RestartLimit.)

Also, it's unclear from the documentation whether the StartLimitInterval and StartLimitBurst only apply to automatic Restarts or if they also apply to explicit start/restart from systemctl commands. If they include the latter, there should be a second set of options that only apply to the automatic restart functionality (as 1 minute and 3 restarts seems more reasonable for automatic restarts but might be wide enough to catch sysadmins that are making small changes to configs and then restarting).

We really shouldn't make this any more complex than it is, i.e. we should try to avoid even more options and settings. Note that the rate limit counter is automatically forgotten when the user uses "systemctl reset-failed", which I think is a nicer way for the user to force that all ratelimiting/failure data is forgotten about a service.

Also, this change should get into the release notes (possibly via the Feature process) as it's a change to the behaviour of services that sysadmins should be aware of.

Well, in the past FESCO refused to bless features which they thought were in the area of FPC.

Replying to [comment:8 toshio]:

Catching up on reading the thread on the mailing list http://lists.fedoraproject.org/pipermail/devel/2012-June/169334.html It seems like on-abort is a little more conservative. Not sure if we might want to recommend that as the default instead... probably need to think a little more about it in the meeting. Could you tell us a little more about what this portion of on-failure's behaviour does: "when an operation times out or when the configured watchdog timeout is triggered."

I am fine with either on-abort or on-failure. I think on-failure is the nicer choice for daemons that are written somewhat cleanly, but on-abort is the safer choice.

A failure in the sense of "on-failure" is five different things:

  • the main daemon exits and dumps core
  • the main daemon exits with an abortive signal
  • the main daemon exits with an exit code != 0
  • the watchdog feature is enabled for a service and a service fails to respond
  • some action such as service reload is executed but this times out.

OTOH "on-abort" only restarts a service on the first two cases, and just moves the services into a "failed" state otherwise.

(the watchdog logic is explained in more detail here: http://0pointer.de/blog/projects/watchdog.html )
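To make the distinction concrete, here is a sketch of the two settings side by side for a hypothetical watchdog-enabled daemon. WatchdogSec= is the systemd.service(5) option that enables the watchdog; the 30s value is an illustrative assumption, as is the premise that the daemon sends regular keep-alive notifications.

```ini
[Service]
# Assumed: the daemon regularly pings systemd with
# sd_notify("WATCHDOG=1"); if it stops, the watchdog fires.
WatchdogSec=30s

# Restarts the service in all five cases listed above.
Restart=on-failure

# The more conservative alternative: restart only after a core dump
# or an abortive signal; any other failure leaves the unit in the
# "failed" state.
#Restart=on-abort
```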

Replying to [comment:10 lennart]:

A failure in the sense of "on-failure" is five different things:
<snip>
- the main daemon exits with an exit code != 0
I'd expect this to happen on invalid configuration, and in that case restarting is undesirable...
- the watchdog feature is enabled for a service and a service fails to respond
... but the natural action to take on a watchdog failure is to restart the service first (and only let the hardware watchdog trigger, or forcibly reboot the system, if restarting the service didn't help).

So, I'm not sure what a good default would be. Yet Another Config Option perhaps?

Replying to [comment:11 mitr]:

Replying to [comment:10 lennart]:

A failure in the sense of "on-failure" is five different things:
<snip>
- the main daemon exits with an exit code != 0
I'd expect this to happen on invalid configuration, and in that case restarting is undesirable...

But this would happen only in the startup phase of the service (i.e. as opposed to the runtime phase of the service, which begins after the parent exits for double-forking Unix services) where we don't restart anyway.

  • the watchdog feature is enabled for a service and a service fails to respond
    ... but the natural action to take on a watchdog failure is to restart the service first (and only let the hardware watchdog trigger, or forcibly reboot the system, if restarting the service didn't help).

Note that watchdog support is useful only for very special, patched services. We have exactly zero of these in Fedora right now, so I wouldn't want to think about that too much in this context.

So, I'm not sure what a good default would be. Yet Another Config Option perhaps?

Well, I am open to adding one or two more on-xxx settings for Restart= with combinations that make sense.

Would be +1 on on-abort or on-failure, but would prefer on-abort.

Lennart, the FPC would strongly prefer it if systemd set the unit default to Restart=on-abort, and did it in a way that did not require it to be boilerplated into each unit file in Fedora. (+1:5, 0:1, -1:0)

Replying to [comment:14 spot]:

Lennart, the FPC would strongly prefer it if systemd set the unit default to Restart=on-abort, and did it in a way that did not require it to be boilerplated into each unit file in Fedora. (+1:5, 0:1, -1:0)

Humm, implying a default of Restart=on-abort is something I really think we should avoid, simply because restarting is an additional feature that applies only to a subset of services, and not all. For example, for all services that are socket or bus activatable automatic restart is not desirable, since this would break exit-on-idle logic (which is actually something we want more people to use). I mean, think of stuff like SSHD or CUPS. These don't have to run all the time. Starting them when somebody actually connects via ssh or tries to print is much preferable, and having them exit after a while is really nice too. And then there are a lot of one-time units where automatic restarting makes little sense too. Stuff like NetworkManager-wait-online.service, or fedora-wait-storage.service or things like that.

I am pretty sure that restarting should be opt-in, not opt-out. And FPC should just recommend to use it, but leave it to the packagers to actually make use of it.

Also, changing systemd to default to auto restart would be a major change in behaviour and would break an immense number of systems, simply because software suddenly automatically restarts that didn't use to, and a lot of projects do rely on exit-on-idle working correctly. Even if it was a good idea to do restart as the implied default (which I don't think it is): it's too late for that.
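A sketch of the kind of setup described here: a socket unit that starts a hypothetical exit-on-idle daemon on demand. The unit names, port number and binary path are invented for illustration, and the daemon is assumed to accept the listening socket passed in by systemd. Restart= is left at its default on the service, following the reasoning above.

```ini
# exampled.socket (hypothetical)
[Socket]
ListenStream=12345

[Install]
WantedBy=sockets.target

# exampled.service (hypothetical, kept in a separate file)
[Service]
ExecStart=/usr/sbin/exampled
# No Restart= line: the default is "no". After the daemon exits on
# idle it stays down until the next connection on the socket.
```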

Why would Restart=on-abort break exit-on-idle or socket activation? Why would I not want all three of those features?

Seriously if on-abort is broken like that, I'd be -1 on anything to do with it.

I can see how it's true for "Type=oneshot" ... but obviously don't have that default for those! And having that default in systemd is probably much easier than writing Fedora policy saying everyone should do that and getting everyone to do it.

I am pretty sure that restarting should be opt-in, not opt-out.

I'd disagree strongly ... I can't see many people who will want their services to disappear until they've noticed (Oh, it could have fixed itself but didn't ... gee thanks).

Your entire ticket is basically "please get all services to add this magic line which will make systemd do the obviously right thing, except in this (currently) one case where it's obviously the wrong thing to do".

Lennart, the FPC would appreciate it if you could provide specific draft text for us to consider which describes the various Restart= use cases, and the scenarios in which Fedora unit files should be using them.

It is unclear to us why a default of "on-abort" (which can be overridden on a per-unit basis) is a problem, but hopefully this requested draft text will make the situation clear to us.

FYI the topic here is a bit misleading.

With my admin hat on I say we should not recommend or set default Restart=$foo behaviour until systemd has a method to notify users when it occurs, either via email and/or via some type of notification to the *DE they are running.

There are also only very specific scenarios where "Restart=always" is needed, and we should never suggest its usage, because when a service fails it fails for a reason, and that needs to be looked into from both an administrative and a development point of view.

Quite frankly I'm amazed that Lennart suggests "Restart=always". I guess he has never had to deal with the bugs against various components, or those that have been filed directly against systemd, because users/administrators added it without realizing that they stopped being able to stop the service manually once they did...

Now the proper default behaviour is "Restart=on-failure" and we should default to it once some kind of Restart=on-failure notification process is in place in systemd when it occurs either by default in systemd or by patching all those units in roughly 600 components ( which I can do myself if necessary )

Please update this ticket regarding its continued relevance, providing any
information requested. If this is not done within the next two weeks, this
ticket may be closed due to inactivity. Thank you!

This ticket is being closed due to inactivity. If the issue referenced has
not been resolved, please reopen the ticket and provide the information
requested. Thank you!

OK, let's try this again.

systemd in rawhide now has a new Restart=on-abnormal setting, as discussed around comment 12. If set it will cause the services to restart on all "abnormal" failures, which includes unclean signal exits, coredumps, watchdog and other timeouts, but which does not include clean or unclean exit codes, or clean signal exits. For daemons that shall be able to escape the constant cycle of restarts and indicate so with an exit code, this is a good option.

I have also changed the systemd upstream docs to recommend using Restart=on-failure or Restart=on-abnormal (depending on whether the service shall be able to escape the restart cycle with an error or not) for all long-running services, and I think Fedora should suggest the same in its packaging guidelines.

The man page now contains a table explaining which settings cause a restart on which conditions:

http://www.freedesktop.org/software/systemd/man/systemd.service.html

And before somebody brings this up again: Restart= is a setting useful for long-running services only (which usually excludes socket-activated exit-on-idle services). Also, it has traditionally defaulted to off. This is why we cannot and will not enable this globally and implicitly for all services. Also, in some cases it might be a good idea to avoid restarts altogether, for example when software is known to corrupt its runtime data if it is simply restarted after a crash. This also means that it depends on the specific daemon which restart setting is the right one to choose. Hence, restarting must be opt-in, not opt-out.

Hence, please add something like this to the policy:

"If you package a long-running service, please consider enabling systemd's automatic restart feature for it, to improve reliability by making sure the system automatically attempts recovering a failing daemon. Please use Restart=on-failure or Restart=on-abnormal for this. The former will tell systemd to restart the daemon as soon as it fails regardless of the precise reason. It's a good choice for most long-running services. Some daemons require a way to escape constant restarting by exiting with any non-zero exit code. For those services use Restart=on-abnormal, which will still restart the daemon when it fails "abnormally", on unclean signal, core dump, timeout or watchdog exits, but not on unclean exit codes. It is recommended to enable automatic restarts for all long-running services, but which setting is the right one, and whether it is useful at all, depends on the specific service. Please consult the systemd.service(5) man page for more information on the various settings."
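As a sketch of how the two settings named in this proposed text might look in a unit file (the daemon and its path are hypothetical placeholders, and Restart=on-abnormal needs a systemd new enough to support it):

```ini
[Service]
# Hypothetical long-running daemon.
ExecStart=/usr/sbin/exampled

# Suggested choice for most long-running services: restart on any
# failure, including non-zero exit codes.
Restart=on-failure

# For daemons that must be able to stop the restart cycle by exiting
# with a non-zero exit code, use this line instead:
#Restart=on-abnormal
```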

Anyway, would be great to get this into the policy. I am also pushing the Debian folks to do the same, so that upstream folks start shipping their daemons with this in place.

(Oh, and the policy should only make this recommendation for F21 and newer, as mentioned Restart=on-abnormal doesn't exist earlier)

additional recommendation text around restart values approved (+1:6, 0:0, -1:0)

Have the packaging guidelines been updated yet?

Guidelines updated. Sorry for the delay.

Metadata Update from @lennart:
- Issue assigned to spot

2 years ago
