

ElectricSpoon|2 years ago

> I would guess the developers wanted to prevent laptops running out of battery too quickly

And I would guess sysadmins also don't like their logging facilities filling the disks just because a service is stuck in a start loop. There are many reasons to think that a service which has failed to start several times in a row isn't going to start on the next attempt either. Misconfiguration is probably the most frequent one.
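This matches systemd's stock rate limiting: by default a unit may be started at most 5 times within 10 seconds before it is put into the failed state. A minimal sketch (the unit and binary names here are made up for illustration):

```ini
# /etc/systemd/system/example.service (hypothetical unit)
[Unit]
Description=Example service
# These are the compiled-in defaults; they can also be changed globally
# via DefaultStartLimitIntervalSec= / DefaultStartLimitBurst= in
# /etc/systemd/system.conf.
StartLimitIntervalSec=10s
StartLimitBurst=5

[Service]
ExecStart=/usr/local/bin/example-daemon
# Restart on abnormal exit, but give up once the unit has been
# (re)started more than 5 times within any 10-second window.
Restart=on-failure
```

Once the limit trips, the unit stays failed until a human clears it, e.g. with `systemctl reset-failed example.service`.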


twic|2 years ago

Exactly. If a service crashes within a second ten times in a row, it's not going to come up cleanly an eleventh time. The right thing to do is stay down, and let monitoring get the attention of a human operator who can figure out what the problem is. Continually rebooting is just going to fill up logs, spam other services, and generally make trouble.

I'm sure there are exceptions to this. For those, set Restart=always. But it's an absolutely terrible default.
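For those exceptional services, the start-rate limit can be disabled outright; a sketch, with hypothetical names:

```ini
[Unit]
Description=Service that should retry forever
# Setting the interval to 0 disables start-rate limiting entirely.
StartLimitIntervalSec=0

[Service]
ExecStart=/usr/local/bin/flaky-daemon
Restart=always
# Wait between attempts so a crash loop doesn't spin hot and
# flood the logs (the default RestartSec is only 100ms).
RestartSec=5
```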

BenjiWiebe|2 years ago

It might actually, if a network connection is temporarily down.

growse|2 years ago

Interestingly, the Kubernetes approach is the opposite one. Dependencies between pods / software components are encouraged to be a little softer, so that the scheduler is simpler.

Starting up, noticing that the environment doesn't have what you need yet and dying quickly appears to be The Kubernetes Way. A scheduler will eventually restart you and you'll have another go. Repeat until everything is up.
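In pod terms, that behavior falls out of the default restart policy plus the kubelet's exponential crash backoff. A minimal sketch (the pod and image names are invented):

```yaml
# Hypothetical pod: the container exits immediately whenever its
# dependencies aren't up yet, and gets restarted until they are.
apiVersion: v1
kind: Pod
metadata:
  name: crash-until-ready
spec:
  # "Always" is the default; the kubelet restarts the container with
  # exponential backoff (10s, 20s, 40s, ... capped at 5 minutes),
  # reporting CrashLoopBackOff in the meantime. The backoff resets
  # after the container has run successfully for a while.
  restartPolicy: Always
  containers:
    - name: app
      image: example.com/app:latest
```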

The kubelet operates the same way, AFAIR. On a node that hasn't joined a cluster yet, it sits in a fail/restart loop until it's provisioned.

deathanatos|2 years ago

Heh. We used syslog at one place, configured to push logs into ELK. The ingestion into ELK broke … which caused syslog to start logging that it couldn't forward logs. Now that might seem like screaming into a void, but that log went to local disk, and syslog retried as fast as the disk would allow, so instantly every machine in the fleet started filling up its disk with logs.

(You can guess how we noticed the problem…)

Also logrotate. (And bounded on size.)

freedomben|2 years ago

It's wild how easy it is to misconfigure (or fail to configure) logrotate and have a log file fill up the disk. Out of memory and out of disk are the two error cases that have caused the most pain in my career. I think most people who started with Docker in the early days (long before `docker system prune` existed) had this happen: old containers and images filled up the disk and wreaked havoc at an unsuspecting moment.
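For reference, a size-bounded logrotate config that avoids the runaway-log failure mode might look roughly like this (the path is illustrative):

```
# /etc/logrotate.d/example (hypothetical)
/var/log/example/*.log {
    # Rotate as soon as a file exceeds 100 MB (checked on each
    # logrotate run), rather than only on a daily/weekly schedule.
    size 100M
    # Keep at most 5 old rotations, so total disk usage is bounded.
    rotate 5
    compress
    delaycompress
    missingok
    notifempty
    # For daemons that keep the log file descriptor open and can't
    # be signaled to reopen their log file after rotation.
    copytruncate
}
```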

melolife|2 years ago

I've seen badly designed services with, e.g.,

Before=systemd-user-sessions.service

This means that as long as systemd is trying to (re)start the service, nobody can log in. Which is a problem with infinite restarts.

It's still pretty easy to accidentally set up an infinite restart loop with the default settings if your service takes more than 2s to crash.
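To make the arithmetic concrete: with the defaults (StartLimitBurst=5 within StartLimitIntervalSec=10s), a service whose crash cycle lasts longer than about 2 seconds never accumulates 5 starts inside any 10-second window, so the limiter never trips. A hypothetical unit hitting exactly that case:

```ini
[Unit]
Description=Crashes ~3s after start; restarts forever under defaults

[Service]
# Hypothetical binary that exits nonzero about 3 seconds after start.
ExecStart=/usr/local/bin/slow-crasher
Restart=on-failure
# With the default RestartSec of 100ms, each failure cycle lasts
# ~3.1s, so the fifth start comes ~12s after the first. That is
# outside the default 10-second window, so the start limit
# (5 starts per 10s) is never reached and the loop runs forever.
```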