I've been bitten by the surprising amount of time it takes for Kubernetes to update load balancer target IPs in some configurations. For me, 90% of the graceful shutdown battle was just ensuring that traffic was actually being drained before pod termination.
Adding a global preStop hook with a 15 second sleep did wonders for our HTTP 503 rates. This creates time between when load balancer deregistration is kicked off and when SIGTERM is actually passed to the application, which in turn simplifies a lot of the application-side handling.
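For reference (not from the parent comment), a sketch of that hook in a pod spec. The 45-second grace period and container name are assumptions; the grace period just needs to exceed the sleep plus the app's own drain time:

```yaml
# Illustrative fragment: delay SIGTERM by 15s so load balancer
# deregistration can finish before the app begins shutting down.
spec:
  terminationGracePeriodSeconds: 45   # must cover the sleep + app drain time
  containers:
    - name: app                       # hypothetical container name
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]  # requires a sleep binary in the image
```

Newer Kubernetes versions (1.30+) also offer a native `sleep` preStop action, which avoids depending on a `sleep` binary being present in the image.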
Yes. A preStop sleep is the magic SLO solution for high-quality rolling deployments.
IMHO, there are two things Kubernetes could improve on:
1. Pods should be removed from Endpoints _before_ the shutdown sequence is initiated. Like the termination grace period, there should be an option for a termination delay.
2. PDBs should allow an option for recreation _before_ eviction.
another factor to consider is that if you have a typical Prometheus `/metrics` endpoint that gets scraped every N seconds, there's a period in between the "final" scrape and the actual process exit where any recorded metrics won't get propagated. this may give you a false impression about whether there are any errors occurring during the shutdown sequence.
it's also possible, if you're not careful, to lose the last few seconds of logs from when your service is shutting down. for example, if you write to a log file that is watched by a sidecar process such as Promtail or Vector, and on startup the service truncates and starts writing to that same path, you've got a race condition that can cause you to lose logs from the shutdown.
Is it just me, or are observability stacks kind of ridiculous? Logs, metrics, and traces, each with their own databases, sidecars, visualization stacks. Language-specific integration libraries written by whoever felt like it. MASSIVE cloud bills.
Then, after you go through all that effort, most of the data is utterly ignored, and rarely are the business insights much better than the trailer-park version: ssh'ing into a box and grepping a log file to find the error output.
Like we put so much effort into this ecosystem but I don't think it has paid us back with any significant increase in uptime, performance, or ergonomics.
Jfyi, I'm doing exactly this (and more) in a platform library; it covers the issues I've encountered during the 8+ years I've been working on high-load Go apps. During this time, developing and improving the platform and rolling it out was a hobby of mine at every company :)
It (will) cover the stuff like "sync the logs"/"wait for ingresses to catch up with the liveness handler"/etc.
https://github.com/utrack/caisson-go/blob/main/caiapp/caiapp...
https://github.com/utrack/caisson-go/tree/main/closer
The docs are sparse and some things aren't covered yet; however I'm planning to do the first release once I'm back from a holiday.
In the end, this will be a meta-platform (carefully crafted building blocks), and a reference platform library, covering a typical k8s/otel/grpc+http infrastructure.
> another factor to consider is that if you have a typical Prometheus `/metrics` endpoint that gets scraped every N seconds, there's a period in between the "final" scrape and the actual process exit where any recorded metrics won't get propagated. this may give you a false impression about whether there are any errors occurring during the shutdown sequence.
Have you come across any convenient solution for this? If my scrape interval is 15 seconds, I don't exactly have 30 seconds to record two scrapes.
This behavior has sort of been the reason why our services still use statsd, since the push-based model doesn't have this problem.
And I believe in that so much that I don't even consider graceful shutdown in design. Components should be able to safely (and even frequently) hard-crash, and as long as a critical percentage of the system is working as intended, it shouldn't meaningfully impact the overall system.
The only way to make sure a system can handle components hard crashing, is if hard crashing is a normal thing that happens all the time.
All glory to the chaos monkey!
That my application went down from SIGINT makes a big difference compared to kill. Blue-green migrations, for example, require graceful exit behavior.
Yeah. However, I do not need to pull the plug to shut things down even if the software was designed to tolerate that.
On second thought, though, maybe I do. That might be the only way to ensure the assumption holds. Like Netflix's Chaos Monkey thing from a few years ago.
Relying on graceful exit and supporting it are two different things. You want to support it so you can stop serving clients without giving them nasty 5xx errors.
I was hoping the article would describe how to perform an application restart without dropping a single incoming connection, where the new service instance receives the listening socket from the old instance.
It is relatively straightforward to implement under systemd, and nginx has supported it for over 20 years. Sadly, Kubernetes and Docker have no support for it, assuming instead that it is done in the load balancer or reverse proxy.
This is one of the things I think Elixir handles really smartly. I'm not very experienced in it, but it seems to me that having your application designed around tiny VM processes that are meant to panic, quit, and get respawned eliminates the need to intentionally create graceful shutdown routines, because this is already embedded in the application architecture.
I find that I typically have a few services that I need to start up, and sometimes they have different mechanisms for start-up and shutdown. Sometimes you need to instantiate an object first, sometimes you have a context you want to cancel, other times you have a "Stop" method to call.
I designed the library to help me consolidate this all in one place with a unified API.
Haha, I had the exact same idea, though my API looks a bit less elegant. Maybe it's because it allows the caller to set up multiple signals to handle, and to choose how each is handled.
https://pkg.go.dev/git.sr.ht/~mariusor/wrapper#example-Regis...
> After updating the readiness probe to indicate the pod is no longer ready, wait a few seconds to give the system time to stop sending new requests.
> The exact wait time depends on your readiness probe configuration
A terminating pod is not ready by definition. The service will also mark the endpoint as terminating (and as not ready). This occurs on the transition into Terminating; you don't have to fail a readiness check to cause it.
(I don't know about the ordering of the SIGTERM & the various updates to the objects such as Pod.status or the endpoint slice; there might be a small window after SIGTERM where you could still get a connection, but it isn't the large "until we fail a readiness check" TFA implies.)
(And as someone who manages clusters, honestly, that infinitesimal window probably doesn't matter. Just stop accepting new connections, gracefully close existing ones, and terminate reasonably fast. But I feel like half of the apps I work with fall into either a bucket of "handle SIGTERM & take forever to terminate" or "fail to handle SIGTERM (and take forever to terminate)".)
We've adopted Google Wire for some projects at JustWatch, and it's been a game changer. It's surprisingly under the radar, but it helped us eliminate messy shutdown logic in Kubernetes. Wire forces clean dependency injection, so now everything shuts down in order instead... well who knows :-D
https://go.dev/blog/wire
https://github.com/google/wire
I tend to use a waitgroup plus context pattern. Any internal service which needs to wind down for shutdown gets a context which it can listen to in a goroutine to start shutting down, and a waitgroup to indicate that it is finished shutting down.
Then the main app goroutine can close the context when it wants to shutdown, and block on the waitgroup until everything is closed.
If you look at the article, it presents some additional niceties, like having middleware that is aware of the shutdown - though they didn't detail exactly how the WithCancellation() function is doing that.
So if you send a SIGINT/SIGTERM signal to the server, there's a delay to clean up resources, during which new requests don't get a response that tries to access those resources and fails in unexpected ways, but instead a configurable "not in service" error.