Introduction to Golang Preemption Mechanisms

109 points | lcof | 1 year ago | unskilled.blog

28 comments

__turbobrew__|1 year ago

Are there any proposals to make the golang runtime cgroup aware? Last time I checked, the Go runtime will spawn an OS thread for each CPU it can see, even if it is running in a cgroup which only allows 1 CPU of usage. On servers with 100+ cores I have seen scheduling time take up over 10% of the program runtime.

The fix is to inspect cgroupfs to see how many CPU shares you can utilize and then set GOMAXPROCS to match. I think other runtimes like Java and .NET do this automatically.

It is the same thing with GOMEMLIMIT: I don’t see why the runtime does not inspect cgroupfs and set GOMEMLIMIT to 90% of the cgroup memory limit.

jrockway|1 year ago

I am guessing the API isn't stable enough for letting the runtime set maxprocs. I use https://pkg.go.dev/go.uber.org/automaxprocs and have had to update it periodically because Redhat and Debian have different defaults. (Should one even run k8s on Redhat? I say no, but Redhat says yes. That's how I know about this.)

This, I think, is cgroups 1 vs. cgroups 2 and everyone should have cgroups 2 now, but ... it would feel weird for the Go runtime to decide on one. To me, anyway.

metadat|1 year ago

> think about it - what if I suddenly stopped you while taking a dump? It would have been easier to have stopped you before, or after, but not in the middle of it. Not sure about the analogy, but you got it.

Gold.

ollien|1 year ago

Great post! One question that lingered for me: what are asynchronous safe-points? The post goes into some detail about their synchronous counterparts.

MathMonkeyMan|1 year ago

I don't know, but I remember hearing in a talk that the compiler had to be modified to insert them into the generated code.

Here's `isAsyncSafePoint`: https://github.com/golang/go/blob/d36353499f673c89a267a489be...

edit: The comments at the top of that file say:

    // 3. Asynchronous safe-points occur at any instruction in user code
    //    where the goroutine can be safely paused and a conservative
    //    stack and register scan can find stack roots. The runtime can
    //    stop a goroutine at an async safe-point using a signal.

gregors|1 year ago

If you'd like to see a really well done deep dive by the new Golang Tech Lead (Austin Clements), check this out:

GopherCon 2020: Austin Clements - Pardon the Interruption: Loop Preemption in Go 1.14

https://www.youtube.com/watch?v=1I1WmeSjRSw

zbentley|1 year ago

Interesting that it’s temporal (according to the article, you have around 10 microseconds before the signal-based preempter kicks in). How bad is performance if the load on the host is so high that double-preempting is common, I wonder? Or am I missing something and that question is not meaningful?

lcof|1 year ago

No, it’s an interesting comment. This is not really about load, but about control flow: if a goroutine is just spinning wildly without going through any function prologue, it won’t even be aware of the synchronous preemption request. Asynchronous preemption (signal-based) is mainly (I say “mainly” because I am not sure I can say “only”) for this kind of situation.

I don’t have the link handy, but Twitch had this kind of issue with base64 decoding in some of their servers. The GC would try to STW, but there would always be one or a few goroutines decoding base64 in a tight loop whenever the STW was attempted, delaying it again and again.

Asynchronous preemption is a solution to this kind of issue. Load is not the issue here, as long as you go through the runtime often enough.

jerf|1 year ago

You'll actually see that this is a general concurrency pattern, far beyond Go. It is certainly ideal to trigger off of some signal (in a very, very generic sense of the term, not an OS signal) in order to trigger some other process (again in a very generic sense), but in general, if you're writing concurrent code you should always have some sort of time-based fallback for any wait you are doing, because you will hit it out in the field. It's all kinds of no fun to have processes that will wait forever. (Unless you're really sure you need that.)

hiyer|1 year ago

This is a well-written article, but one thing that wasn't clear to me was how the runtime determines that it's at a safe point. Can someone shed some light on that?

MathMonkeyMan|1 year ago

The runtime never determines whether the goroutine is at a safe point. It "poisons" the stack guard so that the next time the goroutine reaches a function prologue, which is a safe point, it examines the stack guard and knows that it has been preempted.

Then there's the async case for tight loops that I remember reading about back in 2020 (it uses unix signals), but don't yet fully grok the specifics.