One thing that really goes against my intuition is that user space threads (lightweight threads, goroutines) are faster than kernel threads. Without knowing too much assembly, I would assume any modern processor would make a context switch a one instruction affair. Interrupt -> small scheduler code picks the thread to run -> LOAD THREAD instruction and the processor swaps in all the registers and the instruction pointer.
You probably can't beat that in user space, especially if you want to preempt threads yourself. You'd have to check after every step, or profile your own process or something like that. And indeed, Go's scheduler is cooperative.
But then, why can't you get the performance of Goroutines with OS threads? Is it just because of legacy issues? Or does it only work with cooperative threading, which requires language support?
One thing I'm missing from that article is how the cooperativeness is implemented. I think in Go (and in Java's Project Loom), you have "normal code", but then deep down in network and IO functions, you have magic "yield" instructions. So all the layers above can pretend they are running on regular threads, and you avoid the "colored function problem", but you get runtime behavior similar to coroutines. Which only works if really every blocking IO is modified to include yielding behavior. If you call a blocking OS function, I assume something bad will happen.
It hasn't been cooperative for a few versions now; the scheduler became preemptive in 1.14. And before that there were yield points at every function prologue (as well as all IO primitives), so there were relatively few situations where cooperation was necessary.
> Without knowing too much assembly, I would assume any modern processor would make a context switch a one instruction affair.
Any context switch (to the kernel) is expensive, and way more than a single operation. The kernel also has a ton of work to do; it's not just "pick the thread to run". You have to restore the instruction and stack pointers, but you may also have to restore FPU/SSE/AVX state (AVX-512 alone is over 2KB of state), trap state, and so on.
There are a few reasons why context switching in user mode could be faster, but that has little if anything to do with the performance benefits of usermode threads. The performance benefit of usermode threads is a result of their quantity, due to Little's law. They're not "faster", just more numerous, and that's what you need for higher throughput. More precisely, OS threads, because of their scarcity, introduce an artificial bound on throughput that's lower than what the hardware can support, and usermode threads remove that bound.
As to why it's hard for the OS to allow that many threads, the OS would need to keep thread stacks small and resizable, and that is hard to do if you don't know the specifics of how the language uses the stack. For example, to accommodate low-level languages that allow pointers into the stack you would need to manipulate virtual memory (to keep addresses valid), but that only works at a page granularity, or to introduce split stacks, which would require a new kind of ABI known to compilers (and would probably have a cost to performance).
> I would assume any modern processor would make a context switch a one instruction affair. Interrupt -> small scheduler code picks the thread to run -> LOAD THREAD instruction and the processor swaps in all the registers and the instruction pointer.
It can't be a single instruction, since the details of what a "context" contains depends on the OS and ABI. For example on Linux, the signal mask is a part of the OS thread context (but usually not user thread contexts) which requires a syscall to retrieve it from kernel memory before saving it in the context.
The reason why user threads are so much faster than OS threads is precisely because it can be reduced to a handful of instructions without caring about all the details that OS threads need to care about.
> Which only works if really every blocking IO is modified to include yielding behavior. If you call a blocking OS function, I assume something bad will happen.
That's exactly what Go does, they introduce yield points into function prologues and i/o ops. You don't have direct FFI calls in Go so it's not as big of an issue. It's roughly the same problem as GC safepoints in multithreaded interpreters that support FFI.
There is a ton of context associated with OS/kernel threads. Virtual memory, security, I/O. While there is some hardware acceleration for those in modern processors there isn't anything like LOAD THREAD and even with CPU support it's still very expensive.
You get an interrupt, then the kernel needs to load its own context (the tables it needs to access), then the kernel needs to do the expensive switch.
In user space you have a lot less context. The actual switching is pretty much the cost of a function call. If you need preemption that's a different story and mostly depends on what facilities are available for that. Inserting preemption checks is a little hacky (hello Go ;) ) but what can you do.
EDIT: It's worth noting there are indirect costs, like caches going stale. Lightweight/green threads will often work on shared data structures, so the caches are more likely to have something useful in them. They may even share the code.
One of the issues is that OS schedulers are complex, and actually much more expensive than context switch itself. You can mitigate this with user-mode scheduling of kernel threads: https://www.youtube.com/watch?v=KXuZi9aeGTw
I don't have any specific recommendations to give you, but skim through an operating systems text book, or college course that puts its slides and whatnot online, when it comes to kernel context switching. It'll give you an idea of what kind of work a kernel must do when context switching between threads and processes, and why userspace multitasking can be more efficient.
>I would assume any modern processor would make a context switch a one instruction affair.
That has been the historic assumption, and it has been proven wrong by every possible benchmark.
Consider TechEmpower[0] for raw stack performance: runtime-level threads outperform OS threads, since OS threads were designed to be mapped onto physical cores.
This is very expensive and inefficient.
Creating one thread for every request you have (Apache + PHP) will exhaust the hardware after a few thousand QPS.
Runtimes can indeed have millions of those “lightweight threads” without killing your machine, since they multiplex them over a pool of OS threads and tap into IO events to efficiently switch or resume contexts. This is by far much faster.
On Linux, 99% of the time, you want kernel threads. They can be configured to use very little stack memory. They can be configured to reduce the scheduling overhead by using NOOP scheduling which works for certain workloads. They are rock solid.
Spawning millions of unnecessary user-space threads because they are supposed to be more efficient than kernel threads is rarely the best solution to any problem imho.
> modern processor would make a context switch a one instruction affair.
Reduced instruction set (complexity) is a hallmark of modern processor designs, not the other way around.
You might want to read about what is involved in a task switch (either "thread" with same memory mapping, or "process") but it is not something that is conducive to reasonably carry out in one instruction.
I mean, it has many diagrams and logical explanations of goroutines and concurrency concepts in general, but it is definitely not an under-the-hood description.
> A newly minted goroutine is given a few kilobytes
a line later
> It is practical to create hundreds of thousands of goroutines in the same address space
So it's not practical to create 100s of Ks of goroutines. It's possible, sure, but the GBs of memory overhead you incur by actually creating that many goroutines mean that for any practical problem you are going to want to stick to a few thousand goroutines. I can almost guarantee you that you have something better to do with those GBs of memory than store goroutine stacks.
Asking the scheduler to handle scheduling 100s of Ks of goroutines is also not a great idea in my experience either.
> So it's not practical to create 100s of Ks of goroutines. It's possible, sure, but the GBs of memory overhead you incur by actually creating that many goroutines mean that for any practical problem you are going to want to stick to a few thousand goroutines. I can almost guarantee you that you have something better to do with those GBs of memory than store goroutine stacks.
You lost me in a couple places:
1) "GBs of memory overhead" being a lot. A rule of thumb I've seen in a datacenter situation is that (iirc) 1 hyperthread and 6 GiB of RAM are roughly equivalent in cost. (I'm sure it varies over processor/RAM generations, so you should probably check this on your platform rather than take my word for it.) I think most engineers are way too stingy with RAM. It often makes sense to use more of it to reduce CPU, and to just spend it on developer convenience. Additionally, often one goroutine matches up to one incoming or outgoing socket connection (see below). How much RAM are you spending per connection on socket buffers? Probably a lot more than a few kilobytes...
2) The idea that you target a certain number of goroutines. They model some activity, often a connection or request. I don't target a certain number of those; I target filling the machine. (Either the most constrained resource of CPU/RAM/SSD/disk/network if you're the only thing running there, or a decent chunk of it with Kubernetes or whatever, bin-packing to use all dimensions of the machine as best as possible.) Unless the goroutines' work is exclusively CPU-bound, of course, then you want them to match the number of CPUs available, so thousands is too much already.
Why is spending GB on stack space a bad thing? Ultimately, in a server, you need to store state for each request. Whether that's on the stack or heap, it's still memory that necessarily has to be used.
If you asked me what “a few kb” times “hundreds of thousands” is, I’d have characterized it as “more than a few hundreds of thousands of kb”, not necessarily “gigabytes,” and that doesn’t sound impractical at all. My JVM heaps are usually 16GB.
And go actually does a pretty good job of scheduling hundreds of thousands of threads. 6 months ago I did some fairly abusive high-thread-count experiments solving problems in silly ways that required all N goroutines to participate and I didn’t see much perf falloff on my laptop until I got 1.5-2 million goroutines.
In addition to the other comments about memory usage, I’ll mention that there is a proposal (that’s either going to make it into Go 1.19 or 1.20?) that uses heuristics to determine a good starting stack size for goroutines.
My experience is that whatever you're doing with the goroutine is usually a bottleneck before the goroutine itself. E.g. if you make a network request, you become network-bound before you become memory-bound from goroutines.
While the blog is a great introductory post, https://www.youtube.com/watch?v=KBZlN0izeiY is a great watch if you're interested in the magical optimizations in goroutine scheduling.
Having used both in production for many years, Go’s model is waayyyy better, mostly because Python’s model results in a bunch of runtime bugs.
The least of which are type error things like forgetting to await an async function—these can be caught with a type checker (although this means you need to have a type checker running in your CI and annotations for all of your dependencies).
The most serious are the ones where someone calls a sync or CPU-heavy function (directly or transitively) and it starves the event loop causing timeouts in unrelated endpoints and eventually bringing down the entire application (load shedding can help mitigate this somewhat). Go dodges these problems by not having sync functions at all (everything is async under the covers) and parallelism means CPU-bound workloads don’t block the whole event loop.
>"Go run-time scheduler multiplexes goroutines onto threads and when a thread blocks, the run-time moves the blocked goroutines to another runnable kernel thread to achieve the highest efficiency possible."
Why would the Go run-time move the blocked goroutines to another runnable kernel thread? If it is currently blocked it won't be schedulable regardless no?
I haven't read the article but, generally, the main thread pool is designed to utilise the whole processor. So on a 12-core system there will be 12 OS threads; no point in having more. But if the program opens just 12 big files, all threads become blocked and that's obviously tragic: other goroutines are starved, network sessions time out, timers don't fire, chaos! Thus, whenever a thread in the main pool blocks, you move that thread outside the pool and spawn a fresh one that can keep churning through the compute.
There's also a subtler benefit. Each user thread has a context, e.g. its local run queue. Now, if a thread blocks, others need to help it out and steal its work. Go improves on that with a nice handover system, so no random stealing is necessary. Further, by taking context off blocked threads, it keeps all the tasks more centralised. There are probably at most a few tens of processors on any common hardware, but there could be thousands of threads; it's better to tie run queues to the former rather than the latter.
captainmuon|3 years ago
masklinn|3 years ago
Kernel-level context switching costs on the order of 10x what userland context switching does: https://eli.thegreenplace.net/2018/measuring-context-switchi...
> LOAD THREAD
There is no LOAD THREAD instruction.
pron|3 years ago
More here: https://inside.java/2020/08/07/loom-performance/
duped|3 years ago
YZF|3 years ago
garaetjjte|3 years ago
gpderetta|3 years ago
heavyset_go|3 years ago
asien|3 years ago
[0] https://www.techempower.com/benchmarks/#section=data-r20&hw=...
seunosewa|3 years ago
danachow|3 years ago
geodel|3 years ago
geertj|3 years ago
jeffbee|3 years ago
aaronbwebber|3 years ago
scottlamb|3 years ago
uluyol|3 years ago
hamburglar|3 years ago
barsonme|3 years ago
sumy23|3 years ago
rounakdatta|3 years ago
qwertox|3 years ago
This article (which I have not read but just skimmed) made me search for a simple example, and I landed at "A Tour of Go - Goroutines"[0]
That is one of the cleanest examples I've ever seen on this topic, and it shows just how well integrated they are in the language.
[0] https://go.dev/tour/concurrency/1
throwaway894345|3 years ago
bogomipz|3 years ago
kangda123|3 years ago
hoosieree|3 years ago
But any other tasks queued up behind the blocked task could be moved to another OS thread where they'll have a chance of running.
guenthert|3 years ago
klodolph|3 years ago
There have been various implementations of M:N threads for some time now. The concept is simple, but the devil is in the details.
yellow_lead|3 years ago