Simon Tatham, author of PuTTY, has quite a detailed blog post [0] on using C++20's coroutine system. And yep, it's a lot to do on your own; C++26 really ought to give us some pre-built templates/patterns/scaffolds.
You can roll stackful coroutines in C++ (or C) with 50-ish lines of Assembly. It's a matter of saving a few registers and switching the stack pointer; minicoro [1] is a pretty good C library that does it. I like this model a lot more than C++20 coroutines:
1. C++20 coros are stackless; in the general case every async "function call" heap allocates.
2. If you do your own stackful coroutines, every function can suspend/resume, so you don't have to deal with colored functions.
3. (opinion) C++20 coros are very tasteless and "C++-design-committee pilled". They're very hard to understand and implement, they require the STL, they're very heavy in debug builds, and you'll end up with template hell to do something as simple as Promise.all.
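To make the "every function can suspend" point concrete, here is a minimal sketch of a stackful coroutine. It swaps in POSIX ucontext (getcontext/makecontext/swapcontext) for the hand-written assembly described above, so it is not how minicoro does it, just the same idea on a Linux/glibc-style system. All names (stackful_demo, suspend, nested_work, run) are made up for illustration.

```cpp
#include <cassert>
#include <ucontext.h>
#include <vector>

// Hypothetical names throughout; POSIX ucontext stands in for the
// register-saving assembly a real library would use.
namespace stackful_demo {

ucontext_t main_ctx;      // saved state of the caller
ucontext_t coro_ctx;      // saved state of the coroutine
std::vector<int> trace;   // records how the two sides interleave

// An ordinary function: it can suspend without any special return type.
void suspend() { swapcontext(&coro_ctx, &main_ctx); }

void nested_work(int base) {
    trace.push_back(base);
    suspend();            // suspends from two call levels deep
    trace.push_back(base + 1);
}

void coro_entry() {
    nested_work(10);
    nested_work(20);
}                         // returning switches to uc_link (main_ctx)

std::vector<int> run() {
    trace.clear();
    static char stack[64 * 1024];   // the coroutine's own stack
    int rc = getcontext(&coro_ctx);
    assert(rc == 0);
    (void)rc;
    coro_ctx.uc_stack.ss_sp = stack;
    coro_ctx.uc_stack.ss_size = sizeof stack;
    coro_ctx.uc_link = &main_ctx;
    makecontext(&coro_ctx, coro_entry, 0);

    for (int i = 0; i < 3; ++i) {
        trace.push_back(-1);                  // marks "main is running"
        swapcontext(&main_ctx, &coro_ctx);    // resume the coroutine
    }
    return trace;
}

}  // namespace stackful_demo
```

Calling stackful_demo::run() yields the trace -1, 10, -1, 11, 20, -1, 21. The suspension happens inside nested_work, a plain function, which is the part that stackless coroutines cannot do without marking every caller.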
> You can roll stackful coroutines in C++ (or C) with 50-ish lines of Assembly
I'm not normally keen to "well actually" people with the C standard, but .. if you're writing in assembly, you're not writing in C. And the obvious consequence is that it stops being portable. Minicoro only supports three architectures. Granted, those are the three most popular ones, but other architectures exist.
(just double checked and it doesn't do Windows/ARM, for example. Not that I'm expecting Microsoft to ship full conformance for C++23 any time soon, but they have at least some of it)
Hmm. I'm fairly certain that most of that assembly code for saving/restoring registers can be replaced with setjmp/longjmp, and only control transfer itself would require actual assembly. But maybe not.
That's the problem with register machines, I guess. Interestingly enough, BCPL, its main implementation being a p-code interpreter of sorts, has pretty trivially supported coroutines in its "standard" library since the late seventies — as you say, all you need to save is the current stack pointer and the code pointer.
C++ destructors and exception safety will likely wreak havoc with any "simple" assembly/longjmp-based solution, unless severely constraining what types you can use within the coroutines.
That it has to heap-allocate if non-inlined is a misconception. This is only the default behavior.
One can define:
void *operator new(size_t sz, Foo &foo)
in the coro's promise type, and this:
- removes the implicitly-defined operator new
- forces the coro's signature to be CoroType f(Foo &foo), and forwards arguments to the "operator new" one defined
Therefore, it's pretty trivial to support coroutines even when heap cannot be used, especially in the non-recursive case.
Yes, green threads ("stackful coroutines") are more straightforward to use, however:
- they can't be arbitrarily destroyed when suspended (this would require stack unwinding support and/or active support from the green thread runtime)
- they are very ABI dependent. Among the "few registers" one has to save FPU registers. Which, in the case of older Arm architectures, and codegen options similar to -mgeneral-regs-only (for code that runs "below" userspace). Said FPU registers also take a lot of space in the stack frame, too
Really, stackless coros are just FSM generators (which is obvious if one looks at disasm)
Stackful makes for cute demos, but you need huge per-thread stacks if you actually end up calling into Linux libc, which tends to assume typical OS thread stack sizes (8MB). (I don't disagree that some of the other tradeoffs are nice, and I have no love for C++20 coroutines myself.)
Actually you don't even need ASM at all. You just need smart use of compiler built-ins to make it truly portable. See my composable continuation implementation: https://godbolt.org/z/zf8Kj33nY
As an ex-gamedev: suspend/resume of stackful coroutines made them too heavy to have several thousand of them running during a game loop for our game. At the time we used GameMonkey Script: https://github.com/publicrepo/gmscript
That was over 20 years ago. No idea what the current hotness is.
This:
generator fib() {
    a, b = 1, 2
    while (a<100) {
        b, a = a, a+b
        yield a
    }
    yield a-1
}
Becomes this:
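struct fibState {
    int a;
    int b;
    int position;   /* 0 = not started, 1 = inside the loop, 2 = after the loop, -1 = finished */
};
int fib(struct fibState *state) {
    switch (state->position) {
    case 0:
        state->a = 1;
        state->b = 2;
        while (state->a < 100) {
            /* b, a = a, a+b (new b is the old a) */
            state->a = state->a + state->b;
            state->b = state->a - state->b;
            /* switching the context */
            state->position = 1;
            return state->a;
    case 1:;   /* resume here, back inside the loop */
        }
        state->position = 2;
        return state->a - 1;
    case 2:
        state->position = -1;
    }
    return -1;   /* generator exhausted */
}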
The ugly state machine example presented in the article is also a manual implementation of a generator. It's as palatable to the normal programmer as raw compiler output. Being written in C++ makes it even uglier and more complicated.
The programming language I made is a concrete example of what programming these things manually is like. I had to write every primitive as a state machine just like the one above.
This is one reason why I built coroutines into my game programming language Easel (https://easel.games). I think they let you keep the flow of the code matching the flow of your logic (top-to-bottom), rather than jumping around, and so I think they are a great tool for high-level programming. The main thing is stopping the coroutines when the entity dies, and in Easel that is done by implying ownership from the context they are created in. It is quite a cool way of coding I think: it avoids the state machines like the OP stated, keeps everything straightforward step-by-step, and so all the code feels more natural in my opinion. In Easel they are called behaviors if anyone is interested in more detail: https://easel.games/docs/learn/language/behaviors
Not an expert in game development, but I'd say the issue with C++ coroutines (and 'colored' async functions in general) is that the whole call stack must be written to support that. From a practical perspective, that must in turn be backed by a multithreaded event loop to be useful, which is very difficult to write performantly and correctly. Hence, most people end up using coroutines with something like boost::asio, but you can do that only if your repo allows a 'kitchen sink' library like Boost in the first place.
More broadly the dimension of time is always a problem in gamedev, where you're partially inching everything forward each frame and having to keep it all coherent across them.
It easily can, and often does, lead to messy Rube Goldberg machines.
There was a game AI talk a while back, I forget the name unfortunately, but as I recall the guy was pointing out this friction and suggesting additions we could make at the programming language level to better support that kind of time spanning logic.
This is more evident in games/simulations but the same problem arises more or less in any software: batch jobs and DAGs, distributed systems and transactions, etc.
This is what Rich Hickey (Clojure author) has termed “place oriented programming”: the focus is on mutating memory addresses and having to synchronize everything, while failing to model time as a first class concept.
I’m not aware of any general purpose programming language that successfully models time explicitly; Verilog might be the closest to that.
> There was a game AI talk a while back, I forget the name unfortunately, but as I recall the guy was pointing out this friction and suggesting additions we could make at the programming language level to better support that kind of time spanning logic.
Sounds interesting. If it's not too much of an effort, could you dig up a reference?
As the author lays out, the thing that made coroutines click for me was the isomorphism with state machine-driven control flow.
That’s similar to most of what makes C++ tick: There’s no deep magic, it’s “just” type-checked syntactic sugar for code patterns you could already implement in C.
(Occurs to me that the exceptions to this … like exceptions, overloads, and context-dependent lookup … are where C++ has struggled to manage its own complexity.)
I've been doing a lot of work with ECS/Dots recently and once I wrapped my head around it - amazing.
I recall working on a few VR projects - where it's imperative that you keep that framerate solid or risk making the user physically sick - this is where I really began using coroutines for instantiating large volumes of objects and so on (and avoiding framerate stutter).
ECS/Dots & the burst compiler makes all of this unnecessary and the performance is nothing short of incredible.
Thankfully they are actively working towards upgrading. Unity 6.8 (they're currently on 6.4) is supposed to move fully to CoreCLR and remove Mono. We'll then finally be able to move to C# 14 (from C# 9, which came out in 2020), as well as use newer .NET functionality.
Not that ancient, they just haven't bothered to update their coroutine mechanism to async/await. The Stride engine does it with their own scheduler, for example.
Coroutines generally imply some sort of magic to me.
I would just go straight to tbb and concurrent_unordered_map!
The challenge of parallelism does not come from how to make things parallel, but how you share memory:
How you avoid cache misses, make sure threads don't trample each other and design the higher level abstraction so that all layers can benefit from the performance without suffering turnaround problems.
My challenge right now is how do I make the JVM fast on native memory:
1) Rewrite my own JVM.
2) Use the buffer and offset structure Oracle still has but has deprecated and is encouraging people to not use.
We need Java/C# (already has it but is terrible to write native/VM code for?) with bottlenecks at native performance and one way or the other somebody is going to have to write it?
This is quite understandable when you know the history behind how C++ coroutines came to be.
They were initially proposed by Microsoft, based on a C++/CX extension, that was inspired by .NET async/await implementation, as the WinRT runtime was designed to only support asynchronous code.
Thus if one knows how the .NET compiler and runtime magic works, including custom awaitable types, one will find some common ground with how C++ coroutines ended up looking.
"Just" is doing a lot of work there. I've used callback-based async frameworks in C++ in the past, and it turns into pure hell very fast. Async programming is, basically, state machines all the way down, and doing it explicitly is not nice. And trying to debug the damn thing is a miserable experience.
Not necessarily. A coroutine encapsulates the entire state machine, which might be a PITA to implement otherwise. Say, if I have a stateful network connection that requires initialization and periodic encryption secret renewal, a coroutine implementation would be much slimmer than a state machine with explicit states.
Lol, no thanks. People are using coroutines exactly to avoid callback hell. I have rewritten my own C++ ASIO networking code from callback to coroutines (asio::awaitable) and the difference is night and day!
You can structure coroutines with a context so the runtime has an idea when it can drop them or cancel them. Really nice if you have things like game objects with their own lifecycles.
The 'primitive' SCUMM language used for writing adventure games like Maniac Mansion had coroutines. An ill-fated attempt to convert to Python was hampered by Python (at the time) having no support for yield.
I don't know, I'm not convinced by this argument.
The "ugly" version with the switch seems much preferable to me.
It's simple, works, has way less moving parts and does not require complex machinery to be built into the language. I'm open to being convinced otherwise but as it stands I'm not seeing any horrible problems with it.
Joker_vD | 22 hours ago
[0] https://web.archive.org/web/20260105235513/https://www.chiar...
zozbot234 | 16 hours ago
matt_d | 15 hours ago
nananana9 | 22 hours ago
[1] https://github.com/edubart/minicoro
pjc50 | 22 hours ago
Joker_vD | 21 hours ago
Sharlin | 20 hours ago
TuxSH | 15 hours ago
> require the STL
loeg | 10 hours ago
Trung0246 | 9 hours ago
socalgal2 | 16 hours ago
MisterTea | 17 hours ago
The stack save/restore happens in: https://swtch.com/libtask/asm.S
nottorp | 9 hours ago
Why are people afraid of state machines? There's been sooo much effort spent on hiding them from the programmer...
matheusmoreira | 8 hours ago
For example, generators. Also known as semicoroutines.
https://langdev.stackexchange.com/a/834
https://www.matheusmoreira.com/articles/delimited-continuati...
BSTRhino | 10 hours ago
cherryteastain | 22 hours ago
abcde666777 | 22 hours ago
manoDev | 20 hours ago
syncurrent | 20 hours ago
repelsteeltje | 22 hours ago
truepricehq | 21 hours ago
[deleted]
twoodfin | 21 hours ago
HarHarVeryFunny | 21 hours ago
appstorelottery | 9 hours ago
wiseowise | 19 hours ago
pjc50 | 22 hours ago
tyleo | 22 hours ago
Deukhoofd | 21 hours ago
https://discussions.unity.com/t/coreclr-scripting-and-ecs-st...
Philip-J-Fry | 21 hours ago
Is that a hack? Is that not just exactly what IEnumerable and IEnumerator were built to do?
debugnik | 22 hours ago
Edit: Nevermind, they eventually bothered.
ahoka | 21 hours ago
repelsteeltje | 22 hours ago
bullen | 20 hours ago
pjmlp | 20 hours ago
mgaunard | 21 hours ago
I never understood the value. Just use lambdas/callbacks.
usrnm | 20 hours ago
affenape | 20 hours ago
spacechild1 | 19 hours ago
socalgal2 | 16 hours ago
jayd16 | 19 hours ago
For simple callback hell, not so much.
Sharlin | 20 hours ago
duped | 17 hours ago
bradrn | 21 hours ago
sagebird | 16 hours ago
Appreciate this humor -- absurd, tasteful.
Animats | 9 hours ago
djmips | 7 hours ago
troad | 6 hours ago
nice_byte | 12 hours ago
maltyxxx | 21 hours ago
[deleted]
sta1n | 5 hours ago
[deleted]