> For example when submitting a write operation, the memory location of those bytes must not be deallocated or overwritten.
> The io-uring crate doesn’t help much with this. The API doesn’t allow the borrow checker to protect you at compile time, and I don’t see it doing any runtime checks either.
I've seen comments like this before[1], and I get the impression that building a safe async Rust library around io_uring is actually quite difficult. Which is sort of a bummer.
IIRC Alice from the tokio team also suggested there hasn't been much interest in pushing through these difficulties more recently, as the current performance is "good enough".
[1] https://boats.gitlab.io/blog/post/io-uring/
This is actually one of my many gripes about Rust async and why I consider it a bad addition to the language in the long term. The fundamental problem is that Rust async was developed when epoll was dominant (and almost no one in Rust circles cared about IOCP), and that heavily influenced the async design (sometimes indirectly, through other languages).
Think about it for a second. Why do we not have this problem with "synchronous" syscalls? When you call `read` you also "pass a mutable borrow" of the buffer to the kernel, but it maps well onto the Rust ownership/borrow model since the syscall blocks execution of the thread and there is no way to prevent that in user code. With the poll-based async model you side-step these issues, since you use the same "sync" syscalls, just ones that are guaranteed to return without blocking.
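As a small illustration of that point (the async signature at the bottom is hypothetical, not any real crate's API):
```rust
use std::fs::File;
use std::io::Read;

// Blocking read: the kernel only touches `buf` while read() is on the stack,
// so the &mut borrow ends exactly when the kernel is done with the memory.
// The borrow checker's rules happen to match what the kernel actually does.
fn read_blocking(file: &mut File, buf: &mut [u8]) -> std::io::Result<usize> {
    file.read(buf)
}

// Completion-based sketch: after submission the kernel may keep writing into
// `buf` until a completion event arrives. If the caller drops the future
// early, the &mut borrow ends but the kernel still owns the memory -- the
// exact hazard quoted at the top of the thread.
//
// async fn read_uring(fd: RawFd, buf: &mut [u8]) -> std::io::Result<usize>;
```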
For a completion-based IO to work properly with the ownership/borrow model we have to guarantee that the task code will not continue execution until it receives a completion event. You simply can not do it with state machines polled in user code. But the threading model fits here perfectly! If we are to replace threads with "green" threads, user Rust code will look indistinguishable from "synchronous" code. And no, the green threads model can work properly on embedded systems as demonstrated by many RTOSes.
There are several ways we could've done it without making the async runtime mandatory for all targets (the main reason why green threads were removed from Rust 1.0). My personal favorite is the introduction of separate "async" targets.
Unfortunately, the Rust language developers made a bet on the unproven stackless polling model because of its promised efficiency, and we are in the process of finding out whether that bet pays off or not.
There is, I think, an ownership model that Rust's borrow checker supports very poorly, and for lack of a better name, I've called it hot potato ownership. The basic idea is that you have a buffer whose ownership you can give out, in the expectation that the person you gave it to will (eventually) give it back to you. It's a sort of non-lexical borrowing problem, and I very quickly discovered when trying to implement it myself in purely safe Rust that the "giving the buffer back" part is just really gnarly to write.
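Here is a minimal sketch of the hot-potato shape, with made-up names: the type system can express "give the buffer away", but nothing forces the borrower to ever send it back.
```rust
use std::sync::mpsc;

// The owner lends out a buffer; the borrower is expected to eventually
// return it on the channel it came with.
struct Potato {
    buf: Vec<u8>,
    return_to: mpsc::Sender<Vec<u8>>,
}

impl Potato {
    // Consume the potato, giving the buffer back to whoever lent it out.
    fn give_back(self) {
        let _ = self.return_to.send(self.buf);
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let potato = Potato { buf: vec![0u8; 4096], return_to: tx };

    // ... hand `potato` to some worker that fills the buffer ...
    potato.give_back();

    // The owner only gets the allocation back when (and if) the borrower
    // cooperates -- the gnarly part is making "if" impossible in safe Rust.
    let _buf: Vec<u8> = rx.recv().expect("buffer was never returned");
}
```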
> IIRC Alice from the tokio team also suggested there hasn't been much interest in pushing through these difficulties more recently, as the current performance is "good enough".
Well, I think there is interest, but mostly for file IO.
For file IO, the situation is pretty simple. We already have to implement that using spawn_blocking, and spawn_blocking has the exact same buffer challenges as io_uring does, so translating file IO to io_uring is not that tricky.
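Roughly, this is because the closure handed to spawn_blocking must be 'static, so it can't borrow the caller's buffer; ownership has to move in and come back with the result, which is the same shape a completion-based API wants. A sketch (assuming tokio; this is not tokio::fs's actual implementation):
```rust
use std::io::Read;

// Hypothetical helper: the buffer is moved into the blocking task and handed
// back alongside the result, because the closure cannot borrow from the caller.
async fn read_into(
    path: std::path::PathBuf,
    mut buf: Vec<u8>,
) -> std::io::Result<(Vec<u8>, usize)> {
    tokio::task::spawn_blocking(move || {
        let mut file = std::fs::File::open(path)?;
        let n = file.read(&mut buf)?;
        Ok((buf, n)) // the buffer travels back with the completion
    })
    .await
    .expect("blocking task panicked")
}
```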
On the other hand, I don't think tokio::net's existing APIs will support io_uring. Or at least they won't support the buffer-based io_uring APIs; there is no reason they can't register for readiness through io_uring.
I think the right way to build a safe interface around io_uring would be to use ring-owned buffers, ask the ring for a buffer when you want one, and give the buffer back to the ring when initiating a write.
As an example, this library I wrote before is cancel safe and doesn't use lifetimes etc. for it.
https://github.com/steelcake/io2
It’s annoying but possible to do this correctly and not have the API be too bad. The “happy path” of a clean success or error is fine if you accept that buffers can’t just be simple &[u8] slices. Cancellation can be handled safely with something like the following API contract:
Have your function signature be `async fn read(buffer: &mut Vec<u8>) -> Result<…>` (you can use something more convenient like `&mut BytesMut` too). If you run the future to completion (success or failure), the argument holds the same buffer passed in, with data filled in appropriately on success. If you cancel/drop the future, the buffer may point at an empty allocation instead (this is usually not an annoying constraint for most IO flows, and footgun potential is low).
The way this works is that your library "takes" the underlying allocation out of the variable before starting the operation, replacing it with a default, unallocated `Vec<u8>`. Once the buffer is no longer used by the IO system, it puts it back before returning. If you cancel, it manages the buffer in the background, releasing it when safe, and the unallocated buffer is left in the passed variable.
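A rough sketch of that contract, with made-up names (`submit_read` stands in for the real ring machinery, which would keep the buffer alive until the kernel is done with it even if the caller goes away):
```rust
// The caller keeps a plain `&mut Vec<u8>` signature; internally the allocation
// is taken out, lent to the IO system by value, and put back before returning.
async fn read(buffer: &mut Vec<u8>) -> std::io::Result<usize> {
    // Steal the allocation; the caller's variable now holds an empty Vec.
    let owned = std::mem::take(buffer);

    // The operation takes the buffer by value; in the real implementation the
    // ring keeps it alive in the background if this future is dropped mid-flight.
    let (owned, result) = submit_read(owned).await;

    // Success and error both hand the (possibly filled) buffer back.
    *buffer = owned;
    result
}

// Stand-in for the actual submission/completion plumbing.
async fn submit_read(buf: Vec<u8>) -> (Vec<u8>, std::io::Result<usize>) {
    // ... submit an SQE pointing at `buf`, await the CQE, keep `buf` alive ...
    (buf, Ok(0))
}
```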
I wish I could have been paid to work on a SPARK specification around io_uring so that one could have built on it. Or to work on SPARK-to-eBPF (there's already an LLVM backend for GNAT) and have some form of guarantees at the seams... alas.
I can recommend writing even the BPF side of things in Rust, using Aya[1].
[1] https://github.com/aya-rs/aya
This was a good read and great work. Can't wait to see the performance tests.
Your write-up connected some early knowledge from when I was 11, when I was trying to set up a database/backend and finding lots of cgi-bin online. I realize now those were spinning up new processes with each request: https://en.wikipedia.org/wiki/Common_Gateway_Interface
I remember when sendfile became available for my large gaming forum with dozens of TB of demo downloads. That alone was huge for concurrency.
I thought I had sworn off this type of engineering, but between this, the Netflix case of the extra 40ms, and the GTA 5 70% load-time reduction, maybe there is a lot more impactful work to be done.
https://netflixtechblog.com/life-of-a-netflix-partner-engine...
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
It wasn't just CGI; in the CERN and Apache lineage, every HTTP session was commonly a forked copy of the entire server! Apache gradually had better answers, but their API with common addons made it a bit difficult to transition, so web servers like nginx, which are built closer to the architecture in the article with event-driven I/O from the beginning, took off.
I am patient enough to wait for the benchmarks, so take your time, but I honestly love how the author doesn't care about benchmarks right now and wanted to clean up the code first.
It's kind of impressive that there are people who think this way in a world where benchmarks get maxxed and whole projects exist solely to satisfy benchmarks.
Really a breath of fresh air, and honestly I admire the author so much for this. It was such a good read, loved it a lot, thank you. I didn't know kTLS existed or that io_uring could be used in such a way.
Unfortunately io_uring is disabled by default on most cloud workload orchestrators, like CloudRun, GKE, EKS and even local Docker.
Hope this will change soon, but until then it will remain very niche.
Anybody know what the state of kTLS is? I asked one of the Cilium devs about it a while ago 'cause I'd seen Thomas Graf excitedly talking about it, and he told me that kernel support in many distros was lacking, so they aren't ready to enable it by default.
That's a shame. How hard is it to enable? Do you need a custom kernel, or can you enable it at runtime?
On FreeBSD, it's been in the kernel / OpenSSL since 13, and has been one runtime toggle (sysctl kern.ipc.tls.enable=1) away from being enabled. And it's enabled by default in the upcoming FreeBSD 15.
We (at Netflix) have run all of our TLS-encrypted streaming over kTLS for most of a decade.
I really want to see the benchmarks on this; I tried it like 4 days ago and then built a standard epoll implementation; I could not compete against nginx using uring, but that's not the easiest task for an arrogant night, so I really hope you get some deserved sweet numbers; mine were a sad disappointment, but I did not do most of your implementation - I rather simply tried to "batch" calls. Wish you the best of luck and much fun.
Rust - you need to understand: Futures, Pin, Waker, async runtimes, Send/Sync bounds, async trait objects, etc.
C++20: coroutines.
Go: goroutines.
Java 21+: virtual threads.
Note that C++ coroutines use heap allocation to avoid the problems that Pin is solving, which is a pretty big carve-out from the "zero overhead principle" that C++ usually aims for. The long development time of async traits has also been related to Rust not heap allocating futures. Whether that performance+portability-vs-complexity tradeoff is worth it for any given project is, of course, a different question.
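To make the Pin point concrete, here is a small illustration (not from the article) of the self-reference involved: once an async fn holds a borrow across an .await, the generated state machine contains a pointer into itself, so it must either be pinned in place (Rust) or live at a stable heap address (C++ coroutine frames).
```rust
// The compiled state machine stores both `buf` and `header`, and `header`
// points into `buf`. Moving the future after it has started would invalidate
// that internal pointer, which is exactly what Pin rules out.
async fn parse_header() -> usize {
    let buf = [0u8; 64];
    let header = &buf[..16]; // borrow into the future's own state
    pending_io().await;      // suspension point while the borrow is live
    header.len()
}

// Stand-in for a real IO await point.
async fn pending_io() {}
```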
The facts that Send/Sync bounds model are still relevant in all the other languages; the absence of Send/Sync just means it's easier to write subtly incorrect code.
If you are fine with writing "good enough" high-level Rust code (that will potentially still beat out most other languages in terms of performance) and are fine with using the mid-level primitives that other people have built, you don't really have to understand most of those things.
Rust: Well, yes. Rust does force you to understand these things, or it won't compile. It does have drawbacks.
Go: goroutines are not async. And you can't understand goroutines without understanding channels. And channels are weirdly implemented in Go, where the semantics of edge cases, while well defined, are like rolling a D20 die if you try to reason from first principles.
Go doesn't force you to understand things. I agree with that. It has pros and cons.
I see what you mean but "cheap threads" is not the same thing as async. More like "current status of massive concurrency". Except that's not right either. tarweb, the subject of the blog post in question, is single threaded and uses io_uring as an event loop. (the idea being to spin up one thread per CPU core, to use full capacity)
So it's current status of… what exactly?
Cheap threads have a benefit over an async loop, the main one being that they're easier to reason about. They also have drawbacks, e.g. each thread may be lightweight, but it still needs a stack.
https://www.usenix.org/system/files/atc23-zhu-lingjun.pdf
So reimplementing my foundation (with all the bugs) will not be worth it.
I will however compare Java's NIO (epoll) with the new Virtual Threads IO (without pinning).
http://github.com/tinspin/rupy
Also there is NAPI support in uring, which uses polled IO on sockets instead of interrupt-based IO, from what I understand. You can see examples using it in the liburing GitHub:
https://github.com/axboe/liburing/wiki/io_uring-and-networki...
In my experience "oversubscribing" threads to cores (more threads than cores) provides a wall-clock time benefit.
I think one thread per core would work better without preemptive scheduling.
But then we aren't talking about Unix.
Isolating a core and then pinning a single thread is the way to go to get both low latency and high throughput, sacrificing efficiency.
This works fine on Linux, and it is a common approach for trading systems, where it's fine to oversubscribe a bunch of cores for this type of stuff. The cores are mostly busy spinning and doing nothing, so it's very inefficient in terms of actual work, but great for latency and throughput when you need it.
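For illustration, the pin-one-thread-per-core setup looks roughly like this, assuming the third-party core_affinity crate (the same can be done with raw sched_setaffinity); isolating the cores themselves happens outside the program, e.g. with isolcpus.
```rust
use std::thread;

fn main() {
    // One worker per core, each pinned to its own core.
    let cores = core_affinity::get_core_ids().expect("could not enumerate cores");
    let workers: Vec<_> = cores
        .into_iter()
        .map(|core| {
            thread::spawn(move || {
                core_affinity::set_for_current(core);
                loop {
                    // Poll this core's own queues here; busy-spinning trades
                    // CPU efficiency for latency, as described above.
                    std::hint::spin_loop();
                }
            })
        })
        .collect();

    for worker in workers {
        let _ = worker.join();
    }
}
```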
A mistake people make with thread-per-core (TPC) architecture is thinking you can pick and choose the parts you find convenient, when in reality it is much closer to "all or nothing". It may be worse to half-ass a TPC implementation than to not use TPC at all. However, TPC is more efficient in just about all contexts if you do it correctly.
Most developers are unfamiliar with the design idioms for TPC e.g. how to properly balance and shed load between cores.
One thread per core if you're CPU-bound and not IO-bound.
In this very specific case, it seems as though the vast majority of the webserver's work is asynchronous and event-based, so the actual webserver is never waiting on I/O input or output - once it's ready you dump it somewhere the kernel can get to it and move on to the next request if there is one.
I think this gets this specific project close to the platonic ideal of a one-thread-per-core workload if indeed you're never waiting on I/O or any syscalls, but I feel as though it should come with extreme caveats of "this is almost never how the real world works so don't go artificially limiting your application to `nproc` threads without actually testing real-world use cases first".
Pretty cool! Adding kTLS is definitely an improvement. I made an actually zero-syscall per request server a few years ago (and blogged about it at https://wjwh.eu/posts/2021-10-01-no-syscall-server-iouring.h...) but as TFA notes it comes at a heavy cost of constantly busy-looping.
io_uring is very cool tech though and has been progressing at an impressive pace the last few years.
This is impressive but it’s also an amazing amount of complexity and difficult programming to work around the fact that syscalls are so slow.
It seems like there’s these fundamental things in OSes that we just can’t improve, or I suppose can’t without breaking too much backward compatibility, so we are forced to do this.
I don't think it has to be. Conceptually it's just a couple of queues.
There's a software equivalent of the Peter Principle where software or an API becomes increasingly complex to the point where no one understands it. They then attempt to fix that by adding more functionality (complexity).
I am working on something like this for work. But with plain old C.
> In order to avoid busy looping, both the kernel and the web server will only busy-loop checking the queue for a little bit (configurable, but think milliseconds), and if there’s nothing new, the web server will do a syscall to “go to sleep” until something gets added to the queue.
Under load it's zero syscalls (barring any rare allocations inside rustls for the handshake; I can't guarantee that it never does any).
Without load, the overhead of calling (effectively) sleep() is, while technically real, not relevant.
But sure, you can tweak the busyloop timers and burn 100% CPU on kernel and user side indefinitely if you want to avoid that sleep-when-idle syscall. It's just… not a good idea.
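For illustration, the spin-briefly-then-sleep strategy from the quote looks roughly like this with the io-uring crate (timings and structure are illustrative, not tarweb's actual code; the kernel-side equivalent is the SQPOLL idle timeout):
```rust
use std::time::{Duration, Instant};
use io_uring::IoUring;

fn event_loop(mut ring: IoUring, spin_for: Duration) -> std::io::Result<()> {
    loop {
        let spin_start = Instant::now();
        let mut got_work = false;

        // Busy-poll the completion queue for a bounded amount of time.
        while spin_start.elapsed() < spin_for {
            if let Some(cqe) = ring.completion().next() {
                got_work = true;
                handle_completion(cqe);
            }
        }

        // Nothing arrived: make one syscall to sleep until a completion lands.
        if !got_work {
            ring.submit_and_wait(1)?;
        }
    }
}

fn handle_completion(cqe: io_uring::cqueue::Entry) {
    // Dispatch to per-connection state in a real server.
    let _ = cqe.result();
}
```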
> This means that a busy web server can serve all of its queries without even once (after setup is done) needing to do a syscall. As long as queues keep getting added to, strace will show nothing.
Like all polling I/O models (that don't spin) it also means you have to wait milliseconds in the worst case to start servicing a request. That's a long time.
For comparison a read/write over a TCP socket on loopback between two process is a few microseconds using BSD sockets API.
FWIW Rust advice is maybe 15% of the bottom of the article, most of the decisions apply equally to C and the article is a fairly sensible survey of APIs.
I think Rust's glacial compile times prevent it from being a useful platform for web apps. Yes, it's a nice language, and very performant, but it's horrible devex to have to wait seconds for your server to recompile after a change.
What a time to be alive, when seconds to recompile is considered horrible devex.