item 41631982

What Is Io_uring?

91 points | todsacerdoti | 1 year ago | matklad.github.io

29 comments

jclulow | 1 year ago
This article, and indeed much of the discourse about the facility out in the wild, could really use some expansion on at least these two points:

> The application submits several syscalls by writing their codes & arguments to a lock-free shared-memory ring buffer.

That's not all, though, right? Unless you're using the dubious kernel-driven polling thread, which I gather is uncommon, you _also_ have to make a submit system call of some kind after you've finished appending stuff to your submit queue. Otherwise, how would the kernel know you'd done it?

> The kernel reads the syscalls from this shared memory and executes them at its own pace.

I think this leaves out most of the interesting information about how it actually works in practice. Some system calls are not especially asynchronous: they perform some amount of work that isn't asynchronously submitted to some hardware in quite the same way as, say, a disk write may be. What thread is actually performing that work? Does it occur immediately during the submit system call, and is completed by the time that call returns? Is some other thread co-opted to perform the work? Whose scheduling quantum is consumed by this?

Without concrete answers that succinctly describe this behaviour it feels impossible to see beyond the hype and get to the actual trade-offs being made.

Tuna-Fish | 1 year ago
> That's not all, though, right? Unless you're using the dubious kernel-driven polling thread, which I gather is uncommon, you _also_ have to make a submit system call of some kind after you've finished appending stuff to your submit queue. Otherwise, how would the kernel know you'd done it?

Correct, the "check for new entries" system call is called io_uring_enter(). (0)

> Some system calls are not especially asynchronous: they perform some amount of work that isn't asynchronously submitted to some hardware in quite the same way as, say, a disk write may be. What thread is actually performing that work? Does it occur immediately during the submit system call, and is completed by the time that call returns? Is some other thread co-opted to perform the work?

A kernel thread. The submit system call can optionally be made to wait for completions, but by default it returns immediately.

> Whose scheduling quantum is consumed by this?

That's a good question. The IO scheduler correctly sees them as belonging to the submitting thread, but if you issue a bunch of computation-heavy syscalls, I would not be surprised if they were not correctly accounted for.

(0) https://unixism.net/loti/ref-iouring/io_uring_enter.html#c.i...

pdpi | 1 year ago
> you _also_ have to make a submit system call of some kind after you've finished appending stuff to your submit queue.

Sure. As I understand it, you need a handful of syscalls to get your async IO setup up and running (one io_uring_setup call and a few mmaps), and then you interact with it (both to submit new work and get the results from old work) through the io_uring_enter syscall.

The point is that you're batching things, so you only pay for the context switching of a single physical syscall to make several logical syscalls. It's effectively a solution to the n+1 request problem at the kernel level.

Joker_vD | 1 year ago
> Otherwise, how would the kernel know you'd done it?

Well, you can make a design where the submission queue spans N+1 pages, and to notify the kernel you write something to the last page, which is actually write-protected, so the write triggers a trap into the kernel. I believe VirtIO has a similar scheme?

> What thread is actually performing that work?

None? You don't really need a user-space thread to execute code in the kernel: otherwise, starting process 1 would have to be the very first thing the kernel does when booting, while in reality it's about the last thing it does in the boot process.

With the multi-core systems we have today, arguably having a whole core dedicated exclusively to some core OS functionality could be more performant than having that core "constantly" switch contexts?

davedx | 1 year ago
> Without concrete answers that succinctly describe this behaviour it feels impossible to see beyond the hype and get to the actual trade-offs being made.

This article seems to be an attempt to succinctly describe the high-level goals of io_uring to people who don't know it, and in that respect I think it succeeds. The questions you're asking seem more related to how io_uring (or its API) is implemented, which is something else. I would hope anyone deciding whether to build something on io_uring would then do more detailed research on the trade-offs before pushing anything to production.

I appreciated the brevity personally...

gpderetta | 1 year ago
> dubious kernel-driven polling thread

I assume you are referring to IORING_SETUP_SQPOLL. Why is it dubious?

markles | 1 year ago
I've been playing with io_uring for the last month or so, just for fun. I'm working on building an async runtime from scratch on top of it. I've been documenting (think of them more as notes to myself) the process thus far:

Creating bindings to io_uring (just to see the process): https://www.thespatula.io/rust/rust_io_uring_bindings/

Writing an echo server using those bindings: https://www.thespatula.io/rust/rust_io_uring_echo_server/

larsrc | 1 year ago
This bit is unfortunate, I hope it improves: "You might want to avoid io_uring [if] you want to use features with a good security track record."

At least he's clear about it.

landr0id | 1 year ago
I'm kind of curious what Alex meant by this, as the security problems relating to io_uring are, to my knowledge, unrelated to the user-space program. It makes sense if you want to disable the feature in your own kernel or remove potential sandbox escape attack surface, but it's like saying "You might want to avoid win32k if you want to use features with a good security track record" (and I know this is kind of apples to oranges but you get the point).
watt | 1 year ago
io_uring also caused a ton of problems for our containers in Kubernetes when Node 20 enabled it by default. They scrambled and turned it off by default in https://github.com/nodejs/node/commit/686da19abb
clhodapp | 1 year ago
That sounds like a dubious integration contract for initialization between libuv and node, not an issue with io_uring.

I would assume the same thing would happen if they used traditional Posix APIs to open file handles before dropping their privileges.

sapiogram | 1 year ago
I'm extremely curious, what kinds of problems? We had entire Kubernetes nodes becoming unresponsive, and I think io_uring in Node 18.18.0 was responsible, but they had lots of stuff running on them so I was never able to pinpoint the exact cause.
v3gas | 1 year ago
> Oct 2, 2024

From the future!

Brajeshwar | 1 year ago
Nice! I think he had a post dated in the future (the 32nd of September), but his tool somehow published it anyway:

  https://github.com/matklad/matklad.github.io/blob/master/content/posts/2024-09-32--what-is-io-uring.dj
boarush | 1 year ago
I wonder if matklad schedules their blog posts after penning them down?
rwmj | 1 year ago
Although I'm a fan of io_uring, one reason to avoid it is that it involves quite major program restructuring (versus traditional event loops). If you don't want to depend on Linux >= 6 forever but also want to support other OSes, then you'll have to maintain both versions.
twen_ty | 1 year ago
So it's 2024 and Linux still doesn't have what Windows NT had 30 years ago?
jart | 1 year ago
Are you saying io_uring is basically IOCP for Linux? If that's true, then I'm not happy about it being in the kernel. IOCP only goes 2x faster in my experience (vs. threads and blocking i/o) and that isn't worth it for the ugly error prone code you have to write. System calls are also a lot faster on Linux than they are on Windows, so I'd be surprised if io_uring manages even that.