
Lord of the io_uring: io_uring tutorial, examples and reference

267 points | shuss | 5 years ago | unixism.net

74 comments

[+] geofft|5 years ago|reply
One thing this writeup made me realize is, if I have a misbehaving I/O system (NFS or remote block device over a flaky network, dying SSD, etc.), in the pre-io_uring world I'd probably see that via /proc/$pid/stack pretty clearly - I'd see a stack with the read syscall, then the particular I/O subsystem, then the physical implementation of that subsystem. Or if I looked at /proc/$pid/syscall I'd see a read call on a certain fd, and I could look in /proc/$pid/fd/ and see which fd it was and where it lived.

However, in the post-io_uring world, I think I won't see that, right? If I understand right, I'll at most see a call to io_uring_enter, and maybe not even that.

How do I tell what a stuck io_uring-using program is stuck on? Is there a way I can see all the pending I/Os and what's going on with them?

How is this implemented internally - does it expand into one kernel thread per I/O, or something? (I guess, if you had a silly filesystem which spent 5 seconds in TASK_UNINTERRUPTIBLE on each read, and you used io_uring to submit 100 reads from it, what actually happens?)

[+] Matthias247|5 years ago|reply
I think that's a very reasonable concern. It isn't really specific to io_uring, though - it applies to all "async" solutions. Even today, if you are running async I/O in userspace (e.g. using epoll), it's not very obvious where something went wrong, because no task is visibly blocked. If you attach a debugger, you will most likely see something blocked on epoll - but a call stack leading to the problematic application code is nowhere in sight.

Even if you pause execution while inside the application code, there might not be a great stack containing all the relevant data. It will only contain the information since the last task resumption (e.g. through a callback). Depending on your solution (C callbacks, C++ closures, C# or Kotlin async/await, Rust async/await) the information will range from not very helpful to somewhat understandable, but it is never on par with a synchronous call.

[+] cyphar|5 years ago|reply
You would want to start using the more modern debugging tools, namely dynamic tracing tools like bpftrace[1]. Though in fairness, it might be a tad tricky to get a trace for a specific file without some more complicated scripts.

[1]: https://github.com/iovisor/bpftrace

[+] shuss|5 years ago|reply
This is such a great point. I never thought about how async I/O could be a problem in this way. In the SQ polling example, I used BPF to "prove" that the process does not make system calls:

https://unixism.net/loti/tutorial/sq_poll.html

It could be a good idea to use BPF to expose what io_uring is doing. Just a wild thought.

[+] matheusmoreira|5 years ago|reply
Good point. Would be great if the submission and completion ring buffers were accessible via procfs.
[+] ecnahc515|5 years ago|reply
Could eBPF be used? I'm really not sure myself.
[+] tyingq|5 years ago|reply
There are some benchmarks that show io_uring as a significant boost over aio: https://www.phoronix.com/scan.php?page=news_item&px=Linux-5....

I see that nginx accepted a pull request to use it, mid last year: https://github.com/hakasenyang/openssl-patch/issues/21

Curious if it's also been adopted by other popular I/O-intensive software.

[+] jandrewrogers|5 years ago|reply
I have not adopted io_uring yet because it isn't clear that it will provide useful performance improvements over linux aio in cases where the disk I/O subsystem is already highly optimized. Where io_uring seems to show a benefit relative to linux aio is with more naive software designs, which adds a lot of value but is a somewhat different value proposition than the one being expressed.

For software that is already capable of driving storage hardware at its theoretical limit, the benefit is less immediate and offset by the requirement of having a very recent Linux kernel.

[+] jra_samba|5 years ago|reply
Samba can optionally use it if you explicitly load the vfs_io_uring module, but it exposed a bug for us (see my comment above). We're fixing it right now.
[+] eMSF|5 years ago|reply
A comment from the cat example:

>/* For each block of the file we need to read, we allocate an iovec struct which is indexed into the iovecs array. This array is passed in as part of the submission. If you don't understand this, then you need to look up how the readv() and writev() system calls work. */

I have to say, I don't really understand why the author chose to individually allocate (up to millions of) single-kilobyte buffers for each file. Perhaps there is a reason for it, but I think they should elaborate on the choice. Anyway, I guess the first example is too simplified, which is why what follows is not built on top of it in any way; hence the examples feel disjointed.

The bigger problem here is that I don't know the author or how talented they are. Choices like that, or writing non-async-signal-safe signal handlers, don't help in estimating it, either. Is the rest of the advice sound?

[+] shuss|5 years ago|reply
The author here: all examples in the guide are aimed at shedding light on the io_uring and liburing interfaces. They are not very useful or very real-world examples. The idea with this example in particular is to show the difference between how readv/writev work synchronously vs. how they would be "called" via io_uring. Maybe I should call out in the text that these programs are tuned more towards explaining the io_uring interface. Thanks for the feedback.
[+] matheusmoreira|5 years ago|reply
So awesome... The ring buffer is like a generic asynchronous system call submission mechanism. The set of supported operations is already a subset of available Linux system calls:

https://github.com/torvalds/linux/blob/master/include/uapi/l...

It almost gained support for ioctl:

https://lwn.net/Articles/810414/

Wouldn't it be cool if it gained support for other types of system calls? Something this awesome shouldn't be restricted to I/O...

[+] diegocg|5 years ago|reply
The author seems to be planning to expand it into a generic way of doing asynchronous syscalls.
[+] ignoramous|5 years ago|reply
Can anyone familiar with InfiniBand's approach to exposing I/O via rx/tx queues [0] comment on whether it is similar to io_uring's ring buffers [1]? How do the two contrast?

[0] https://www.cisco.com/c/en/us/td/docs/server_nw_virtual/2-10...

[1] https://news.ycombinator.com/item?id=19846261

[+] throw7|5 years ago|reply
The site pushes really hard the idea that you shouldn't use the low-level system calls in your code and that you should (always?) be using a library (liburing).

What exactly does liburing bring to the table such that I shouldn't be using the io_uring syscalls directly?

[+] matheusmoreira|5 years ago|reply
You absolutely can use system calls in your code. The kernel has an awesome header that makes this easy and allows you to eliminate all dependencies:

https://github.com/torvalds/linux/blob/master/tools/include/...

This system call avoidance dogma exists because libraries generally have more convenient interfaces and are therefore easier to use. They're not strictly necessary though.

It should be noted that using certain system calls directly may cause problems with the libraries you're using. For example, glibc needs to maintain complete control over the threading model in order to implement thread-local storage. If you issue a clone system call directly, you break the glibc threading model, and even something as simple as errno is likely to stop working.

In my opinion, libraries shouldn't contain thread-local or global variables in the first place. Unfortunately, the C language is old and these problems will never be fixed. It's possible to create better libraries in freestanding C or even freestanding Rust but replacing what already exists is a lifetime of work.

> What exactly is liburing bringing to the table that I shouldn't be using the uring syscalls directly?

It's easier to use compared to the kernel interface. For example, it handles submission queue polling automatically without any extra code.

[+] shuss|5 years ago|reply
The raw io_uring interface, once you ignore the boilerplate initialization code, is actually super simple to use. liburing itself is only a very thin wrapper on top of io_uring. I feel that if you use io_uring for a while, you'll end up with a bunch of convenience functions of your own; today, liburing looks to me like a collection of those functions.

One place where liburing provides a slightly higher-level interface is the function io_uring_submit(). Among other things, it determines whether there is a need to call the io_uring_enter() system call, depending, for example, on whether you are in submission queue polling mode. You can read more about it here:

https://unixism.net/loti/tutorial/sq_poll.html

Otherwise, at least at this time, liburing is a simple wrapper.

[+] andoma|5 years ago|reply
io_uring requires userspace to access it using a well-defined load/store memory ordering. Care must be taken to make sure the compiler does not reorder instructions but also to use the correct load/store instructions so hardware doesn't reorder loads and stores. This is easier to (accidentally) get correct on x86 as it has stronger ordering guarantees. In other words, if you are not careful your code might be correct on x86 but fail on Arm, etc. Needless to say the library handles all of this correctly.
[+] jra_samba|5 years ago|reply
io_uring still has its wrinkles.

We are scrambling right now to fix a problem due to a change in behavior that the io_uring kernel code exposes to user space in later kernels.

Turns out that in earlier kernels (Ubuntu 19.04 5.3.0-51-generic #44-Ubuntu SMP) io_uring will not return short reads/writes (that's where you ask for e.g. 8k, but there's only 4k in the buffer cache, so the call doesn't signal as complete and blocks until all 8k has been transferred). In later kernels (not sure when the behavior changed, but the one shipped with Fedora 32 has the new behavior) io_uring returns partial (short) reads to user space. e.g. You ask for 8k but there's only 4k in the buffer cache, so the call signals complete with a return of only 4k read, not the 8k you asked for.

Userspace code now has to cope with this where it didn't before. You could argue (and kernel developers did :-) that this was always possible, so user code needs to be aware of this. But it didn't used to do that :-). Change for user space is bad, mkay :-).

[+] magicalhippo|5 years ago|reply
I know nothing about io_uring, but looking at the man page[1] of readv I see it returns the number of bytes read. For me as a developer, that's an unmistakable flag that partial reads are possible.

Was readv changed? The man page also states that partial reads are possible, but I guess that might have been added later?

If it always returned bytes read, it would hardly be the first case where the current behavior is mistaken for the specification. My fondest memory of that is all the OpenGL 1.x programs that broke when OpenGL 2.x was released.

[1]: http://man7.org/linux/man-pages/man2/readv.2.html

[+] jra_samba|5 years ago|reply
It was really interesting how this was found.

A user started describing file corruption when copying to/from Windows with the io_uring VFS module loaded.

Tests using the Linux kernel cifsfs client and the Samba libsmbclient libraries/smbclient user-space transfer utility couldn't reproduce the problem, neither could running Windows against Samba on Ubuntu 19.04.

What turned out to be happening was a combination of things. Firstly, the kernel changed so that an SMB2_READ request against Samba with io_uring loaded sometimes hit a short read, where some of the file data was already in the buffer cache, so io_uring now returned a short read to smbd.

We returned this to the client, as in the SMB2 protocol it isn't an error to return a short read, the client is supposed to check read returns and then re-issue another read request for any missing bytes. The Linux kernel cifsfs client and Samba libsmbclient/smbclient did this correctly.

But it turned out that Windows10 clients and MacOSX Catalina (maybe earlier versions of clients too, I don't have access to those) clients have a horrible bug, where they're not checking read returns when doing pipeline reads.

When trying to read a 10GB file for example, they'll issue a series of 1MB reads at 1MB boundaries, up to their SMB2 credit limit, without waiting for replies. This is an excellent way to improve network file copy performance as you fill the read pipe without waiting for reply latency - indeed both Linux cifsfs and smbclient do exactly the same.

But if one of those reads returns a short value, Windows10 and MacOSX Catalina DON'T GO BACK AND RE-READ THE MISSING BYTES FROM THE SHORT READ REPLY !!!! This is catastrophic, and will corrupt any file read from the server (the local client buffer cache fills the file contents I'm assuming with zeros - I haven't checked, but the files are corrupt as checked by SHA256 hashing anyway).

That's how we discovered the behavior and ended up leading back to the io_uring behavior change. And that's why I hate it when kernel interfaces expose changes to user-space :-).

[+] beagle3|5 years ago|reply
Is there any intention to optimize work done, rather than just the calling interface?

E.g., running an rsync of a 10M-file hierarchy usually requires 10M synchronous stat calls. Using io_uring would make them asynchronous, but they could potentially also be done more efficiently (e.g. convert file names to inodes in blocks of 20k, then stat those 20k inodes in a batch).

That would require e.g. the VFS layer to support batch operations. But io_uring would actually allow that without a user-space interface change.

[+] jcoffland|5 years ago|reply
Maybe I just missed this, but can anyone tell me which kernel versions support io_uring? I ran the following test program on 4.19.0, and it reports io_uring as not supported:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/utsname.h>
    #include <liburing.h>
    #include <liburing/io_uring.h>


    static const char *op_strs[] = {
      "IORING_OP_NOP",
      "IORING_OP_READV",
      "IORING_OP_WRITEV",
      "IORING_OP_FSYNC",
      "IORING_OP_READ_FIXED",
      "IORING_OP_WRITE_FIXED",
      "IORING_OP_POLL_ADD",
      "IORING_OP_POLL_REMOVE",
      "IORING_OP_SYNC_FILE_RANGE",
      "IORING_OP_SENDMSG",
      "IORING_OP_RECVMSG",
      "IORING_OP_TIMEOUT",
      "IORING_OP_TIMEOUT_REMOVE",
      "IORING_OP_ACCEPT",
      "IORING_OP_ASYNC_CANCEL",
      "IORING_OP_LINK_TIMEOUT",
      "IORING_OP_CONNECT",
      "IORING_OP_FALLOCATE",
      "IORING_OP_OPENAT",
      "IORING_OP_CLOSE",
      "IORING_OP_FILES_UPDATE",
      "IORING_OP_STATX",
      "IORING_OP_READ",
      "IORING_OP_WRITE",
      "IORING_OP_FADVISE",
      "IORING_OP_MADVISE",
      "IORING_OP_SEND",
      "IORING_OP_RECV",
      "IORING_OP_OPENAT2",
      "IORING_OP_EPOLL_CTL",
      "IORING_OP_SPLICE",
      "IORING_OP_PROVIDE_BUFFERS",
      "IORING_OP_REMOVE_BUFFERS",
    };


    int main() {
      struct utsname u;
      uname(&u);

      struct io_uring_probe *probe = io_uring_get_probe();
      if (!probe) {
        printf("Kernel %s does not support io_uring.\n", u.release);
        return 0;
      }

      printf("List of kernel %s's supported io_uring operations:\n", u.release);

      for (int i = 0; i < IORING_OP_LAST; i++ ) {
        const char *answer = io_uring_opcode_supported(probe, i) ? "yes" : "no";
        printf("%s: %s\n", op_strs[i], answer);
      }

      free(probe);
      return 0;
    }
[+] cesarb|5 years ago|reply
If you have a clone of the Linux kernel source tree, you just have to look at the history of the include/uapi/linux/io_uring.h file. From a quick look here: everything up to IORING_OP_POLL_REMOVE came with Linux 5.1; IORING_OP_SYNC_FILE_RANGE was added in Linux 5.2; IORING_OP_SENDMSG and IORING_OP_RECVMSG came with Linux 5.3; IORING_OP_TIMEOUT with Linux 5.4; everything up to IORING_OP_CONNECT is in Linux 5.5; everything up to IORING_OP_EPOLL_CTL is in Linux 5.6; and the last three are going to be in Linux 5.7.
[+] yxhuvud|5 years ago|reply
It is documented in the liburing man pages.

Furthermore, recent variants of io_uring have a probe-function that allows checking for capabilities.

Generally speaking, though, you will need a more recent kernel than 4.x.

[+] shuss|5 years ago|reply
io_uring_get_probe() needs v5.6 at least.
[+] jmb001nyc|5 years ago|reply
Question: how does one detect socket push back using io_uring? For example, with libc write/writev on a non-blocking socket, a call would return fewer bytes than requested and allow code to poll for write readiness before writing more. This is quite useful for handling scenarios where there is an impedance mismatch between processing speed and the ability to send data over the network, i.e. processing needs to observe push back and handle it appropriately. Apologies: I posted this question to Twitter before I read the redirect here.
[+] qubex|5 years ago|reply
Unfortunately I misread the title as “Lord of the Urine” and... was concerned.