item 19818899

I/O Is Faster Than CPU – Let’s Partition Resources and Eliminate OS Abstractions [pdf]

817 points | ingve | 6 years ago | penberg.org

277 comments

[+] Animats|6 years ago|reply
Mainframe designers had this problem under control by 1970. Mainframes had, and have, "channels". A channel is part of the processor architecture. It takes commands, sends them to a peripheral, and manages the data transfer in both directions. Channels have some privileged functions through which the OS tells them where the data is supposed to go in memory. The architecture of channels is well defined, and peripherals are built to talk to channels. The CPU has I/O instructions to control channels in a well defined way.

The peripheral never has access to main memory. There is no peripheral-controlled "direct memory access" (DMA). So it's possible to give control of a channel to a userland program without a memory security risk.
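The protection argument can be sketched in a few lines (a toy model with invented names, not real channel commands): the OS's privileged call sets the memory window, after which an untrusted program can drive transfers directly, because every transfer is bounds-checked against that window.

```python
# Toy model of a mainframe-style I/O channel (illustrative only; the
# names are invented, not real S/360 channel architecture).
class Channel:
    def __init__(self, memory: bytearray):
        self._memory = memory
        self._base = 0
        self._limit = 0

    def os_set_window(self, base: int, limit: int):
        """Privileged: only the OS tells the channel where data may land."""
        self._base, self._limit = base, limit

    def user_read(self, device_data: bytes, offset: int):
        """Unprivileged: a userland program can issue this directly.
        The transfer is bounds-checked, so the peripheral never touches
        memory outside the OS-assigned window."""
        start = self._base + offset
        if start < self._base or start + len(device_data) > self._limit:
            raise PermissionError("transfer outside channel window")
        self._memory[start:start + len(device_data)] = device_data

mem = bytearray(1024)
ch = Channel(mem)
ch.os_set_window(base=256, limit=512)  # OS grants the window [256, 512)
ch.user_read(b"hello", offset=0)       # lands at 256..261: allowed
```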

Minicomputers of the 1970s had low transistor counts and slow CPUs. So peripherals were usually put directly on the memory bus, with full access to memory. I/O operations were performed by storing into memory addresses, which caused bus transactions detected by the peripheral device. There were no CPU I/O instructions.

Microprocessors copied the minicomputer model. IBM's people knew this was a bad idea, and in the IBM PS/2, they introduced the "microchannel". Peripheral vendors, facing a new architecture that required more transistors, screamed. IBM backed down and went back to bus-oriented peripherals.

That model persists today, even though the few thousand transistors required for a channel controller are nothing now. Even though most modern CPUs have I/O channel-like machinery, it's exposed to the program as registers the program stores into and as memory accesses by the peripheral device.

So there's no standardization on how to talk to devices at the hardware level. Some CPUs have protection systems, an "I/O MMU", and there have been various channel-like interfaces, especially from Intel, but they have never caught on.

Instead, we mostly have heavy kernel mediation between the hardware and the user program. And way too many "drivers". This has become a problem with "solid state disk", which is really a random access memory device that doesn't write very fast. Mostly, it's used to emulate rotating disks.

Samsung makes a key/value store device which uses SSD-type memory devices but manages the key/value store itself. But you need a kernel between the device and the user program. You can't just open a channel to it and let the user program access it.

[+] jandrewrogers|6 years ago|reply
At least in database kernels, we noticeably reached this threshold around five years ago with typical server hardware. This is an interesting computer science problem in that virtually all of our database literature is based on the presumption that I/O is much slower than CPU.

If you cleanroom a database kernel design based on the assumption that I/O performance is not the bottleneck, you end up with an architecture that looks very different than the classic model you learn at university. It is always a tradeoff of burning a resource you have in abundance to optimize utilization of a resource that is scarce, and older database architectures are quite wasteful of resources that have become relatively scarce on newer hardware.

[+] shereadsthenews|6 years ago|reply
I don't know ... we used to interleave our hard disk formats because a hard disk could stream data faster than an i386 CPU could ingest it, and there was plenty of database research done prior to 1990.
[+] thaumaturgy|6 years ago|reply
The paper reads like it's suggesting moving the burden of complexity in dealing with varying hardware interfaces from the kernel to userland so that userland can take direct advantage of higher performance hardware when it's available.

I could see that for some very small niches, but in general I think it would be a terrible development for the industry.

Hardware vendors don't like to share. They don't share code, they don't share common interfaces, they don't even share documentation. As it is now, these are all problems which most userland developers don't have to care about -- those problems get dealt with in the kernel, by developers who specialize in building support for uncooperative hardware.

The average application developer doesn't want to have to figure out how many queues are supported by a NIC just to open a connection on the network. Further: the average application developer isn't experienced enough to do this correctly.

Given the niche where these tradeoffs make sense, I'm not sure why the paper bothers to emphasize security at all.

[+] saltcured|6 years ago|reply
There is a constant dance at the fringe of high performance systems. It leads to a recurring pattern of "revolutionizing" with some kind of bypass or coprocessor architecture, then eventually reverting to traditional structures as the new performance realities reach commodity levels.

Part of it is that the economics at the fringe can pursue speed at any cost. And part is the heady appeal of doing things differently for researchers and advanced practitioners. But, in the long view, I think you are right that it is a bad idea. If you care about maintenance and sustainability, you usually find that these bypass solutions get abandoned as soon as the more conventional approaches can approximate their speed on newer commodity hardware. So there is huge churn in these specialist devices with specialist APIs and tooling.

There is a recurring theme in high performance networking where crazy things are tried and all sorts of fancy protocol offloading written, then eventually deprecated because it is seen as a support burden and a source of bugs. Because each of these specialized stacks has a smaller user base, they have less economy of scale to invest in maintenance and stabilization.

[+] josephg|6 years ago|reply
I don’t think anyone is suggesting that every application explicitly code for each network device. All of that tricky logic could be put into a userland library instead of the kernel. If we wanted, we could even replicate a kernel device driver style API in userland that network device drivers program against, done as a set of userland library files loaded dynamically based on detected hardware.

The tricky part wouldn’t be sharing code between applications. We know how to do that. The hard part would be figuring out a clean way to share the hardware between all running applications, given that any app could be terminated at any time, apps might be mutually untrustworthy and apps would have to play nice to share resources. I can imagine a hybrid approach where the kernel allocates network queues to applications and suggests userland device drivers. While running, the apps would have direct access to the hardware. And when the app is terminated the kernel would reclaim the assigned hardware for reuse by other applications.
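That hybrid could be sketched roughly like this (all names hypothetical, just to make the grant/reclaim lifecycle concrete):

```python
# Sketch of the hybrid model described above: the kernel hands out NIC
# queues, apps drive them directly via a userland driver, and the
# kernel reclaims queues when an app exits or is killed.
class NicQueue:
    def __init__(self, qid: int):
        self.qid = qid
        self.owner = None  # pid of the process driving this queue

class Kernel:
    def __init__(self, num_queues: int):
        self.free = [NicQueue(i) for i in range(num_queues)]
        self.assigned = {}

    def grant_queue(self, pid: int) -> NicQueue:
        """Privileged path: allocate a hardware queue to a process."""
        q = self.free.pop()
        q.owner = pid
        self.assigned.setdefault(pid, []).append(q)
        return q

    def reap(self, pid: int):
        """On process exit, reclaim its queues for other apps."""
        for q in self.assigned.pop(pid, []):
            q.owner = None
            self.free.append(q)

kernel = Kernel(num_queues=4)
q = kernel.grant_queue(pid=101)  # app 101 now talks to the NIC directly
kernel.reap(pid=101)             # crash or exit: the hardware comes back
```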

[+] atq2119|6 years ago|reply
You should take a look at GPUs.

All of the complexity of 3D rendering is implemented in userspace, and that has been the case universally in production for more than 5 years now -- closer to 10, really. If you replace "all" with "almost all", you can go back much further than that, all the way to the beginning of GPUs' existence. And yet normal application developers don't have to care, because the driver does it for them.

The point is, drivers don't have to live in kernel space, they can live in user space as well. Networking folks may start being more serious about this nowadays, but it has been the reality in GPUs for a long time.

[+] johnm1019|6 years ago|reply
Disclaimer: IANA kernel developer.

Is there any benefit to having those specialized developers create frameworks or libraries in userland which other developers can leverage? This way they remain the interface to uncooperative hardware, but the code is in userland so the bold folks can try their own approach.

[+] baruch|6 years ago|reply
At least for NVMe SSDs that is already solved: there is a standard that all NVMe SSDs implement, and you only need one driver for all of them. If you want, you can use SPDK or one of the few other drivers and you get full-speed block access.

What you don't get however is sharing the disk between multiple processes.

[+] robbyt|6 years ago|reply
As I was reading this, I remembered the days of my youth setting the IRQ and DMA address for my soundblaster (compatible) soundcard.
[+] baybal2|6 years ago|reply
> Hardware vendors don't like to share. They don't share code, they don't share common interfaces, they don't even share documentation. As it is now, these are all problems which most userland developers don't have to care about -- those problems get dealt with in the kernel, by developers who specialize in building support for uncooperative hardware.

Thus more money for us :) I think this fact almost begs to be taken advantage of. See how things work in corporate storage products, with EMC being the prime example.

[+] Q6T46nT668w6i3m|6 years ago|reply
I enjoyed the paper. My impression is that you’d shift the burden to the runtimes that, for many applications, currently sit between POSIX and applications (e.g. see the Q&A about POSIX).
[+] TheSoftwareGuy|6 years ago|reply
An OS is more than just a kernel.

OSes would simply ship with userland drivers instead of kernel-space drivers.

[+] yingw787|6 years ago|reply
There was a great blog post I read a while back by Dan Luu about constructing a caching layer across the network: https://danluu.com/infinite-disk/

I asked a friend who works in a quant firm and he was like yes it’s true, and it is pretty insane.

I think there’s research Microsoft and Google are doing for RDMA over 100G Ethernet for intra data center communication as well. Pretty neat.

[+] baybal2|6 years ago|reply
I worked on something similar 3 years ago as a sub-sub-subcontractor for a company making DCs for Alibaba. It took them almost 2.5 years after I signed off on the work to roll it out in a limited commercial trial in their alicloud hosting.

The original idea was to let purpose made hardware be distributed across DC rather than every server having to carry it: video codecs on FPGAs, hardware wire speed crypto/compression, databases and k/v stores exposed over RDMA, and remote block storage on SSDs.

I was invited to the opening ceremony for the DC. When the company's bosses were shown SFX-infused 3D graphs allegedly representing their AI things running on it, I was unable to restrain myself from ruining the atmosphere by asking how that could be, when all the servers in the DC were shut down :D

[+] theincredulousk|6 years ago|reply
Yes! Was surprised the paper didn't specifically mention RDMA, or to a lesser degree SR-IOV, with all its focus on NICs.

Also, there may be ongoing research, but it isn't theory at all. HPC shops, HFTs, and the cloud providers have been leveraging RDMA for a long time - e.g. Infiniband. Doing it over Ethernet (RoCE) is relatively new, and it isn't necessarily any big leap that it happens over 100G instead of 40 or 1.

However, an interesting point as network links go to 100G+ (esp. for RDMA) is again on the storage/processing side. E.g. a Wireshark capture on a 100G connection? ~12.5 GB/second, near the max bandwidth of DDR3, which can fill 64GB of RAM in about 5 seconds at full fire-hose. So again the hot potato of the bottleneck will be passed, at least for maximum sustained performance situations.
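The arithmetic here checks out:

```python
# Back-of-the-envelope check of the numbers above: a saturated 100G
# link in bytes per second, and how long it takes to fill 64 GB of RAM.
link_gbit = 100
bytes_per_sec = link_gbit * 1e9 / 8           # 12.5 GB/s
ram_gb = 64
fill_seconds = ram_gb * 1e9 / bytes_per_sec   # ~5.12 s at full fire-hose
```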

Side note, AFAIK RoCE exists mostly due to non-technical arguments, particularly the inertia created by existing familiarity and deployment of Ethernet in data centers. I think Microsoft was the one flexing on a standards-body to push it through. It is somewhat of a kludge as Ethernet wasn't designed with RDMA in mind - no guaranteed predictable latency, frames can and will disappear if switch buffers overflow, etc. So IMO "research" into the topic isn't super profound - akin to studying how your sedan might be heavily modified to go off-roading almost (but not quite) as well as a pick-up truck.

Even now many that have the luxury are just going Infiniband from the get-go if RDMA/latency are the key priorities rather than tacked on later.

[+] HillaryBriss|6 years ago|reply
neat blog post! i like the table at the top comparing latencies. at one point the post says:

> I'm paying $60/month for 100Mb, and if the trend of the last two decades continues, we should see another 50x increase in bandwidth per dollar over the next decade.

he seems to have more confidence in his bandwidth provider than I have in mine!
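For reference, the quoted "50x per decade" compounds to roughly 48% more bandwidth per dollar every year:

```python
# What "50x per decade" implies as an annual growth rate.
growth_per_decade = 50
annual = growth_per_decade ** (1 / 10)  # ~1.48, i.e. ~48% per year
```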

[+] AtlasBarfed|6 years ago|reply
all that huge scale distributed storage is going to run into CAP concerns with imperfect network links, nodes, storage units. I get a lot of that is consumer/social network crapdata with low guarantees to the end user. Dropbox would probably like a little more of a guarantee, and enterprise even more so.
[+] deRerum|6 years ago|reply
In the past (around the time most programming languages were invented) memory speeds were faster than processor speeds, so all variable accesses were effectively instantaneous. Languages like C did not have to worry about memory hierarchies.

If memory latency is 100 ns, you would notice the memory bottleneck around the time your processor speed reaches 10 MHz. That point was reached in the mid-1980s with the 286 processor. Yet through the addition of cache memory this bottleneck was hidden from most software, which continued to operate in a bubble as if it were still running on the hardware of the 1980s.
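The arithmetic behind that crossover:

```python
# A 100 ns memory can serve one access per cycle only while the CPU
# clock stays at or below 1 / 100 ns = 10 MHz. Past that, every
# uncached access stalls the processor; at 3 GHz a single 100 ns miss
# costs hundreds of cycles.
mem_latency_ns = 100
max_hz_without_stall = 1 / (mem_latency_ns * 1e-9)     # 10 MHz
cycles_stalled_at_3ghz = 3e9 * mem_latency_ns * 1e-9   # ~300 cycles/miss
```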

It’s a bit like life itself... we land mammals carry around bags of water under our skin and our cells are still bathed in fluids, as if we were still living in the environment of the oceans hundreds of millions of years ago.

Many programming languages have been invented since the 90s but as far as I know none of them explicitly model memory latency and make reference to memory hierarchies. It’s as if they still need to maintain the illusion that they are running on the hardware of the past.

(Note: I once read about a language called Sequoia developed at Stanford that explicitly modelled the memory hierarchy. I don’t know what happened to it.)

[+] the8472|6 years ago|reply
Don't kTLS sockets[0] with crypto offloading[1], sendfile/vmsplice, device-to-device DMA transfers[2] and possibly io_uring solve all those things on linux? Granted, they're not POSIX, but they're incremental extensions.

Netflix implemented similar extensions in freebsd[3]

[0] https://www.kernel.org/doc/Documentation/networking/tls.txt [1] https://lwn.net/Articles/734030/ [2] https://lwn.net/Articles/767281/ [3] https://people.freebsd.org/~rrs/asiabsd_2015_tls.pdf
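As a concrete example of one of those incremental extensions, sendfile is already exposed in plenty of runtimes; in Python it's essentially a one-liner (Linux-specific; `os.sendfile` wraps sendfile(2), and regular-file targets need a reasonably modern kernel):

```python
# sendfile: the kernel moves bytes between file descriptors without
# the data ever passing through a userspace buffer.
import os
import tempfile

src = tempfile.NamedTemporaryFile(delete=False)
src.write(b"zero-copy payload")
src.close()

dst = tempfile.NamedTemporaryFile(delete=False)
dst.close()

with open(src.name, "rb") as fin, open(dst.name, "wb") as fout:
    # copy up to 1 MiB starting at offset 0, entirely in-kernel
    sent = os.sendfile(fout.fileno(), fin.fileno(), 0, 1 << 20)

with open(dst.name, "rb") as f:
    data = f.read()
```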

[+] pulkitsh1234|6 years ago|reply
This paper was very accessible compared to other academic papers; is there a way to find other papers like this? Maybe it's the lack of math equations and benchmarks.

I like how most of the statements are supported by examples, which makes it easier to understand (after some Googling ofc), especially for someone like me who is a million miles away from academia and a programmer who rarely has to think about kernel/CPU/memory intricacies, mostly due to working with higher level languages and abstractions on top of the OS itself.

My uneducated and naive thoughts on this paper: instead of replacing the kernel with the `parakernel`, is it possible to implement a POSIX-compatible kernel layer over the parakernel itself? That way drivers, linkers, and other abstractions wouldn't have to be re-implemented for the parakernel.

[+] bluetomcat|6 years ago|reply
We need entirely new OS abstractions to replace the dated notions of hierarchical file systems built around the metaphor of file cabinets, I/O as streams of bytes, terminals, process hierarchy. Essentially, say goodbye to the Unix model after 50 years. It would open up an entirely new world of software experimentation and craftsmanship.
[+] sytelus|6 years ago|reply
I am doubtful that hierarchical file systems have anything to do with slow disks. It's a design pattern that you encounter all over real life, a way for the human mind to manage large quantities of information. Using tags is another pattern, with its own pros and cons. The same goes for I/O streams, process hierarchies, etc. There may be better design patterns out there, but I don't see why these existing design patterns would become obsolete even if disks become as fast as RAM.
[+] AnimalMuppet|6 years ago|reply
Then... what? Do you have a positive to recommend, or just "not what we've been doing"? (It's not necessarily bad if you don't - when you're doing the wrong thing, the first step to improvement is to stop doing it.)

But it seems to me that much of this can be done within the existing structure. You don't want I/O as streams of bytes? Great. Whatever new thing you think it should be, you can build that on streams of bytes. Knock yourself out. (It may not be as fast as it would be if it were directly supported by the OS, but you can prove the value of the concept by building on top of the OS.)

Same thing with the metaphor of file cabinets (I presume you mean the hierarchical file system.) Well, does your OS let you read and write raw disk sectors? No? Fine. Create one giant file that takes up the whole disk, and manage it yourself. Try out whatever different way of managing that space that floats your boat. Again, it will be slower, but again, you can experiment and prove out your concepts right now. You don't need to wait.
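That experiment is cheap to start. A minimal sketch (illustrative only; a trivial allocator, no crash safety):

```python
# Treat one big file as a raw "disk" of fixed-size blocks and manage
# the space yourself, bypassing filesystem semantics entirely.
import tempfile

BLOCK = 4096

class FakeDisk:
    def __init__(self, path: str, nblocks: int):
        self.f = open(path, "w+b")
        self.f.truncate(nblocks * BLOCK)   # the "whole disk"
        self.free = list(range(nblocks))   # free-block list

    def alloc(self) -> int:
        return self.free.pop(0)

    def write_block(self, bno: int, data: bytes):
        self.f.seek(bno * BLOCK)
        self.f.write(data.ljust(BLOCK, b"\0"))

    def read_block(self, bno: int) -> bytes:
        self.f.seek(bno * BLOCK)
        return self.f.read(BLOCK)

disk = FakeDisk(tempfile.NamedTemporaryFile(delete=False).name, nblocks=16)
b = disk.alloc()
disk.write_block(b, b"my own on-disk layout")
```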

[+] Spivak|6 years ago|reply
What would you replace it with? At the very core you need two things: the notion of a unit of data and some metadata to address it. To be as flexible as possible, that unit of data would probably be modeled as a sequence of individually addressable bytes, but it could be more structured. Such a thing is dangerous because if your structure isn't sufficiently expressive, 20 years down the road people will end up imposing their own structure on top of obj.binary and there goes all the work.

Once you have data and handles to access it you might start wanting some convenience features like access control, locking, namespacing, constraints, relationships.

I don't disagree that we can drop many of the current filesystem semantics, but fundamentally all that really means changing is the query language used to access objects and manipulate their metadata.
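A toy version of that "blob plus metadata, addressed by query" model (names invented purely for illustration):

```python
# Objects are opaque byte blobs addressed by queries over their
# metadata tags, rather than by paths in a hierarchy.
import itertools

class ObjectStore:
    def __init__(self):
        self._blobs = {}
        self._tags = {}
        self._ids = itertools.count()

    def put(self, data: bytes, **tags) -> int:
        oid = next(self._ids)
        self._blobs[oid] = data
        self._tags[oid] = tags
        return oid

    def query(self, **tags):
        """Return the id of every object whose metadata matches all tags."""
        return [oid for oid, t in self._tags.items()
                if all(t.get(k) == v for k, v in tags.items())]

store = ObjectStore()
store.put(b"report", kind="doc", year=2019)
store.put(b"photo", kind="img", year=2019)
hits = store.query(year=2019)
```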

I also don't disagree about process hierarchy. Being able to express relationships between processes beyond parent-child natively without farming out to an external scheduler would be awesome.

[+] olliej|6 years ago|reply
...except humans instinctively organize things into hierarchies (even something as simple as counting is done hierarchically).

Just saying “we need to change this” without saying what shortcomings you want to address isn’t a super useful statement. Multiple platforms have attempted to make the user interface to their data tag-based, but that simply doesn’t scale to the amount of data people can manage with hierarchies.

Finally what the heck are you talking about in that last sentence: how does changing representation result in a new age of experimentation and (???) craftsmanship? Why is craftsmanship gated on the user level abstraction to bytes? Experimentation already happens today, what does this change?

[+] fragmede|6 years ago|reply
PalmOS had this! Applications connected to databases and read and wrote records. SQLite plays a similar role in the modern era, with a traditional hierarchical filesystem underneath the SQLite DB.

Looking at modern app-based "file" access, Google Docs and its ilk are that reimagining. The UI is a list of recent files, a small number of features, and then a search box. There's no File -> Save, nor am I forced to pick, using a folder metaphor, where I want to put it.

That there's (likely) an underlying hierarchical filesystem somewhere below in the stack seems like an implementation detail. As a programmer there's a library/middleware to be used to access resources, but once inside, object-based access already exists. Looking at video game save files, that's been the case for a while, with the state of objects (in fact, the visible objects that the user interacts with) being saved and restored from disk.

I agree it's not as satisfying as a total paradigm shift in computing on every single level, but the notion that file system, byte stream access is a holdover from a previous era ignores practical, user facing progress we've made since.
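For instance, the PalmOS-style record-store model is a few lines with the stdlib sqlite3 module, no file metaphor in sight:

```python
# Apps read and write records through a query interface; the
# hierarchical filesystem underneath is an implementation detail.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("INSERT INTO records (body) VALUES (?)", ("note one",))
db.execute("INSERT INTO records (body) VALUES (?)", ("note two",))
db.commit()

rows = db.execute("SELECT body FROM records ORDER BY id").fetchall()
```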

[+] dorlaor|6 years ago|reply
At Scylla we initially defined a new filesystem which adheres to shard-per-core, so that a single physical hyperthread has sole access to the data and thus there is no contention.

While it would have been better than current XFS, we've made AIO improvements to the latter over the years and today it's good enough for ScyllaDB.

Practically, even though Scylla has its own TCP/IP stack in userspace on top of DPDK, we learned over the years that it's OK to use the less efficient kernel TCP stack. Most of the overhead and the optimizations can still happen within the DB itself, as long as it controls the memory and the cache and manages the networking queues.

[+] jnurmine|6 years ago|reply
I think I don't understand. Is there something preventing people from experimenting and craftsmanship at this very moment?
[+] thaumaturgy|6 years ago|reply
I would like to see an updated file system architecture that's closer to a database, with tagging and all that. And I could see that extending to processes too.

But what are you imagining as alternatives for i/o and terminals?

[+] strictfp|6 years ago|reply
One idea I've been throwing around is to replace files with HTTP resources; everything is a resource. Effectively Plan 9's idea, but the time might be right now.
[+] taborj|6 years ago|reply
Every system is driven by user adoption. At this point, it will be nearly impossible to dethrone the current methodologies.

Not saying it shouldn't be done, just that it might fail.

[+] robbyt|6 years ago|reply
Correct me if I'm wrong, but isn't this what Plan 9 does?
[+] notduncansmith|6 years ago|reply
Why are these insufficient? What abstractions should we be building on?
[+] Ericson2314|6 years ago|reply
Amen to that. Unix was never a good design, and now is severely out of date. We can no longer afford to hack around it.
[+] LorenPechtel|6 years ago|reply
This takes an idea I had years ago and goes much farther with it. My idea: disk and file access is handled by the memory paging system. A 64-bit machine's segment registers can point to a space far bigger than the largest hard drives. Thus a drive ID would simply be a segment register value, and the drive would be accessed by reading/writing memory at an offset from that. A file handle would likewise be a segment register value. The result would be the use of all surplus memory for disk caching, and the paging system would take care of all disk buffering, so you could efficiently read/write small chunks of data.
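The closest existing analogue to this idea is mmap(2): map the file into the address space and let the paging system handle the caching and write-back. A minimal sketch in Python:

```python
# Map a file into memory; plain memory stores become file writes and
# the paging system does the buffering and caching.
import mmap
import tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"hello paging system")
tmp.close()

with open(tmp.name, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:
        m[0:5] = b"HELLO"   # an ordinary slice assignment, no write() call
        m.flush()           # force write-back to the file

with open(tmp.name, "rb") as f:
    result = f.read()
```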

Now let's add their approach: when you cause a page fault by accessing stuff not in memory, you get the context switch, but the actual workload could be handled by an auxiliary controller; it need not be on the CPU.

Changes: Locking parts of a file would be on a friendly basis, you would be able to get around the rules. Access to remote files with small chunks of data would still be inefficient--but the vast majority of accesses are local and remote accesses are generally documents that are read in their entirety.

[+] ktpsns|6 years ago|reply
Strictly speaking, the sentence "I/O is faster than CPU", aka "memory access is faster than computation", is nonsense, because it compares apples with bananas. One could at best say "transferring x data between CPU and SSD is faster than performing the computation f(x) on the CPU", where f still remains undefined.
[+] MrTonyD|6 years ago|reply
Well, I spent a number of years writing drivers for PC systems. Some years the I/O chips were faster than the CPU, and some years the CPU was faster than the I/O chips. DMA was usually slower, just because release cycles for CPUs tended to be faster than release cycles for I/O controllers. Eventually, most driver writers decided that it was usually better to use the CPU, even if the I/O controller was faster. That way, when the CPU got upgraded, you would automatically get a speed boost, while programming an I/O controller was both more arcane and more likely to require a complete reimplementation in a couple of years (as well as customer complaints and market-share losses).

I'm not saying that things are the same today - but it kind of sounds to me like they are. Back in the days, people were always claiming that we should switch to the newest and fastest I/O controller since CPUs were more general purpose and would therefore always be slower. It just didn't work out that way in practice.

[+] pjc50|6 years ago|reply
Interesting. It's long been the case that a "computer" pretends to be a single processor to the programmer while in fact being a cloud of semi-general processors which communicate through messages. This makes that completely explicit, giving the programmer all the power and hassle involved in speaking as directly to the devices as possible while maintaining isolation. Similar esoteric architectures are already available (e.g. Tilera, or all the way back to the Inmos Transputer).

Given the allocation of particular hardware devices - NIC, RAM, NVMe - to particular processors running a (static?) application process, it's not clear how the filesystem abstraction would work or whether that's simply delegated to the application. This is very definitely a server-focused system as no mention is made of GPUs or interactive devices.

[+] laythea|6 years ago|reply
This is kinda like what happened in the graphics API world: moving from OpenGL to Vulkan in order to "cut the fat" between the user program and the hardware.
[+] bhouston|6 years ago|reply
Very interesting shift that happened over the last 2 decades.

We likely haven't designed OSes or CPUs to match this new reality.

[+] phkamp|6 years ago|reply
Congratulations!

You have reinvented the Mainframe Channel Processor!

Your next challenge: Try to avoid reinventing the 3745 Frontend Processor.

[+] oblio|6 years ago|reply
Is this true for most real life workloads? There's that famous rule-of-thumb indicator for latencies: https://www.prowesscorp.com/computer-latency-at-a-human-scal...

It doesn't seem to be that the orders of magnitude are so close as to require totally rethinking mainstream kernels.

Or am I looking at this the wrong way?

[+] ObscureScience|6 years ago|reply
I apologize for not reading much of it yet, but could someone give a quick comparison to the exokernel idea?
[+] sinisa_cyprus|6 years ago|reply
Only mainframes had channels. The best implementation thereof is in IBM machines. It is nothing like DMA or that Intel chip or anything else. No Unix machine nor PC, not even specialised hardware like Tangent, had anything similar.

It is something like a separate FPU or MMU unit, built for total control of the peripherals, so that the CPU had little or no work to do. Don't forget that device drivers run on the CPU.

[+] wmu|6 years ago|reply
BTW, does anybody know of a paper about doing some DB ops directly on disc controllers? The other day my former colleague mentioned that he came across such a paper (maybe a blog post?), but we couldn't find it. It's a really interesting idea and I believe it's doable, although under very specific circumstances (disc-vendor-specific, sector layout aligned to DB needs, etc.).