Before the "rewrite it in Rust" comments take over the thread:
It is worth noting that the class of bugs described here (logic errors in highly concurrent state machines, incorrect hardware assumptions) wouldn't necessarily be caught by the borrow checker. Rust is fantastic for memory safety, but it will not stop you from misunderstanding the spec of a network card or writing a race condition in unsafe logic that interacts with DMA.
That said, if we eliminated the 70% of bugs that are memory safety issues, the signal-to-noise ratio for finding these deep logic bugs would improve dramatically. We spend so much time tracing segfaults that we miss the subtle corruption bugs.
> It is worth noting that the class of bugs described here (logic errors in highly concurrent state machines, incorrect hardware assumptions)
While the bugs you describe are indeed things that aren't directly addressed by Rust's borrow checker, I think the article covers more ground than your comment implies.
For example, a significant portion (most?) of the article is simply analyzing the gathered data, like grouping bugs by subsystem:
Subsystem          Bug Count   Avg Lifetime
drivers/can              446      4.2 years
networking/sctp          279      4.0 years
networking/ipv4        1,661      3.6 years
usb                    2,505      3.5 years
tty                    1,033      3.5 years
netfilter              1,181      2.9 years
networking             6,079      2.9 years
memory                 2,459      1.8 years
gpu                    5,212      1.4 years
bpf                      959      1.1 years
Or by type:
Bug Type           Count   Avg Lifetime   Median
race-condition     1,188      5.1 years   2.6 years
integer-overflow     298      3.9 years   2.2 years
use-after-free     2,963      3.2 years   1.4 years
memory-leak        2,846      3.1 years   1.4 years
buffer-overflow      399      3.1 years   1.5 years
refcount           2,209      2.8 years   1.3 years
null-deref         4,931      2.2 years   0.7 years
deadlock           1,683      2.2 years   0.8 years
And the section describing common patterns for long-lived bugs (10+ years) lists the following:
> 1. Reference counting errors
> 2. Missing NULL checks after dereference
> 3. Integer overflow in size calculations
> 4. Race conditions in state machines
All of which cover more ground than listed in your comment.
Furthermore, the 19-year-old bug case study is a refcounting error not related to highly concurrent state machines or hardware assumptions.
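On the refcounting point: Rust's userspace `Arc` (not the kernel's refcounting API, just an illustration) bakes the increment into `clone` and the decrement into `drop`, so the classic "missed a put on an error path" bug can't be written in safe code. A minimal sketch:

```rust
use std::sync::Arc;

// Returns the strong count before cloning, while a clone is alive,
// and after the clone is dropped.
fn strong_count_demo() -> (usize, usize, usize) {
    let shared = Arc::new(vec![1, 2, 3]);
    let before = Arc::strong_count(&shared);

    let extra = Arc::clone(&shared);         // increment happens inside clone()
    let during = Arc::strong_count(&shared);
    drop(extra);                             // decrement happens inside drop()

    let after = Arc::strong_count(&shared);  // back to the original count,
                                             // even on early returns or panics
    (before, during, after)
}

fn main() {
    let (before, during, after) = strong_count_demo();
    println!("counts: {} -> {} -> {}", before, during, after);
}
```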
This is I think an under-appreciated aspect, both for detractors and boosters. I take a lot more “risks” with Rust, in terms of not thinking deeply about “normal” memory safety and prioritizing structuring my code to make the logic more obviously correct. In C++, modeling things so that the memory safety is super-straightforward is paramount - you’ll almost never see me store a std::string_view anywhere for example. In Rust I just put &str wherever I please, if I make a mistake I’ll know when I compile.
> It is worth noting that the class of bugs described here (logic errors in highly concurrent state machines, incorrect hardware assumptions) wouldn't necessarily be caught by the borrow checker. Rust is fantastic for memory safety, but it will not stop you from misunderstanding the spec of a network card or writing a race condition in unsafe logic that interacts with DMA.
Rust is not just about memory safety. It also has algebraic data types, RAII, among other things, which greatly help in catching this kind of silly logic bug.
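To illustrate the RAII half of that claim: cleanup hangs off a guard's destructor, so every exit path (including early error returns) releases the resource. A toy sketch with made-up names, not any real API:

```rust
use std::cell::Cell;

// A toy "resource" whose release must not be forgotten on any exit path.
struct Guard<'a> {
    released: &'a Cell<bool>,
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        // Runs on every exit: normal return, early return, or unwind.
        self.released.set(true);
    }
}

fn do_work(fail_early: bool, released: &Cell<bool>) -> Result<(), &'static str> {
    let _guard = Guard { released };
    if fail_early {
        return Err("bailing out early"); // no manual cleanup needed here
    }
    Ok(())
}

fn main() {
    let released = Cell::new(false);
    let _ = do_work(true, &released);
    println!("released on early return: {}", released.get());
}
```

In C the equivalent is a `goto err` ladder, and forgetting one label is exactly the long-lived leak/refcount pattern the article describes.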
The concurrent state machine example looks like a locking error? If the assumption is that the state shouldn't change in the meantime, doesn't that mean the lock should continue to be held? In that case Rust's locks can help, because they embed the data, which means you can't even touch it if the lock isn't held.
Rust has more features than just the borrow checker. For example, it has a richer type system than C or C++, which a good developer can use to detect some logic mistakes at compile time. This doesn't eliminate bugs, but it can catch some very early.
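For example, modelling a state machine as an enum makes the transition function an exhaustive `match`: adding a new state without handling it becomes a compile error rather than a latent logic bug. A toy sketch (the states and events here are made up, loosely TCP-flavoured):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Conn {
    Closed,
    SynSent,
    Established,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Event {
    Connect,
    SynAck,
    Close,
}

// Exhaustive match: an unhandled (state, event) combination is a
// compile-time error, not a silent fallthrough.
fn step(state: Conn, event: Event) -> Conn {
    match (state, event) {
        (Conn::Closed, Event::Connect) => Conn::SynSent,
        (Conn::SynSent, Event::SynAck) => Conn::Established,
        (_, Event::Close) => Conn::Closed,
        (s, _) => s, // events that don't apply leave the state unchanged
    }
}

fn main() {
    let s = step(step(Conn::Closed, Event::Connect), Event::SynAck);
    println!("{:?}", s);
}
```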
> race condition in unsafe logic that interacts with DMA
It's worth noting that if you write memory safe code but mis-program a DMA transfer, or trigger a bug in a PCIe device, it's possible for the hardware to give you memory-safety problems by splatting invalid data over a region that's supposed to contain something else.
I’ve seen too many embedded drivers written by well-known companies that don't use spinlocks for data shared with an ISR.
At one point, I found serious bugs (crashing our product) that had existed for over 15 years. (And that was 10 years ago).
Rust may not be perfect, but it gives me hope that some classes of stupidity will either be avoided or made visible (like every function being unsafe because the author was a complete idiot).
> It is worth noting that the class of bugs described here (logic errors in highly concurrent state machines, incorrect hardware assumptions) wouldn't necessarily be caught by the borrow checker.
You are right about that, but even just using sum types eliminates a lot of logic errors, too.
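A concrete example of that: `Option` is a sum type standing in for a nullable pointer, and the caller can only reach the value by naming both cases, so the "missing NULL check" class from the article's table disappears. A minimal sketch:

```rust
// Option<i32> instead of a nullable pointer: the absence case is a
// value the type system forces the caller to acknowledge.
fn find_even(xs: &[i32]) -> Option<i32> {
    xs.iter().copied().find(|x| x % 2 == 0)
}

fn main() {
    // There is no way to "forget the NULL check" here: the i32 is
    // only reachable through the Some arm.
    match find_even(&[1, 3, 4]) {
        Some(n) => println!("found {}", n),
        None => println!("no even number"),
    }
}
```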
Thanks for raising this. It feels like evangelists paint a picture of Rust as basically magic which squashes all bugs. My personal experience is rather different. When I gave Rust a whirl a few years ago, I happened to play with mio for some reason I can't remember. I had some basic PoC code which didn't work as expected. While not a Rust expert, I am too much of a fan of the scratch-your-own-itch philosophy, so I started to read the mio source code. After 5 minutes, I found the logic bug. Submitted a PR and moved on. But what stayed with me was the insight that if someone like me can casually find and fix a Rust library bug, the propaganda is probably doing more work than expected. The Rust craze feels a bit like Java: just because a language baby-sits the developer doesn't automatically mean better quality. At the end of the day, the dev needs to juggle the development process. Sure, tools are useful, but overstating safety is a route better avoided.
Eh... Removing concurrency bugs is one of the main selling points of Rust. And algebraic types are a real boost in situations where you have lots of assumptions.
Interesting! We did a similar analysis on Content Security Policy bugs in Chrome and Firefox some time ago, where the average bug-to-report time was around 3 years and 1 year, respectively.
https://www.usenix.org/conference/usenixsecurity23/presentat...
Our bug dataset was way smaller, though, as we had to pinpoint all bug introductions unfortunately. It's nice to see the Linux project uses proper "Fixes: " tags.
Is the intention of the author to use the number of years bugs stay "hidden" as a metric of the quality of the kernel codebase, or of the performance of the maintainers? I am asking because at some point the article says "We're getting faster".
IMHO the fact that a bug hides for years can also be an indication that the bug had low severity/low priority, and therefore that the overall quality is very good. Unless the time represents how long it takes to reproduce and resolve a known bug, but in that case I would not say that the bug "hides" in the kernel.
> IMHO the fact that a bug hides for years can also be an indication that the bug had low severity/low priority
Not really true. A lot of very severe bugs have lurked for years and even decades. Heartbleed comes to mind.
The reason these bugs often lurk for so long is because they very often don't cause a panic, which is why they can be really tricky to find.
For example, use-after-free bugs are really dangerous. However, in most code, nothing visibly dangerous happens when the use after free is triggered, especially if the pointer is used shortly after the free and dies shortly after that. In many cases, the erroneous read or write doesn't break anything observable.
The same is true of race conditions (which are some of the longest-lived bugs). Often you won't know you have a race because contention on the lock is low, so the race is never exposed. And even when it is, it can be very tricky to reproduce, as the race isn't likely to play out the same way twice.
> IMHO the fact that a bug hides for years can also be an indication that the bug had low severity/low priority, and therefore that the overall quality is very good.
It doesn't seem to indicate that. It indicates the bug just isn't in tested code or isn't reached often. It could still be a very severe bug.
The issue with longer lived bugs is that someone could have been leveraging it for longer.
It may be just my system, but the times look like hyperlinks but aren't for some reason. It is especially disappointing that the commit hashes don't link to the actual commit in the kernel repo.
They're <strong> tags with color:#79635c on hover in the CSS. A really weird style choice for sure, but semantically they aren't meant to be links at all.
The lesson here is that people have an unrealistic view of how complex it is to write correct and safe multithreaded code on multi-core, multi-thread, asymmetric-core, out-of-order processors. This is no shade at kernel developers. Rather, I direct this at people who seem to think you can just create a thread pool in C++ and solve all your concurrency problems.
One criticism of Rust (and, no, I'm not saying "rewrite it in Rust", to be clear) is that the borrow checker can be hard to use whereas many C++ engineers (in particular, for some reason) seem to argue that it's easier to write in C++. I have two things to say about that:
1. It's not easier in C++. Nothing is. C++ simply allows you to make mistakes without telling you. Getting things correct in C++ is just as difficult as in any other language, if not more so due to the language complexity; and
2. The Rust borrow checker isn't hard or difficult to use. What you're doing is hard and difficult to do correctly.
This is why I favor cooperative multitasking and using battle-tested concurrency abstractions whenever possible. For example, the cooperative async-await of Hack, and the model of a single thread responding to a request then discarding everything in PHP/Hack, is virtually ideal (IMHO) for serving Web traffic.
I remember reading about Google's work on various C++ tooling including valgrind and that they exposed concurrency bugs in their own code that had lain dormant for up to a decade. That's Google with thousands of engineers and some very talented engineers at that.
Why? A sample size of 28% is positively huge compared to what most statistical studies have to work with. The accuracy of an extrapolation is mostly determined by underlying sampling bias, not the amount of data. If you have any basis to suggest that capturing "only bugs with fixes tags" creates a skewed sample, that would be grounds to distrust the extrapolation, but simply claiming "it's only 28%" does not make it worth noting.
This might be obvious, but there are definitely a lot of biases in the data here. It's unavoidable. E.g. many bugs will never be detected, but they will be removed when the code is rewritten, so code that is refactored more often will show a lower age for fixed bugs. Components/subsystems that are heavily used will have bugs detected faster. Some subsystems by their very nature can tolerate bugs more, while some by necessity need to be more correct (like bpf).
My conclusion is that microkernels offer some protection from random reboots, but not much against hacking.
Say the USB system runs in its own isolated process. Great, but if someone pwns the USB process they can change disk contents, intercept and inject keystrokes, etc. You can usually leverage that into a whole system compromise.
Same with most subsystems: GPU, network, file system process compromises are all easily leveraged to pwn the whole system.
Of course, by now processor manufacturers have decided that blowing holes into the CPU's security model to make it go faster was the way to go. So your microkernel is stuck on a hardware security model that looks like swiss cheese and smells like Surströmming.
From the stats we see that most bugs effectively come from the limitations of the language.
Impressive results on the model; I'm surprised they improved it with very simple heuristics. Hopefully this tool will be made available to the kernel developers and integrated into their workflow.
I don't think the problem is the kernel. Kernel bugs stay hidden because no one runs recent Kernels.
My Pixel 8 runs a stable minor release of the 6.1 kernel, which came out more than 4 years ago. Yes, fixes get backported to it, but the new features in 6.2 -> 6.19 stay unused on that hardware. All the major distros suffer from the same problem; most people are not running recent kernels in production.
Most hyperscalers are running old kernel versions to which they backport fixes. If you go to Linux conferences you hear folks from big companies mentioning 4.xx and even 3.xx kernels, in 2025.
Only tangentially related but maybe someone here can help me.
I have a server which has many peripherals and multiple GPUs. Now, I can use vfio and vfio-pci to memory-map and access their registers in user space. My question is, how could I start with kernel driver development? And I specifically mean the dev setup.
Would it be a good idea to use vfio with or without a vm to write and test drivers? How to best debug, reload and test changing some code of an existing driver?
It's an easter egg on the website that usually goes unnoticed. It's our first time on the front page of HN, so it's a little overutilized right now. Capital-C clears it.
kubb | 2 months ago:
Is this an irrational fear, I wonder? Reminds me of methods used in the political discourse.
mgaunard | 2 months ago:
In my experience it's closer to 5%.
keybored | 2 months ago:
The Rust phantom zealotry is unfortunately real.
[1] Aha, but the chilling effect of dismissing RIR comments before they are even posted...
paulddraper | 2 months ago:
Rewriting it all in Rust is extremely expensive, so it won't be done (soon).
staticassertion | 2 months ago:
Sort of. They often don't.
silver_sun | 2 months ago:
Just worth noting that it is a significant extrapolation from only "28%" of fix commits to assume that the average is 2 years.
ValdikSS | 2 months ago:
It's not uncommon for the bugs they found to be rediscovered 6-7 years later.
https://xcancel.com/spendergrsec
snvzz | 2 months ago:
One bug is all it takes to compromise the entire system.
The monolithic UNIX kernel was a good design in the 60s; today, we should know better [0][1].
0. https://sel4.systems/
1. https://genode.org/
blueboo | 2 months ago:
John Gall, The Systems Bible