I booted Linux 293k times in 21 hours

[+] Laremere|2 years ago|reply

Here they mention that each bisect ran a large number of times to try and catch the rare failure. Reminds me of a previous experience:

We had a large integration test suite. It made calls to an external service, and took ~45 minutes to fully run. Since it needed an exclusive lock on an external account, it could only run a few tests at a time. We started getting random failures, so we were in a tough spot: bisecting didn't work because the failure wasn't consistent, and you couldn't run a single version of a test enough times to verify that a given version definitely did or didn't have the failure in any practical way. I ended up triggering a spread of runs over night, and then used Bayesian statistics to hone in on where the failure was introduced. I felt mighty proud about figuring that out.

Unfortunately, it turns out the tests were more likely to pass at night when the systems were under less strain, so my prior for the failure rate was off and all the math afterwards pointed to the wrong range of commits.

Ultimately, the breakage got worse and I just read through a large number of changes trying to find a likely culprit. After finally finding the change, I went to fix it only to see that the breakage had been fixed by a different team a hour or so before. It turned out to be one of our dependencies turning on a feature by slowly increasing the probability it was used. So when the feature was on it broke our tests.

[+] ambicapter|2 years ago|reply

> Ultimately, it turned out to be one of our dependencies turning on a feature by slowly increasing the probability it was used.

Wow. I feel like this dependency should be named and shamed.

[+] yojo|2 years ago|reply

Yikes!

FWIW, I think best practice here is to hardcode all feature flags to off in the integration test suite, unless explicitly overwritten in a test. Otherwise you risk exactly these sorts of heisenbugs.

At a BigCo that’s probably going to require coordinating with an internal tools team, but worth getting it on their backlog. All tests should be as deterministic as possible, and this goes double for integration tests that can flake for reasons outside of the code.

[+] painted-now|2 years ago|reply

Man, this story sounds like you could be on my team :-) Pretty much experienced the same stuff working at BigCo!

In the end, I think the real problem is that you can't test all combinations of experiments. I don't trust "all off" or "all on" testing. In my book, you should indeed sample from the true distribution of experiments that real users see. Yes, you get flaky tests, but you also actually test what matters most, i.e. what users will - statistically - see.

[+] n49o7|2 years ago|reply

Probabilistic feature flags! Love it.

[+] ASinclair|2 years ago|reply

This is my daily life at BigCo. These bugs are the worst.

[+] tus666|2 years ago|reply

Eh this is what version pinning is for. Using edge can always lead to random breakagr, feature flags or not.

[+] rwmj|2 years ago|reply

If anyone would like to try reproducing the bug, I have a fairly solid reproducer here:

https://lore.kernel.org/lkml/[email protected]...

You will need a vmlinux or vmlinuz file from Linux 6.4 RC.

If these are the last two lines of output then congratulations you reproduced the bug:

  [    0.074993] Freeing SMP alternatives memory: 48K
  *** ERROR OR HANG ***

You could also try reverting f31dcb152a3 and rerunning the test to see if you get through 10,000 iterations.

[+] chenxiaolong|2 years ago|reply

I gave that reproducer a try and it failed after 1968 iterations.

* CPU: Intel(R) Core(TM) i9-9900KS

* qemu: qemu-kvm-7.2.1-2.fc38.x86_64

* host kernel: 6.3.6-200.fc38.x86_64

* guest kernel: 6.4.0-0.rc6.48.fc39.x86_64 (grabbed latest from mirrors.kernel.org/fedora since fedoraproject.org DNS is down and I can't access koji)

Log:

    <...>
    1966... 1967... 1968...

    [    0.075343] LSM: initializing lsm=lockdown,capability,yama,bpf,landlock,integrity
    [    0.075514] Yama: becoming mindful.
    [    0.075514] LSM support for eBPF active
    [    0.075514] landlock: Up and running.
    [    0.075514] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
    [    0.075514] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
    [    0.075514] x86/cpu: User Mode Instruction Prevention (UMIP) activated
    [    0.075514] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
    [    0.075514] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
    [    0.075514] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
    [    0.075514] Spectre V2 : Mitigation: Enhanced / Automatic IBRS
    [    0.075514] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
    [    0.075514] Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
    [    0.075514] RETBleed: Mitigation: Enhanced IBRS
    [    0.075514] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
    [    0.075514] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
    [    0.075514] TAA: Mitigation: TSX disabled
    [    0.075514] MMIO Stale Data: Vulnerable: Clear CPU buffers attempted, no microcode
    [    0.075514] SRBDS: Unknown: Dependent on hypervisor status
    [    0.075514] Freeing SMP alternatives memory: 48K
    *** ERROR OR HANG ***

I'll try reverting f31dcb152a3 and testing again later. Happy to test anything else if needed.

[+] matja|2 years ago|reply

Host kernel: 6.1.33-1-lts (Arch) Guest kernel: 6.4-rc6 defconfig QEMU: 8.0.2 (Arch) CPU: AMD EPYC 74F3

1242 iterations :

  [    0.015088] printk: console [ttyS0] enabled
  [    0.055882] ACPI: Core revision 20230331
  [    0.056124] APIC: Switch to symmetric I/O mode setup
  [    0.056867] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e204823bcd,   max_idle_ns: 440795224253 ns
  [    0.057467] Calibrating delay loop (skipped) preset value.. 6399.99 BogoMIPS (lpj=3199998)
  [    0.057924] pid_max: default: 32768 minimum: 301
  [    0.058194] LSM: initializing lsm=capability,integrity
  [    0.058464] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
  [    0.058464] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
  [    0.058464] x86/cpu: User Mode Instruction Prevention (UMIP) activated
  [    0.058464] Last level iTLB entries: 4KB 512, 2MB 255, 4MB 127
  [    0.058464] Last level dTLB entries: 4KB 512, 2MB 255, 4MB 127, 1GB 0
  [    0.058464] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
  [    0.058464] Spectre V2 : Mitigation: Retpolines
  [    0.058464] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
  [    0.058464] Spectre V2 : Spectre v2 / SpectreRSB : Filling RSB on VMEXIT
  [    0.058464] Spectre V2 : Enabling Restricted Speculation for firmware calls
  [    0.058464] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
  [    0.058464] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
  [    0.058464] Freeing SMP alternatives memory: 48K
  *** ERROR OR HANG ***

After reverting commit f31dcb152a3d0816e2f1deab4e64572336da197d :

40000 iterations (4 runs) = "test ok"

[+] Twirrim|2 years ago|reply

I've been having flashbacks to troubleshooting some particularly thorny unreliable boot stuff several years ago. In the end tracked that one down to the fact that device order was changing somewhat randomly between commits (deterministically, though, so the same kernel from the same commit would always have devices return in the same order), and part of the early boot process was unwittingly dependent on particular network device ordering due to an annoying bug. The kernel has never made any guarantees about device ordering, so the kernel was behaving just fine.

That one was.. fun. First time I've ever managed to identify dozens of commits widely dispersed within a large range, all seem to be the "cause" of the bug, while clearly having nothing to do with anything related to it, and having commits all around them be good :)

[+] swordbeta|2 years ago|reply

I wasn't able to reproduce this with 10k iterations on arch, I'm probably doing something wrong. Does the host kernel matter?

Host kernel: 6.1.33

Guest kernel: 6.4-rc6

Guest config: http://oirase.annexia.org/tmp/config-bz2213346

QEMU: 8.0.2

Hardware: AMD Ryzen 7 3700X CPU @ 4.2GHz

[+] rossjudson|2 years ago|reply

Looks like you have a trigger, but no root cause (yet). Doesn't matter anyway...revert and work it out later. The root cause bug is still in there somewhere, waiting to be triggered another way...

[+] TechBro8615|2 years ago|reply

This reminded me of another story [0] (discussed on HN [1]) about debugging hanging U-Boot when booting from 1.8 volt SD cards, but not from 3.0 volt SD cards, where the solution involved a kernel patch that actually introduced a delay during boot, by "hardcoding a delay in the regulator setup code (set_machine_constraints)." (In fact it sounded so similar that I actually checked if that patch caused the bug in the OP, but they seem unrelated.)

The story is a wild one, and begins with what looks like a patch with a hacky workaround:

> The patch works around the U-Boot bug by setting the signal voltage back to 3.0V at an opportune moment in the Linux kernel upon reboot, before control is relinquished back to U-Boot.

But wait... it was "the weirdest placebo ever!" Turns out the only reason this worked was because:

> all this setting did was to write a warning to the kernel log... the regulator was being turned off and on again by regulator code, and that writing that line took long enough to be a proper delay to have the regulator reach its target voltage.

The full story is well worth a read.

[0] https://kohlschuetter.github.io/blog/posts/2022/10/28/linux-...

[1] https://news.ycombinator.com/item?id=33370882

[+] DerekBickerton|2 years ago|reply

Before clicking I thought someone kept note of how many times Linux booted in regard to their computing habits, and not testing software. I know for me I boot roughly 3 times a day into different machines, do my work, shutdown, then rinse & repeat.

Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.

[+] bbarn|2 years ago|reply

I had a developer that I inherited from a previous manager some years ago. Made tons of excuses about his machine, the complexity of the problem, etc. I offered to check his machine out and he refused because it had "private stuff" on it. He had the same machine as the rest of the team, so since he hadn't made a commit in two weeks on a relatively simple problem, refused help from anyone, etc., we ultimately let him go.

When we looked at his PC to see if there was anything useful from the project, his browser had around a thousand tabs open. Probably 80% of them were duplicates of other tabs, linking to the same couple stack overflow and C# sites for really basic stuff. The other 20% were... definitely "private stuff".

[+] coldtea|2 years ago|reply

>Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.

If the OS and hardware drivers properly support sleep, you almost never need to do otherwise (except to install a new kernel driver or similar).

In macOS for example it hasn't been the case that you need reboot in your regular OS use for over 10+ years.

The "100+ Chrome tabs" or whatever mean nothing. They're paged out when not directly viewed anyway, and if you close just Chrome (not reboot the OS) the memory will be freed in any case...

[+] tasuki|2 years ago|reply

> Boggles my mind that people do that.

Why?

It boggles my mind that you'd reboot needlessly. My uptime is usually in the hundreds of days.

Sleep is good: I just close the lid. Next time I open the lid it immediately picks up where I left off. Why on earth would you want any other behaviour?

[+] aeyes|2 years ago|reply

> Boggles my mind that people do that.

:( I only reboot when my machine freezes or when updates require a reboot. I did a lot of on-call in my life and I saved tons of time by leaving everything open exactly as I left it during the day.

  ~> w
  11:19  up 18 days, 17:03, 9 users, load averages: 3.87 2.96 2.39

[+] ryanjshaw|2 years ago|reply

I used to shutdown regularly, then the power situation here in South Africa got so bad that we'd regularly have about 3 hours of power between interruptions.

Restoring all my work every couple of hours was becoming a pain, so I decided to re-enable hibernation support on Windows for the first time in 10 years... And surprisingly it works absolutely flawlessly.

Even on my 12yr old hardware, even if I'm running a few virtual machines. I honestly haven't seen any reason to reboot other than updates.

[+] vbezhenar|2 years ago|reply

I think that there are two types of people. One set of people (I guess, relatively small) don't trust software and prefer to reboot OS and even periodically reinstall it to keep it "uncluttered". Another set of people prefer to run and repair it forever.

I'm from the first set of people and the only reason I stopped shutting down my macbook is because I'm now keeping its lid closed (connected to display) and there's no way to turn it on without opening a lid which is very inconvenient. I still reboot it every few days, just in case.

[+] andrewaylett|2 years ago|reply

Conversely, it boggles my mind that people think 100+ tabs is a lot. I've got >500 open in Firefox at the moment, they won't go away just because I reboot or upgrade. I'll probably not look at most of them again, but they're not doing any harm just sitting there waiting to be cleaned up.

[+] bregma|2 years ago|reply

> Boggles my mind that people do that.

    $ uptime
     15:39:13 up 359 days,  2:02, 16 users,  load average: 0.09, 0.08, 0.15

16 users is 16 tmux sessions, all me doing different tasks.

[+] rmbyrro|2 years ago|reply

I get anxious just to think that restoring from sleep/hibernation may fail and I lose all my workspace state...

If there was no boot failure, nor the need to reboot after some upgrade, I'd never, ever reboot my system.

[+] Tubru3dhb22|2 years ago|reply

> Boggles my mind that people do that.

Why? I only restart my (linux) laptop every 3-4 months when I update software.

I can't think of any downside that I've experienced from this practice. I do a lot of work with data loaded in a REPL, so it's certainly saved me time having everything restored to as I left it.

[+] drbawb|2 years ago|reply

>Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual.

I would never suspend to RAM or disk, far too error-prone in my experience. (Plus serializing out 128GiB of RAM is not great.) I just leave my machine running "all the time." My most recently retired disks (WD Black 6TB) have 309 power cycles with ~57,382 power-on hours. Seems like that works out to rebooting a little less than once per week. That tracks: I usually do kernel updates on the weekend, just in case the system doesn't want to reboot unattended.

[+] trashburger|2 years ago|reply

> Then you have those types who put their machine into hibernate with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.

Hey, I'm that guy (although I put it to sleep instead)! It honestly works really well and is in stark contrast to how Linux and sleep mode interacted just ~10 years ago. It's amazing for keeping your workspace intact.

(FWIW, I also don't reboot or shutdown my desktop where it acts as a mainframe for my "dumb" laptop.)

[+] eertami|2 years ago|reply

Sleep uses almost 0 power and works flawlessly. I'm never going to waste my time, however short, waiting for a machine to boot.

[+] kelnos|2 years ago|reply

Why would I ever reboot my laptop without a need to? I only reboot when there's a kernel update, or if I'm doing something where the laptop might get lost or stolen (since powering off will lock the disk encryption).

[+] sieabahlpark|2 years ago|reply

I just have it running 24/7 and never restart for weeks. I don't even have the 100 tab problem, I just like having the immediate availability without waiting for startup.

[+] rjmunro|2 years ago|reply

I wonder if bisect is the optimal algorithm for this kind of case. Checking for the error still existing still takes an average of ≈500 iterations before a fail, checking for the error not existing takes 10,000 iterations, 20 times longer, so maybe biasing the bisect to only skip 1/20th of the remaining commits, rather than half of them would be more efficient.

[+] hoten|2 years ago|reply

> For unclear reasons the bisect only got me down to a merge commit, I then had to manually test each commit within that which took about another day.

Having hit this before myself... does anyone know how to finagle git bisect to be useful for non-linear history?

[+] eknkc|2 years ago|reply

No disrespect to Peter Zijlstra, I'm sure he has been a lot more impactful on the open source community than I will ever be but his immmediate reply caught my attention:

>> [Being tracked in this bug which contains much more detail: >> https://gitlab.com/qemu-project/qemu/-/issues/1696 ]

> Can I please just get the detail in mail instead of having to go look at random websites?

Maybe it's me but if I did boot boot linux 292.612 times to find a bug, you might as well click a link to a repository of a major open source project on a major git hosting service.

Is it really that weird to ask people online to check a website? Maybe I don't know the etiquette of these mail lists so this is a geniune question. I guess it is better to keep all conversation in a single place, would that be the intention?

[+] w-m|2 years ago|reply

292,612 is not an interesting number, it's not contained in any known integer sequence. The search in OEIS only brings up sequence A292612 (https://oeis.org/search?q=292612&fmt=data&sort=number).

[+] high_pathetic|2 years ago|reply

> it's not contained in any known integer sequence

I think this makes it interesting!

[+] mkl|2 years ago|reply

Of course it's contained in known integer sequences. The positive integers in increasing order, for example: https://oeis.org/A000027. The search doesn't know about every term in every sequence, as most are infinite and many are mostly unknown (some well-defined sequences only have a few known terms).

[+] vintagedave|2 years ago|reply

> I found the culprit, a regression in the printk time feature: https://lkml.org/lkml/2023/6/13/733

The issue hasn't been fixed yet, but if it affects you the proximate cause is known and can be reverted locally.

[+] sp332|2 years ago|reply

To save anyone clicking through the email thread: there is no resolution in there so far.

[+] allanrbo|2 years ago|reply

Running binary search on something that's flaky is a pain. "Noisy binary search" or "robust binary search" can help here: https://github.com/adamcrume/robust-binary-search

[+] hoten|2 years ago|reply

That README is light on details. How is this different from selecting some N (and hoping it is high enough) and repeating your test case that many times? You just don't have to select a value for N using this tool?

EDIT: I missed the link to the white paper.

[+] parentheses|2 years ago|reply

It makes sense n-sect (rather than bi-sect) as long as these can be run in parallel. For example, if you're searching 1000 commits, a 10-sect will get you there with 30 tests, but only 3 iterations. OTOH, a 2-sect will take more than 3x the time, but require 10 iterations.

There's ofc always some sort of bayesian approach mentioned in other answers.

[+] eichin|2 years ago|reply

Yeah, I did a 4-way search like this on gcc back in the Cygnus days - way before git, and the build step involved "me setting up 4 checkouts to build at once and coming back in a few hours" so it was more about giving the human more to dig into at comparison time than actual computer time and usage. (It always amazes me that people have bright-line tests that make the fully automated version useful, but I've also seen "git bisect exists" used as encouragement to break up changes into more sensible components...)

[+] 7ewis|2 years ago|reply

Reminds me how cosmic rays were noted to have caused computer glitches. [0]

Impressive that they managed to discover this bug.

[0] - https://www.bbc.com/future/article/20221011-how-space-weathe...

[+] NelsonMinar|2 years ago|reply

What a fantastic bug report writeup this is. Both the linked post and the backing LKML and QEMU bug report.

[+] mitko|2 years ago|reply

This is the type of content that fills me with awe after a long day of meetings. Thank you OP!

It is not clear from the article if you booted linux 290K more times _after_ the bisect, or during the bisect.

[+] gfiorav|2 years ago|reply

I once had to bisect a Rails app between major versions and dependencies. Every bisect would require me to build the app, fix the dependency issues, and so on.

And I thought I had it bad!

[+] adverbly|2 years ago|reply

Make sure you add it to the integration test suite so it doesn't get re-introduced later ;)

[+] anotherhue|2 years ago|reply

Excellent 'obsessed detective' story

[+] ValdikSS|2 years ago|reply

Uuh…

I have an old VIA-based 32-bit x86 machine (VIA Eden Esther 1 GHz from 2006), and it hangs in different times, but I managed to create a reproducer which hangs the system not so long after the boot. About 1 in 20 boots are unsuccessful.

I noticed that verbose booting reduces the chance of hanging compared to quiet boot, but does not eliminate it completely.

The similar issue was present even on Dell servers back in 2008-2009, which are based on more recent x86_64 VIA CPUs, here's an attempt to bisect the issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=507845#84 The CPU seem to enter endless loop, as the machine becomes quite hot as if it's running full-speed.

All these years I believed this is a hardware implementation issue, related either to context switch or to SSE/SSE2 blocks, as running pentium-mmx-compiled OS seem to work fine, given that no other x86 system hangs the way VIA does.

However after this post and all LKML discussions, ticks/jiffies/HZ mentions, and how is it less an issue on Intel, I'm not so sure: the issue mentioned is related to time and printk, I also associate my problem with how chatty the kernel log is (at least partially), and the person in Debian bug tracker above also bisected the code related to printf, although in libc. It could be another software bug in the kernel. If that's the case, it is present since at least 2.6 times.

I would appreciate any suggestions to try, any workarounds to apply, any advice on debugging. If anyone have spare time and interest, I can setup the dedicated machine over SSH for testing. I have a bunch of VIA hardware which is reused for my new non-commercial project and I struggle to run these machines 100% stable.

[+] ineedasername|2 years ago|reply

If there is a platonic ideal of ‘uptime’ then this has got to be its opposite.

267 comments