Here they mention that each bisect step ran the test a large number of times to try to catch the rare failure. Reminds me of a previous experience:
We had a large integration test suite. It made calls to an external service, and took ~45 minutes to fully run. Since it needed an exclusive lock on an external account, it could only run a few tests at a time. We started getting random failures, so we were in a tough spot: bisecting didn't work because the failure wasn't consistent, and you couldn't run a single version of a test enough times to verify that a given version definitely did or didn't have the failure in any practical way. I ended up triggering a spread of runs over night, and then used Bayesian statistics to home in on where the failure was introduced. I felt mighty proud about figuring that out.
Unfortunately, it turns out the tests were more likely to pass at night when the systems were under less strain, so my prior for the failure rate was off and all the math afterwards pointed to the wrong range of commits.
Ultimately, the breakage got worse and I just read through a large number of changes trying to find a likely culprit. After finally finding the change, I went to fix it only to see that the breakage had been fixed by a different team an hour or so before. It turned out to be one of our dependencies turning on a feature by slowly increasing the probability it was used. So when the feature was on, it broke our tests.
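For what it's worth, the Bayesian bookkeeping behind an approach like that fits in a few lines. The sketch below is hypothetical; in particular, the two failure rates are exactly the kind of guessed priors that bit me:

```python
def update_posterior(posterior, run_commit, passed,
                     p_fail_before=0.01, p_fail_after=0.30):
    """Bayes update of posterior[i] = P(commit i introduced the flake).

    Assumed model: a run at `run_commit` fails with probability
    p_fail_before if the culprit comes later (run_commit < i), and
    p_fail_after once the culprit is in (run_commit >= i).
    """
    new = []
    for i, prior in enumerate(posterior):
        p_fail = p_fail_before if run_commit < i else p_fail_after
        likelihood = (1 - p_fail) if passed else p_fail
        new.append(prior * likelihood)
    total = sum(new)
    return [p / total for p in new]

# Uniform prior over 10 commits; then one failure at commit 6, one pass at 3.
posterior = [1 / 10] * 10
posterior = update_posterior(posterior, run_commit=6, passed=False)
posterior = update_posterior(posterior, run_commit=3, passed=True)
# Probability mass should now concentrate on commits 4..6.
```

The catch, as the story shows, is that p_fail_before and p_fail_after are themselves assumptions: if the true failure rate varies with time of day, every update pushes the posterior confidently toward the wrong commits.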
FWIW, I think best practice here is to hardcode all feature flags to off in the integration test suite, unless explicitly overridden in a test. Otherwise you risk exactly these sorts of heisenbugs.
At a BigCo that’s probably going to require coordinating with an internal tools team, but worth getting it on their backlog. All tests should be as deterministic as possible, and this goes double for integration tests that can flake for reasons outside of the code.
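As a concrete (entirely hypothetical) sketch of that setup: the test suite swaps the real flag client for a deterministic stub that answers "off" unless a test opts in explicitly.

```python
class PinnedFlags:
    """Deterministic stand-in for a feature-flag client: every flag is off
    unless a test explicitly overrides it. The real client's name and API
    are placeholders here."""

    def __init__(self):
        self._overrides = {}

    def override(self, name, value):
        self._overrides[name] = value

    def is_enabled(self, name):
        return self._overrides.get(name, False)  # default: off

# In a test: opt in to exactly the flag under test.
flags = PinnedFlags()
flags.override("new-checkout", True)
assert flags.is_enabled("new-checkout")
assert not flags.is_enabled("slow-rollout-feature")  # stays pinned off
```

In pytest this would typically be wired in as an autouse fixture that monkeypatches the real client module, so the pin applies suite-wide without touching individual tests.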
Man, this story sounds like you could be on my team :-) Pretty much experienced the same stuff working at BigCo!
In the end, I think the real problem is that you can't test all combinations of experiments. I don't trust "all off" or "all on" testing. In my book, you should indeed sample from the true distribution of experiments that real users see. Yes, you get flaky tests, but you also actually test what matters most, i.e. what users will - statistically - see.
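If you do go the sampling route, a middle ground is to draw each run's flag configuration deterministically from a logged seed, so a flaky failure can at least be replayed. A sketch; the flag names and rollout fractions are made up:

```python
import random

def sample_flag_config(rollout_fractions, seed):
    """Draw one flag assignment from (hypothetical) production rollout
    fractions, deterministically from `seed` so a failing run can be
    reproduced exactly."""
    rng = random.Random(seed)
    return {flag: rng.random() < frac
            for flag, frac in rollout_fractions.items()}

rollouts = {"new-checkout": 0.25, "fast-search": 0.90}  # illustrative numbers
config = sample_flag_config(rollouts, seed=42)
print(f"seed=42 -> {config}")  # log the seed with every CI run
```

Same seed, same configuration: a flaky run's seed goes in the failure report, and anyone can rerun that exact combination.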
I've been having flashbacks to troubleshooting some particularly thorny unreliable-boot stuff several years ago. In the end I tracked that one down to the fact that device order was changing somewhat randomly between commits (deterministically, though, so the same kernel from the same commit would always have devices return in the same order), and part of the early boot process was unwittingly dependent on a particular network device ordering due to an annoying bug.
The kernel has never made any guarantees about device ordering, so the kernel was behaving just fine.
That one was... fun. It was the first time I've ever managed to identify dozens of commits widely dispersed within a large range, all seeming to be the "cause" of the bug while clearly having nothing to do with anything related to it, with the commits all around them being good :)
Looks like you have a trigger, but no root cause (yet).
Doesn't matter anyway...revert and work it out later.
The root cause bug is still in there somewhere, waiting to be triggered another way...
This reminded me of another story [0] (discussed on HN [1]) about debugging hanging U-Boot when booting from 1.8 volt SD cards, but not from 3.0 volt SD cards, where the solution involved a kernel patch that actually introduced a delay during boot, by "hardcoding a delay in the regulator setup code (set_machine_constraints)." (In fact it sounded so similar that I actually checked if that patch caused the bug in the OP, but they seem unrelated.)
The story is a wild one, and begins with what looks like a patch with a hacky workaround:
> The patch works around the U-Boot bug by setting the signal voltage back to 3.0V at an opportune moment in the Linux kernel upon reboot, before control is relinquished back to U-Boot.
But wait... it was "the weirdest placebo ever!" Turns out the only reason this worked was because:
> all this setting did was to write a warning to the kernel log... the regulator was being turned off and on again by regulator code, and that writing that line took long enough to be a proper delay to have the regulator reach its target voltage.
Before clicking, I thought someone had kept count of how many times they booted Linux as part of their computing habits, not as software testing. I know for me, I boot roughly 3 times a day into different machines, do my work, shut down, then rinse & repeat.
Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.
I had a developer that I inherited from a previous manager some years ago. Made tons of excuses about his machine, the complexity of the problem, etc. I offered to check his machine out and he refused because it had "private stuff" on it. He had the same machine as the rest of the team, so since he hadn't made a commit in two weeks on a relatively simple problem, refused help from anyone, etc., we ultimately let him go.
When we looked at his PC to see if there was anything useful from the project, his browser had around a thousand tabs open. Probably 80% of them were duplicates of other tabs, linking to the same couple of Stack Overflow and C# sites for really basic stuff. The other 20% were... definitely "private stuff".
>Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.
If the OS and hardware drivers properly support sleep, you almost never need to do otherwise (except to install a new kernel driver or similar).
In macOS, for example, it hasn't been the case that you need to reboot in regular OS use for over 10 years.
The "100+ Chrome tabs" or whatever mean nothing. They're paged out when not directly viewed anyway, and if you close just Chrome (not reboot the OS) the memory will be freed in any case...
It boggles my mind that you'd reboot needlessly. My uptime is usually in the hundreds of days.
Sleep is good: I just close the lid. Next time I open the lid it immediately picks up where I left off. Why on earth would you want any other behaviour?
:( I only reboot when my machine freezes or when updates require a reboot.
I did a lot of on-call in my life and I saved tons of time by leaving everything open exactly as I left it during the day.
~> w
11:19 up 18 days, 17:03, 9 users, load averages: 3.87 2.96 2.39
I used to shutdown regularly, then the power situation here in South Africa got so bad that we'd regularly have about 3 hours of power between interruptions.
Restoring all my work every couple of hours was becoming a pain, so I decided to re-enable hibernation support on Windows for the first time in 10 years... And surprisingly it works absolutely flawlessly.
Even on my 12-year-old hardware, even if I'm running a few virtual machines. I honestly haven't seen any reason to reboot other than updates.
I think there are two types of people. One set (relatively small, I'd guess) doesn't trust software and prefers to reboot the OS, and even periodically reinstall it, to keep it "uncluttered". The other set prefers to run it and repair it forever.
I'm from the first set of people and the only reason I stopped shutting down my macbook is because I'm now keeping its lid closed (connected to display) and there's no way to turn it on without opening a lid which is very inconvenient. I still reboot it every few days, just in case.
Conversely, it boggles my mind that people think 100+ tabs is a lot. I've got >500 open in Firefox at the moment, they won't go away just because I reboot or upgrade. I'll probably not look at most of them again, but they're not doing any harm just sitting there waiting to be cleaned up.
Why? I only restart my (linux) laptop every 3-4 months when I update software.
I can't think of any downside that I've experienced from this practice. I do a lot of work with data loaded in a REPL, so it's certainly saved me time having everything restored to as I left it.
>Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual.
I would never suspend to RAM or disk, far too error-prone in my experience. (Plus serializing out 128GiB of RAM is not great.) I just leave my machine running "all the time." My most recently retired disks (WD Black 6TB) have 309 power cycles with ~57,382 power-on hours. Seems like that works out to rebooting a little less than once per week. That tracks: I usually do kernel updates on the weekend, just in case the system doesn't want to reboot unattended.
> Then you have those types who put their machine into hibernate with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.
Hey, I'm that guy (although I put it to sleep instead)! It honestly works really well and is in stark contrast to how Linux and sleep mode interacted just ~10 years ago. It's amazing for keeping your workspace intact.
(FWIW, I also don't reboot or shutdown my desktop where it acts as a mainframe for my "dumb" laptop.)
Why would I ever reboot my laptop without a need to? I only reboot when there's a kernel update, or if I'm doing something where the laptop might get lost or stolen (since powering off will lock the disk encryption).
I just have it running 24/7 and never restart for weeks. I don't even have the 100 tab problem, I just like having the immediate availability without waiting for startup.
I wonder if bisect is the optimal algorithm for this kind of case. Confirming the error is present takes an average of ≈500 iterations before a failure, while confirming it's absent takes 10,000 iterations, 20 times longer, so maybe biasing the bisect to skip only 1/20th of the remaining commits, rather than half of them, would be more efficient.
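That intuition can be checked with a quick simulation using the article's rough costs (≈500 iterations to see a failure on a bad commit, 10,000 to call a commit good). Everything below is a sketch under those assumptions:

```python
def locate(n_commits, first_bad, frac, cost_bad=500, cost_good=10_000):
    """Iterations spent finding `first_bad`, with commit 0 known good and
    commit n_commits-1 known bad. Probing a bad commit fails after ~cost_bad
    iterations; declaring a commit good takes the full cost_good run.
    frac=0.5 is plain bisect; frac=0.95 probes near the known-bad end, so
    each (likely) bad answer skips only 1/20th of the remaining range."""
    lo, hi, total = 0, n_commits - 1, 0      # invariant: lo good, hi bad
    while hi - lo > 1:
        step = int((hi - lo) * frac)
        mid = lo + min(max(step, 1), hi - lo - 1)
        if mid >= first_bad:                 # probe is bad: cheap early fail
            total, hi = total + cost_bad, mid
        else:                                # probe is good: full expensive run
            total, lo = total + cost_good, mid
    assert hi == first_bad
    return total

def expected_cost(n_commits, frac):
    """Average cost over all possible positions of the first bad commit."""
    costs = [locate(n_commits, b, frac) for b in range(1, n_commits)]
    return sum(costs) / len(costs)

plain = expected_cost(1000, 0.5)
biased = expected_cost(1000, 0.95)
print(f"plain bisect ~{plain:,.0f} iterations, biased ~{biased:,.0f}")
```

Probing 95% of the way toward the known-bad end means roughly 19 of 20 probes are answered by the cheap failing case, so the asymmetric strategy should come out ahead on total iterations despite needing many more probes.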
No disrespect to Peter Zijlstra, I'm sure he has been a lot more impactful on the open source community than I will ever be, but his immediate reply caught my attention:
>> [Being tracked in this bug which contains much more detail: https://gitlab.com/qemu-project/qemu/-/issues/1696 ]
> Can I please just get the detail in mail instead of having to go look at random websites?
Maybe it's me, but if someone boots Linux 292,612 times to find a bug, the least you can do is click a link to a repository of a major open source project on a major git hosting service.
Is it really that weird to ask people online to check a website? Maybe I don't know the etiquette of these mailing lists, so this is a genuine question. I guess it is better to keep all conversation in a single place; would that be the intention?
Of course it's contained in known integer sequences. The positive integers in increasing order, for example: https://oeis.org/A000027. The search doesn't know about every term in every sequence, as most are infinite and many are mostly unknown (some well-defined sequences only have a few known terms).
That README is light on details. How is this different from selecting some N (and hoping it is high enough) and repeating your test case that many times? You just don't have to select a value for N using this tool?
It makes sense to n-sect (rather than bisect) as long as the probes can run in parallel. For example, if you're searching 1000 commits, a 10-sect gets you there with about 30 tests, but only 3 rounds. OTOH, a 2-sect takes more than 3x the wall-clock time, but needs only 10 tests.
There's of course also the sort of Bayesian approach mentioned in other comments.
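The k-sect arithmetic above generalizes: searching n commits k ways takes ceil(log_k n) rounds of k-1 parallel probes each (so a 10-sect over 1000 commits is 3 rounds and 27 probes, close to the 30 quoted above):

```python
import math

def nsect_stats(n_commits, k):
    """Rounds and total probes for a k-way search over n_commits.
    Each round runs k-1 probes in parallel, shrinking the range to ~1/k."""
    rounds = math.ceil(math.log(n_commits) / math.log(k))
    return rounds, rounds * (k - 1)

for k in (2, 10):
    rounds, probes = nsect_stats(1000, k)
    print(f"{k}-sect over 1000 commits: {rounds} rounds, {probes} probes")
```

The tradeoff is exactly the one described: a larger k burns more total test machine time for fewer sequential rounds.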
Yeah, I did a 4-way search like this on gcc back in the Cygnus days - way before git, when the build step involved me setting up 4 checkouts to build at once and coming back in a few hours - so it was more about giving the human more to dig into at comparison time than about actual computer time and usage. (It always amazes me that people have bright-line tests that make the fully automated version useful, but I've also seen "git bisect exists" used as encouragement to break changes up into more sensible components...)
I once had to bisect a Rails app across major versions and dependencies. Every bisect step would require me to build the app, fix the dependency issues, and so on.
I have an old VIA-based 32-bit x86 machine (VIA Eden Esther 1 GHz from 2006) that hangs at seemingly random times, but I managed to create a reproducer which hangs the system shortly after boot. About 1 in 20 boots is unsuccessful.
I noticed that verbose booting reduces the chance of hanging compared to quiet boot, but does not eliminate it completely.
A similar issue was present even on Dell servers back in 2008-2009, based on more recent x86_64 VIA CPUs; here's an attempt to bisect the issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=507845#84
The CPU seems to enter an endless loop, as the machine becomes quite hot, as if it's running at full speed.
All these years I believed this was a hardware implementation issue, related either to context switching or to the SSE/SSE2 blocks, since running a pentium-mmx-compiled OS seems to work fine and no other x86 system hangs the way VIA does.
However, after this post and all the LKML discussion (the ticks/jiffies/HZ mentions, and how it's less of an issue on Intel), I'm not so sure: the issue mentioned is related to time and printk, I also associate my problem (at least partially) with how chatty the kernel log is, and the person in the Debian bug tracker above also bisected down to code related to printf, although in libc. It could be another software bug in the kernel. If that's the case, it has been present since at least the 2.6 days.
I would appreciate any suggestions to try, workarounds to apply, or advice on debugging. If anyone has spare time and interest, I can set up a dedicated machine over SSH for testing. I have a bunch of VIA hardware being reused for a new non-commercial project, and I struggle to run these machines 100% stable.
Wow. I feel like this dependency should be named and shamed.
https://lore.kernel.org/lkml/[email protected]...
You will need a vmlinux or vmlinuz file from Linux 6.4 RC.
If these are the last two lines of output, then congratulations, you reproduced the bug:
You could also try reverting f31dcb152a3 and rerunning the test to see if you get through 10,000 iterations.
* CPU: Intel(R) Core(TM) i9-9900KS
* qemu: qemu-kvm-7.2.1-2.fc38.x86_64
* host kernel: 6.3.6-200.fc38.x86_64
* guest kernel: 6.4.0-0.rc6.48.fc39.x86_64 (grabbed latest from mirrors.kernel.org/fedora since fedoraproject.org DNS is down and I can't access koji)
Log:
I'll try reverting f31dcb152a3 and testing again later. Happy to test anything else if needed.
1242 iterations:
After reverting commit f31dcb152a3d0816e2f1deab4e64572336da197d: 40,000 iterations (4 runs) = "test ok"
Host kernel: 6.1.33
Guest kernel: 6.4-rc6
Guest config: http://oirase.annexia.org/tmp/config-bz2213346
QEMU: 8.0.2
Hardware: AMD Ryzen 7 3700X CPU @ 4.2GHz
The full story is well worth a read.
[0] https://kohlschuetter.github.io/blog/posts/2022/10/28/linux-...
[1] https://news.ycombinator.com/item?id=33370882
If there was no boot failure, nor the need to reboot after some upgrade, I'd never, ever reboot my system.
Having hit this before myself... does anyone know how to finagle git bisect to be useful for non-linear history?
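One partial answer, assuming Git 2.29 or newer: bisect along first-parent history only, so every probe is a state that actually existed on the main branch rather than an interior commit of some merged topic branch. The script names below are hypothetical:

```shell
# Walk only the first-parent chain of a merge-heavy history (Git 2.29+):
git bisect start --first-parent
git bisect bad HEAD
git bisect good v6.3

# Automate each probe; exiting 125 tells bisect to skip an untestable commit:
git bisect run sh -c './build.sh || exit 125; ./repro-test.sh'
```

For older Git, manually issuing `git bisect skip` on unbuildable commits gets you part of the way, at the cost of fuzzier results around the skipped range.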
I think this makes it interesting!
The issue hasn't been fixed yet, but if it affects you the proximate cause is known and can be reverted locally.
EDIT: I missed the link to the white paper.
Impressive that they managed to discover this bug.
It is not clear from the article if you booted Linux 290K more times _after_ the bisect, or during the bisect.
And I thought I had it bad!