top | item 42609775

(no title)

victorNicollet | 1 year ago

One of the hardest bugs I've investigated required the extreme version of debugging with printf: sprinkling the code with dump statements to produce about 500GiB of compressed binary trace, and writing a dedicated program to sift through it.

The main symptom was a non-deterministic crash in the middle of a 15-minute multi-threaded execution that should have been 100% deterministic. The debugger revealed that the contents of an array had been modified incorrectly, but stepping through the code prevented the crash, and it was not always the same array or the same position within that array. I suspected that the array writes were somehow dependent on a race, but placing a data breakpoint prevented the crash. So, I started dumping trace information. It was a rather silly game of adding more traces, running the 15-minute process 10 times to see if the overhead of producing the traces made the race disappear, and trying again.

The root cause was a "read, decompress and return a copy of data X from disk" method which was called with the 2023 assumption that a fresh copy would be returned every time, but was written with the 2018 optimization that if two threads asked for the same data "at the same time", the same copy could be returned to both to save on decompression time...

discuss

order

overfl0w|1 year ago

Those are the kind of bugs one remembers for life.

victorNicollet|1 year ago

Indeed. Worst week of 2023 !

But I consider myself lucky that the issue could be reproduced on a local machine (arguably, one with 8 cores and 64GiB RAM) and not only on the 32 core, 256GiB RAM server. Having to work remotely on a server would have easily added another week of investigation.