(no title)
victorNicollet | 1 year ago
The main symptom was a non-deterministic crash in the middle of a 15-minute multi-threaded execution that should have been 100% deterministic. The debugger revealed that the contents of an array had been modified incorrectly, but stepping through the code prevented the crash, and it was not always the same array or the same position within that array. I suspected that the array writes were somehow dependent on a race, but placing a data breakpoint prevented the crash. So, I started dumping trace information. It was a rather silly game of adding more traces, running the 15-minute process 10 times to see if the overhead of producing the traces made the race disappear, and trying again.
The root cause was a "read, decompress and return a copy of data X from disk" method which was called with the 2023 assumption that a fresh copy would be returned every time, but was written with the 2018 optimization that if two threads asked for the same data "at the same time", the same copy could be returned to both to save on decompression time...
overfl0w|1 year ago
victorNicollet|1 year ago
But I consider myself lucky that the issue could be reproduced on a local machine (arguably, one with 8 cores and 64GiB RAM) and not only on the 32 core, 256GiB RAM server. Having to work remotely on a server would have easily added another week of investigation.
unknown|1 year ago
[deleted]