top | item 27647237

cbzehner | 4 years ago

Incredible!

What was the hardest part?

What did you learn that you weren’t expecting?

halfer53|4 years ago

The hardest part is debugging concurrency and weird scheduling bugs. They're hard to debug because I can't reproduce them, which is quite frustrating. Over time, I found that making educated guesses, or running the code in my head, is a lot more efficient than debugging every line of the code.

grishka|4 years ago

As someone who has written and debugged a lot of concurrent code, here's another piece of advice: log everything. Log as much as you can in the part where you think the bug is. Log every line that's run if you have to. You'll then skim through the log file looking for any unexpected patterns.

This approach works better than using a debugger, even on a single-core system, because these kinds of bugs tend to be hard to reproduce and take many iterations. You don't want it to hit a breakpoint a zillion times before it finally shows itself.

And another one, tangential to what you said. Read your code line by line and ask yourself "what would break if a context switch happens right here" for each line.
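To make the "what would break if a context switch happens right here" question concrete, here is a minimal sketch in C. It is not taken from the OS being discussed; the function names are made up for illustration. The first function replays, in a single thread, one bad interleaving of two threads incrementing a shared counter, so the lost update is deterministic. The second shows the same two increments done with C11 atomics.

```c
#include <stdatomic.h>

/* Hypothetical demo: two "threads" A and B each want to do counter++.
 * The interleaving is written out by hand so the result is deterministic. */
static int lost_update_demo(void)
{
    int counter = 0;

    int a_tmp = counter;   /* thread A: load (reads 0)               */
    /* <-- context switch to thread B happens right here             */
    int b_tmp = counter;   /* thread B: load (also reads 0)          */
    counter = b_tmp + 1;   /* thread B: store, counter is now 1      */
    /* <-- context switch back to thread A                           */
    counter = a_tmp + 1;   /* thread A: store, overwrites with 1     */

    return counter;        /* one of the two increments was lost     */
}

/* Same two increments, but each one is a single indivisible
 * read-modify-write, so no switch can land between load and store. */
static int atomic_demo(void)
{
    atomic_int counter = 0;

    atomic_fetch_add(&counter, 1);  /* thread A */
    atomic_fetch_add(&counter, 1);  /* thread B */

    return atomic_load(&counter);
}
```

In a kernel the fix is often to disable interrupts or take a lock around the load and store rather than to use an atomic; the atomic is only the shortest way to show the difference here.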

sn41|4 years ago

You are right. Great work!

One somewhat related theoretical observation about concurrency is in some article by Dijkstra (I don't remember the reference right now): he says that debugging using traces (essentially printf) does not work for concurrency, since it takes multidimensional data (data present at the same time in multiple processes), linearizes it unnecessarily onto a single dimension (a sequence of printfs), and then tries to make sense of what is happening. It may not work, even if you print timestamps.

His view was to promote theoretical proofs of correctness of concurrent code, rather than debugging, but to me at least, this is much more difficult.

lowbloodsugar|4 years ago

Back on the N64, I updated the bit of code that swapped threads to write, to a ring buffer, the outgoing/incoming PCs, thread IDs and clock. Found tons of unexpected issues. In another thread you can print that or save it to disk or whatever. Or just wait till it crashes and read memory for it. Found the last crash bug with it. Meanwhile, a colleague took it, and drew color coded bars on the screen so we could see exactly what was taking the time. Those were the days. =)
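A rough sketch of what such a context-switch trace buffer might look like in C. This is not the original N64 code; the struct layout, the names, and the capacity are all assumptions made for illustration. It also assumes a single writer (the thread-swap routine, running with interrupts off), so it has no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity; a power of two so the index can be masked. */
#define TRACE_CAPACITY 8u

/* One record per thread switch: who left, who came in, where, and when. */
struct switch_event {
    uint32_t  out_tid;
    uint32_t  in_tid;
    uintptr_t out_pc;
    uintptr_t in_pc;
    uint64_t  clock;
};

struct trace_ring {
    struct switch_event events[TRACE_CAPACITY];
    uint64_t next;  /* total number of events ever recorded */
};

/* Called from the thread-swap code. Overwrites the oldest entry when full. */
static void trace_record(struct trace_ring *r, struct switch_event ev)
{
    r->events[r->next & (TRACE_CAPACITY - 1u)] = ev;
    r->next++;
}

/* Number of events currently retained (at most TRACE_CAPACITY). */
static size_t trace_count(const struct trace_ring *r)
{
    return r->next < TRACE_CAPACITY ? (size_t)r->next : (size_t)TRACE_CAPACITY;
}

/* Returns the i-th retained event, where i == 0 is the oldest. */
static const struct switch_event *trace_get(const struct trace_ring *r, size_t i)
{
    assert(i < trace_count(r));
    uint64_t first = r->next - trace_count(r);
    return &r->events[(first + i) & (TRACE_CAPACITY - 1u)];
}
```

Because the buffer is a fixed-size array in ordinary memory, it can be printed or saved from another thread, or read out of a memory dump after a crash, as described above.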

inglor_cz|4 years ago

My tip: if you can pinpoint the place where the bug occurs, trigger a SIGSEGV there and run the entire thing under Valgrind. It shows you a lot of interesting data.

sn41|4 years ago

Related: I am in academia, and these are literally the questions that I ask prospective PhD students: what was one hard thing you understood in your area of interest? What was unexpected? And what was the final enlightenment when things finally clicked into place?

nh2|4 years ago

That's right!

After spending 5 years writing the OS, if you can spare 1-2 additional days to write down your experience, it'll be extra useful.

halfer53|4 years ago

Yeah, I'll definitely share that