top | item 39439597

(no title)

MertsA | 2 years ago

Disclaimer: I work at Meta and I know a couple of the authors of the paper but my work is completely unrelated to the subject of the paper.

That's not a technical error, they mean SRAM in the CPU itself. You're right about gcj but that's kind of a moot point when investigating some reproducible CPU bug like this. The paper mentions all the acrobatics they went through when trying to find the root cause but if gcj would have been practical then it also would have been immediately clear if the gcj output reproduced the error or not. If it didn't reproduce, no big deal, try another approach. You might be right about it being easier to root cause with gdb directly but I'm not so sure. Starting out, you have no idea which instructions under what state are triggering the issue so you'd be looking for a needle in a haystack. A crashdump or gdb doesn't let you bisect that haystack so good luck finding your needle.

discuss

SomeoneFromCA|2 years ago

GCJs implementation could be so vastly different from Hotspot, you could as well rewrite it in C and check if it is failing or not. ChatGPT would generate testcase within a minute.

It all depends how good you are with x64 assembly. If you are good enough, you can easily deduce what the instructions at the location do, and can potentially simply copy-paste into an asm file, compile it and check result. Would be much faster to me.

Bluntly speaking, people who are not familiar with low-level debugging make an honest and succesful attempt to investigate a low-level issue. A seasoned kernel developer or reverse engineer would have just used gdb straight away.

MertsA|2 years ago

>A seasoned kernel developer

I think you should take another look at the author list. Chris Mason counts as a seasoned kernel developer in my book. Either way I think you're missing the point. Yes gcj would be different, but there's a decent chance it could hand you a binary that reproduces the issue that you can bisect to the root cause from there. It's one thing to run it through gcj and see if it reproduces, rewriting it in C is a ton of work compared to gcj for something that might not pan out.