I work in embedded systems, and the best advice I can offer is: resist the urge to speculate when problems arise. Stay quiet, grab an oscilloscope, and start probing the problem area. Objective measurements beat conjecture every time. It's hard to argue with scope captures that clearly show what's happening. As Jack Ganssle says, "One test is worth a thousand opinions."
jdhwosnhw|6 months ago
In my experience, the most helpful approach to performing RCA on complicated systems involves several hours, if not days, of hypothesizing and modeling prior to test(s). The hypothesis guides the tests, and without a fully formed conjecture you’re practically guaranteed to fit your hypothesis to the data ex post facto. Not to mention that in complex systems there are usually 10 benign things wrong for every 1 real issue you might find - without a clear hypothesis, it’s easy to go chasing down rabbit holes with your testing.
seidleroni|6 months ago
AnimalMuppet|6 months ago
If you can grab an oscilloscope and gather meaningful data in 15 minutes, why would you spend several hours hypothesizing and modeling?
If you can't, then spending several hours or days modeling and hypothesizing is better than just guessing.
So I think that data beats informed opinions, but informed opinions beat pure guesses.
n_u|6 months ago
When you test part of the circuit with the scope, you are using prior knowledge to determine which tool to use and where to test. You don’t just take measurements blindly. You could test a totally different part of the system because there might be some crazy coupling but you don’t. In this system it seems like taking the measurement is really cheap and a quick analysis about what to measure is likely to give relevant results.
In a different system it could be that measurements are expensive and it’s easy to measure something irrelevant. So there it’s worth doing more analysis before measurements.
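One rough way to picture that trade-off is to rank candidate measurements by how likely each is to be relevant per minute of effort. This is only a sketch: the `Probe` type, the probe names, and every number below are made up for illustration, and in practice the probabilities are just your own prior guesses.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    name: str            # hypothetical label for a measurement
    cost_minutes: float  # rough effort needed to take the measurement
    p_relevant: float    # prior guess (0..1) that it exposes the fault


def rank_probes(probes):
    """Order measurements by chance of being relevant per minute spent."""
    return sorted(probes, key=lambda p: p.p_relevant / p.cost_minutes, reverse=True)


# Illustrative values only; none of these come from a real system.
probes = [
    Probe("thermal chamber run", 480, 0.30),
    Probe("logic analyzer capture of the bus", 60, 0.50),
    Probe("scope the power rail", 15, 0.40),
]

for probe in rank_probes(probes):
    print(f"{probe.name}: {probe.p_relevant / probe.cost_minutes:.4f} per minute")
```

When one probe is far cheaper than the rest, it tends to come out on top even with a modest prior, which is the "just grab the scope" case. When every probe is expensive, the ordering depends almost entirely on the priors, so time spent improving them with analysis pays off.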
I think both cases fight what I’ve heard called intellectual laziness. It’s sometimes hard to make yourself be intellectually honest and do the proper unbiased analysis and measuring for RCA. It’s also really easy to sit around and conjecture compared to taking the time to measure. It’s really easy for your brain to say “oh it’s always caused by this thing cuz it’s junk” and move on because you want to be done with it. Is this really the cause? Could there be nothing else causing it? Would you investigate this more if other people’s lives depended on this?
I learned about this model of viewing RCA from people who work on safety critical systems. It takes a lot of energy and time to be thorough and your brain will use shortcuts and confirmation bias. I ask myself if I’m being lazy because I want a certain answer. Can I be more thorough? Is there a measurement I know will be annoying so I’m avoiding it?
MSFT_Edging|6 months ago
It's helped a dozen times so far: essentially playing 20 questions, being able to point to the exact problem, and having it resolved quickly.
This is a semi-embedded system. FPGAs, SoCs, drivers, userspace, userspace drivers, etc. Lots of stuff to go wrong; speculation gives a place to start.
organsnyder|6 months ago
teiferer|6 months ago
That's not to say that at some point you don't need to get your hands dirty. But it's equally important to balance that with thinking and theory building. It's whoever gets that balance right who will be most effective at debugging, not the one with the dirtiest hands.
cushychicken|6 months ago
The most dangerous words during debug are: “…but it should work this way!” This is a mantra I try hard to instill in all EEs I mentor.
“Should” isn’t worth a damn to me. You test your way out of hardware bugs - you don’t logic your way out.
fedeb95|6 months ago
SkyPuncher|6 months ago
Speculation is fine, but you need to ground it in reality.