blurker | 2 years ago

> Back when I was a junior developer, there was a smoke test in our pipeline that never passed. I recall asking, “Why is this test failing?” The Senior Developer I was pairing with answered, “Ohhh, that one, yeah it hardly ever passes.” From that moment on, every time I saw a CI failure, I wondered: “Is this a flaky test, or a genuine failure?”

This is a really key insight. A test that's known to fail erodes trust in the entire suite, and real failures end up being waved off as flakes, which amounts to false negatives. If I couldn't get the time budget to fix the test, I'd delete it. I think a flaky test is worse than nothing.

jiggawatts | 2 years ago

"Normalisation of Deviance" is a concept that will change the way you look at the world once you learn to recognise it. It's made famous by Richard Feynman's report about the Challenger disaster, where he said that NASA management had started accepting recurring mission-critical failures as normal issues and ignored them.

My favourite one is: Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors. There's a decent chance that those errors are being ignored by everyone responsible for the system, because they're "the usual errors".

I've seen this go as far as cluster nodes crashing multiple times per day and rebooting over and over, causing mass fail-over events across services. That was written up as "the system is usually this slow", in the sense of "there is nothing we can do about it."

It's not slow! It's broken!

ertian | 2 years ago

Oof, yes. I used to be an SRE at Google, with oncall responsibility for dozens of servers maintained by a dozen or so dev teams.

Trying to track down issues with requests that crossed or interacted with 10-15 services, when _all_ those services had logs full of 'normal' errors (that the devs had learned to ignore) was...pretty brutal. I don't know how many hours I wasted chasing red herrings while debugging ongoing prod issues.

pxc | 2 years ago

You don't even have to go as far from your desk as a remote server to see this happening, or open a log file.

The whole concept of addressing issues on your computer by rebooting it is 'normalization of deviance', and yet IT support people will rant and rave about how it's the users' fault for not rebooting their systems whenever users with high uptimes complain about performance problems or instability, as if it weren't the IT department itself that loaded those users' computers to the gills with software that's full of memory leaks, litters the disk with files, etc.

gregmac | 2 years ago

I agree with what you're saying, but this is a bad example:

> Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors.

It's true, but IME those "errors" are mostly worth ignoring. Developers, in general, are really bad at logging, and so most logs are full of useless noise. Doubly so for most "enterprise software".

The trouble is context. E.g.: "malformed email address" is indeed an error that prevents the email process from sending a message, so it's common for someone to put in a log.Error() call for it. In many cases, though, that's just a user problem; the system operator isn't going to fix it and in fact can't. "Email server unreachable", on the other hand, is definitely an error the operator should care about.
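A rough sketch of that split in Go, using log/slog severities; the mailer package, the SMTP host, and the field names are made up for illustration, not taken from any real system:

    package mailer

    import (
        "fmt"
        "log/slog"
        "net/mail"
        "net/smtp"
    )

    // Send keeps user-caused failures at WARN and reserves ERROR for
    // failures the operator can actually act on.
    func Send(to, body string) error {
        if _, err := mail.ParseAddress(to); err != nil {
            // Malformed address: a user problem, not an operational one.
            slog.Warn("rejected malformed email address", "to", to, "err", err)
            return fmt.Errorf("invalid recipient %q: %w", to, err)
        }
        msg := []byte("To: " + to + "\r\n\r\n" + body)
        // smtp.example.com and the sender address are placeholders.
        if err := smtp.SendMail("smtp.example.com:25", nil, "noreply@example.com", []string{to}, msg); err != nil {
            // Unreachable mail server: the operator can and should fix this.
            slog.Error("email server unreachable", "err", err)
            return err
        }
        return nil
    }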

I haven't actually done it yet, but someday I want to rename that call to log.PageEntireDevTeamAt3AM() and see what happens to log quality.

dzhiurgis | 2 years ago

Horrors from enterprise: a few weeks ago a solution architect forced me to roll back a fix (a basic null check) that they "couldn't test" because it's not a "real world" scenario (testers creating incorrect data would crash the business process for everyone)...

koonsolo | 2 years ago

Your system could also retry the flaky tests. If a test still fails after 3 or 5 runs, you can be fairly sure it's a genuine defect.
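A minimal sketch of that idea as a Go test helper; retryFlaky, the attempt count, and the placeholder check are all hypothetical, just to show the shape:

    package flaky

    import "testing"

    // retryFlaky re-runs check up to maxAttempts times and only fails the
    // test if every attempt fails, i.e. the failure is reproducible rather
    // than flaky.
    func retryFlaky(t *testing.T, maxAttempts int, check func() error) {
        t.Helper()
        var lastErr error
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            if lastErr = check(); lastErr == nil {
                return // passed at least once, treat as a (flaky) success
            }
            t.Logf("attempt %d/%d failed: %v", attempt, maxAttempts, lastErr)
        }
        t.Fatalf("failed on all %d attempts, likely a real defect: %v", maxAttempts, lastErr)
    }

    func TestSomethingFlaky(t *testing.T) {
        retryFlaky(t, 3, func() error {
            return nil // the real flaky operation would go here
        })
    }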

no_wizard | 2 years ago

This is the power of GitHub Actions, where each workflow is its own YAML file.

If you have flaky tests, you can isolate them into their own workflow and deal with them separately from the rest of your CI process.

It does wonders for this. The idea of a monolithic CI job seems backward to me now.
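As an illustration only, a quarantine workflow along those lines might look roughly like this; the file name, trigger, and test selector are all made up:

    # .github/workflows/flaky-tests.yml
    name: flaky-tests
    on: [push, pull_request]

    jobs:
      quarantined-tests:
        runs-on: ubuntu-latest
        # Known-flaky tests run here so they can't block the main pipeline.
        continue-on-error: true
        steps:
          - uses: actions/checkout@v4
          - name: Run quarantined tests
            run: go test -run 'Flaky' ./...   # hypothetical test selector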