blurker|2 years ago
This is a really key insight. It erodes trust in the entire test suite and will lead to false negatives. If I couldn't get the time budget to fix the test, I'd delete it. I think a flaky test is worse than nothing.
jiggawatts|2 years ago
My favourite one is: Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors. There's a decent chance that those errors are being ignored by everyone responsible for the system, because they're "the usual errors".
I've seen this go as far as cluster nodes crashing multiple times per day and rebooting over and over, causing mass fail-over events of services. That was written up as "the system is usually this slow", in the sense of "there is nothing we can do about it."
It's not slow! It's broken!
ertian|2 years ago
Trying to track down issues with requests that crossed or interacted with 10-15 services, when _all_ those services had logs full of 'normal' errors (that the devs had learned to ignore) was...pretty brutal. I don't know how many hours I wasted chasing red herrings while debugging ongoing prod issues.
pxc|2 years ago
The whole concept of addressing issues on your computer by rebooting it is 'normalization of deviance'. And yet IT people in support will rant and rave about how it's the users' fault for not rebooting whenever they complain of performance problems or instability on high-uptime machines, as if it weren't the IT department itself that loaded those computers to the gills with software full of memory leaks, litters the disk with files, etc.
gregmac|2 years ago
> Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors.
It's true, but IME those "errors" are mostly worth ignoring. Developers, in general, are really bad at logging, and so most logs are full of useless noise. Doubly so for most "enterprise software".
The trouble is context. E.g., "malformed email address" is indeed an error that prevents the email process from sending a message, so it's common for someone to put in a log.Error() call for that. In many cases, though, that's just a user problem. The system operator isn't going to address it, and in fact can't. "Email server unreachable", on the other hand, is definitely an error the operator should care about.
I haven't actually done it yet, but someday I want to rename that call to log.PageEntireDevTeamAt3AM() and see what happens to log quality.
no_wizard|2 years ago
If you have flaky tests, you can isolate them in their own workflow and deal with them separately from the rest of your CI process.
It does wonders for this problem. The idea of a monolithic CI job seems backwards to me now.
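One concrete way to do that isolation, sketched in Go using its build-constraint mechanism (the tag name `flaky`, the package name, and the test are all illustrative choices, not a convention): quarantine known-flaky tests in files guarded by a build tag, so the default test run never sees them, while a separate non-blocking CI job still exercises them.

```go
//go:build flaky

// flaky_test.go: known-flaky tests live here, excluded from the default
// build. The main (blocking) CI job runs `go test ./...` and skips this
// file entirely; a separate, non-blocking job runs
// `go test -tags=flaky ./...` so the tests keep producing signal
// without eroding trust in the main suite.
package orders_test

import "testing"

func TestEventuallyConsistentCount(t *testing.T) {
	// ...the flaky assertions, quarantined until someone gets the
	// time budget to fix them properly...
}
```

The non-blocking job's history then doubles as a record of how flaky each test actually is, which helps make the case for that time budget.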