Ask HN: How do you manage flaky E2E tests at scale?

3 points| forcepushed | 19 days ago

I’m curious how folks and teams here deal with flaky end-to-end tests once a product and test suite get "big" (thousands of test cases).

I’ve seen a few patterns over the years: retries everywhere, quarantining tests, rewriting flows, adding more waits than anyone feels good about, or just slowly losing trust in CI signal. None of them feel great once you have hundreds or thousands of tests running across multiple environments.
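The retry-and-quarantine pattern mentioned above can be sketched in a few lines: rerun each test several times, compute a flakiness rate, and automatically quarantine anything above a threshold. This is a generic sketch, not any particular framework's API; the names and the 20% threshold are illustrative.

```python
from __future__ import annotations

def flakiness_rate(results: list[bool]) -> float:
    """Fraction of runs that disagree with the majority outcome."""
    if not results:
        return 0.0
    passes = sum(results)
    minority = min(passes, len(results) - passes)
    return minority / len(results)

# Illustrative cutoff: tests that disagree with themselves on more than
# 20% of runs get quarantined instead of blocking CI.
QUARANTINE_THRESHOLD = 0.2

def triage(history: dict[str, list[bool]]) -> dict[str, str]:
    """Label each test 'keep' or 'quarantine' from its recent pass/fail runs."""
    return {
        name: "quarantine" if flakiness_rate(runs) > QUARANTINE_THRESHOLD else "keep"
        for name, runs in history.items()
    }
```

A test that passes 10/10 runs scores 0.0 and is kept; one that passes 3/5 scores 0.4 and is quarantined. The point of using the minority fraction (rather than raw failure rate) is that a test failing 10/10 times isn't flaky, it's just broken, and deserves a different bucket.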

I’m especially interested in how QA and engineering teams split responsibility. Do you treat flakiness as a test problem, a product problem, or infrastructure noise? At what point do you decide a test is no longer worth keeping?

Asking partly out of personal frustration and partly because I’ve been working on tooling around browser automation and want to sanity check the problems I’m seeing against the pains others are feeling day to day.

Would love to hear real stories from people running E2E at scale, what actually worked, and what you wish you had done earlier.

Thanks in advance.

7 comments

order

alexandriaeden|5 days ago

I've been working on an approach where the test framework caches selector alternatives at the time of a successful match: the element's aria-label, role, ID, class, and text content. When the primary selector fails, it falls back to the alternatives ranked by confidence score, reporting success only if the healed selector has a confidence score above 60%. Basically, self-healing selectors with a confidence gate to avoid silent false passes. Combined with running each test N times and computing a flakiness percentage, you can automatically quarantine tests above a threshold rather than manually triaging them. The hardest part is distinguishing "the test is flaky" from "the product has a race condition": same symptom, totally different fix.
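A minimal sketch of the confidence-gated fallback described above. The `try_locator` callback stands in for the actual page query so the ranking logic is testable without a browser; the `Alternative` type, the gate value, and the callback shape are all assumptions, not the commenter's real implementation.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Alternative:
    kind: str          # e.g. "aria-label", "role", "id", "class", "text"
    value: str
    confidence: float  # 0.0-1.0, assigned when the cache entry is written

CONFIDENCE_GATE = 0.60  # below this, fail loudly rather than heal silently

def heal(
    primary_found: bool,
    alternatives: list[Alternative],
    try_locator: Callable[[str, str], bool],
) -> Optional[tuple[str, str]]:
    """Return the (kind, value) of the locator that matched, or None.

    `try_locator(kind, value)` is a caller-supplied callback that actually
    queries the page; stubbed here so the fallback logic stands alone.
    """
    if primary_found:
        return ("primary", "")
    # Try cached alternatives in descending confidence order.
    for alt in sorted(alternatives, key=lambda a: a.confidence, reverse=True):
        if alt.confidence < CONFIDENCE_GATE:
            break  # everything after this is below the gate: real failure
        if try_locator(alt.kind, alt.value):
            return (alt.kind, alt.value)
    return None
```

The gate is what separates this from plain selector fallback: a low-confidence match is reported as a failure, which is exactly the "avoid silent false passes" property the comment is after.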

benoau|19 days ago

I think testing via browser automation is fantastic; API-driven web browsers are amazing tools. But they're also the source of a lot of the flakiness, because of the inherent difficulty of determining when a page is completely in a state where it can do what you're expecting of it. Playwright improves a lot on this over Puppeteer, but it's imperfect. I often end up waiting for a selector, a load state, or a function that evaluates when it's actually okay to proceed, and sometimes I'll use a combination to really make sure, because what I'm really waiting for is not just reaching some state but what actually happens after it does.

forcepushed|17 days ago

Yes, that's exactly how most of my experience here goes. Each test ends up becoming a sort of custom implementation to handle specific use cases around interactivity and availability.

Do you feel yourself wanting to extract this logic (waiting for a selector, a load state, or evaluating a function, etc.) into some shared utility and then just push all of your interactions through it as a sort of feedback engine for future problems?
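One shape such a shared utility could take is a single polling helper that combines several readiness signals behind one call. This is a hypothetical pure-Python sketch of the pattern; in Playwright you'd compose `wait_for_selector`, `wait_for_load_state`, and `wait_for_function` instead of polling by hand.

```python
import time

def wait_until(checks, timeout=5.0, interval=0.05):
    """Poll until every check in `checks` returns True, or raise TimeoutError.

    `checks` is a list of zero-arg callables, so callers can mix signals:
    element visibility, network idle, an app-specific readiness flag, etc.
    Returning which checks failed is the "feedback engine" part: it tells
    you what the page was actually stuck on.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(check() for check in checks):
            return True
        time.sleep(interval)
    failed = [i for i, check in enumerate(checks) if not check()]
    raise TimeoutError(f"checks {failed} still failing after {timeout}s")
```

The value of funneling every interaction through one helper is less about the helper itself and more about the error reporting: when a test does flake, you learn which readiness signal never arrived, instead of a generic timeout.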

alexgandy|15 days ago

As far as "what to do with flaky tests" goes, I err on the side of just outright killing the tests. Unless it's an absolutely crucial business case, I'd rather have no test than a test that slowly degrades trust.

For what it's worth, I've been working on a side-project to try to help with almost this exact situation and would be really interested if it could help you; https://gaffer.sh

apothegm|19 days ago

You find out why they’re flaky, and fix them. If they’re flaky in testing, there’s probably something flaky in real world use. If they’re flaky because of something specific to the test environment then they’re not testing what you think they’re testing, so fix them or get rid of them.

forcepushed|17 days ago

Definitely. Deciding between fixing them and deleting them is important, not only because it might save or waste time, but also because it could help prevent future occurrences down the road.

How much time do you think you spend making this decision versus just fixing the tests, and how often do you find yourself adding back tests you quarantined or deleted?