Ask HN: How do you keep system context from rotting over time?
35 points| kennethops | 1 month ago
I know there are a lot of tools focused on root cause analysis after things break. Cool, but that’s not what’s wearing me down. What actually hurts is the constant context switching while trying to understand how a system fits together, what depends on what, and what changed recently.
As systems grow, this feels like it gets exponentially harder. Add logs and now you’ve created a million new events to reason about. Add another database and suddenly you’re dealing with subnet constraints or a DB choice that’s expensive as hell, and no one noticed until later. Everyone knows their slice, but the full picture lives nowhere, so bit rot just keeps creeping in.
This feels even worse now that AI agents are pushing large amounts of code and config changes quickly. Things move faster, but shared understanding falls behind even faster.
I’m honestly stuck on how people handle this well in practice. For folks dealing with real production systems, what’s actually helped? Diagrams, docs, tribal knowledge, tooling, something else? Where does it break down?
dlcarrier|1 month ago
A laptop computer is extremely complex, but is actively developed and maintained by a small number of people, built on parts themselves developed by a small number of people, many of which are themselves built on parts themselves developed by a small number of people, and so on and so forth.
This works well in electronics design, because everything is documented and tested to comply with the documentation. You'd think this would slow things down, but developing a new generation of a laptop takes fewer man hours and less calendar time than developing a new generation of any software of similar complexity running on it, despite the laptop pushing up against the limits of physics. Technical debt adds up really fast.
The top-level designers only have access to what the component manufacturers have published, and not to their internal designs, but that doesn't matter because the publications include correct and relevant data. When the component manufacturer comes out with something new, they use documentation from their own suppliers to design the new product.
As long as each component's documentation is complete and accurate, it will meet all of the needs of anyone using that component. Diving deeper would only be necessary if something is incomplete or inaccurate.
xyzzy_plugh|1 month ago
2. Systems should be explicit, not implicit. Configuration should be explicit wherever possible. Implicit behavior should be documented.
3. Living documentation adjacent to your systems. Write markdown files next to your code. If you keep systems documentation somewhere else (like some wysiwyg knowledge system bullshit) then you must build a markdown-to-whatever sync job (where the results are immutable) else the documentation is immediately out of date, and out of date documentation is just harmful noise.
4. If it's dead, delete it. You have version control for a reason. Don't keep cruft around. If there's a subnet that isn't being used, delete it.
Lastly, if you find yourself in this situation and have none of the above, ask yourself if you really have the agency to fix it -- and I mean really fix it, no half measures. If you do, then do so. If you don't, then your options are to stop caring or find a new job. The alternative is a recipe for burnout.
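To make the "markdown files next to your code" point concrete, here is a minimal sketch of a check you could run in CI. It is an illustration only: the function name `find_undocumented_dirs`, the `README.md` file name, and the list of extensions that count as "code" are all assumptions, not anything prescribed above.

```python
from pathlib import Path

# Assumption: a directory counts as a system component if it directly
# contains at least one file with one of these extensions.
CODE_SUFFIXES = {".py", ".go", ".tf", ".yaml", ".yml"}


def find_undocumented_dirs(root, doc_name="README.md"):
    """Return sorted relative paths of directories holding code but no doc file."""
    root = Path(root)
    missing = []
    # Check the root itself plus every directory beneath it.
    for directory in [root, *(p for p in root.rglob("*") if p.is_dir())]:
        files = [f for f in directory.iterdir() if f.is_file()]
        has_code = any(f.suffix in CODE_SUFFIXES for f in files)
        has_doc = any(f.name == doc_name for f in files)
        if has_code and not has_doc:
            missing.append(directory.relative_to(root).as_posix())
    return sorted(missing)
```

Failing the build when the returned list is non-empty only proves a doc file exists, not that it is accurate, so this is a floor rather than a fix.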
v_CodeSentinal|1 month ago
The biggest issue isn't just that documentation gets outdated; it's that the 'mental model' of the system only exists accurately in a few engineers' heads at any given moment. When they leave or rotate, that model degrades.
We found the only way to really fight this is to make the system self-documenting in a semantic way—not just auto-generated docs, but maintaining a live graph of dependencies and logic that can be queried. If the 'map' of the territory isn't generated from the territory automatically, it will always drift. Manual updates are a losing battle.
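As a rough sketch of what a queryable dependency graph might look like, assuming the edges are fed in by some automated source (IaC state, service discovery, traces) rather than typed by hand. The class name `DependencyGraph` and the service names in the usage note are hypothetical, and this is not a description of any particular product.

```python
from collections import defaultdict, deque


class DependencyGraph:
    """Tiny in-memory dependency graph with a 'what breaks if X breaks' query."""

    def __init__(self):
        self._deps = defaultdict(set)        # node -> things it depends on
        self._dependents = defaultdict(set)  # node -> things that depend on it

    def add_dependency(self, service, depends_on):
        self._deps[service].add(depends_on)
        self._dependents[depends_on].add(service)

    def impacted_by(self, node):
        """Return everything that directly or transitively depends on node."""
        seen, queue = set(), deque([node])
        while queue:
            current = queue.popleft()
            for dependent in self._dependents.get(current, ()):
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        seen.discard(node)  # a cycle could otherwise report the node itself
        return seen
```

For example, after `g.add_dependency("orders-api", "orders-db")` and `g.add_dependency("checkout-web", "orders-api")`, calling `g.impacted_by("orders-db")` returns both `orders-api` and `checkout-web`. The hard part is the ingestion that keeps the edges current, which this sketch leaves out entirely.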
sonofhans|1 month ago
You’re describing the infrastructure of a large system — it’s a custom-built machine designed to serve a custom purpose. There are no examples in the world of things like that working without a lot of human intervention.
This is compounded, as you say, by increasing demands placed on the system: “Now it must react to AIs committing code,” or “Our customer base is growing but your Ops budget is decreasing.” This means the system needs more humans, not fewer.
reactordev|1 month ago
Adding more humans seems like an immediate fix but systems of systems exist without humans.
Observability, automation, infrastructure as code, audits: all these things complement the “wtf happened?” scenario, and all of these are systems. Not humans.
The SRE needs signal from noise.
gtirloni|1 month ago
It feels a bit dishonest to be asking for advice on how to tackle the complexity problem for SREs when you're actually providing a paid solution for the very same problem.
shaneoh|1 month ago
nitwit005|1 month ago
You then eventually have that same pattern happen with services, where people give up on mapping the full thing out as well.
What I've done for my current team is to list the "downstream" services, what we use them for, who to contact, etc. It only goes one level deep, but it's something that someone can read quickly during an incident.
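A one-level-deep list like that can be kept as plain data next to the code and rendered on demand. The sketch below is only an assumption about the shape of such a list; the `Downstream` fields mirror the three things mentioned above (what we use it for, who to contact), and the names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Downstream:
    name: str      # service we call
    used_for: str  # why we call it
    contact: str   # who to reach during an incident


def render_downstreams(entries):
    """Render one line per downstream service, sorted by name."""
    lines = []
    for entry in sorted(entries, key=lambda e: e.name):
        lines.append(f"{entry.name}: {entry.used_for} (contact: {entry.contact})")
    return "\n".join(lines)
```

Keeping it this flat is deliberate: anything deeper than one level tends to go stale, and a single line per service is about what someone can absorb mid-incident.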
kennethops|1 month ago
htrp|1 month ago
I'm not sure I've seen any good vendors but I remember seeing a reverse devops tool posted a few days ago that would reverse engineer your VMs into Ansible code. If that got extended to your entire environment, that would almost be an auto documenting process.
dexdal|1 month ago
kennethops|1 month ago
I will check that tool out.
sinzin91|1 month ago
Beyond a certain scale, you can't keep a mental model of the entire system in your head. What matters then is accessing accurate, up to date information the moment you need it (troubleshooting an unfamiliar subsystem, making a cross-cutting change). Table stakes are IaC, APM, structured logging.
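For readers who haven't set up structured logging, here is a minimal sketch using only Python's standard `logging` module. The `JsonFormatter` class and the `context` key are assumptions made for illustration; most teams would reach for an existing library instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object instead of free text."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumption: callers pass structured fields via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)
```

With the formatter attached to a handler, a call such as `logger.info("deploy finished", extra={"context": {"service": "orders-api"}})` produces a line that can be filtered by `service` rather than grepped.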
Code-generated docs sound great in theory, but a huge category of knowledge never lives in code (RFCs, deployment processes, how to get prod access). Humans have to write and maintain those. That requires a culture where people believe their effort matters (and ideally gets rewarded). Without that, docs rot regardless of tooling.
Then there's discovery. The docs often exist; they're just buried. RAG tools actually help here imo. When people can find what you wrote, you're more motivated to keep it accurate. As we increasingly rely on agents to tell us about our system, we're going to have to ensure the docs they're reading are not woefully out of date or inaccurate.
liveoneggs|1 month ago
All of those endpoints should be documented in an environment variable or similar as well.
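One way to read that, sketched under the assumption that every endpoint lives in a variable ending in `_ENDPOINT` (a naming convention invented here, not something stated above): the environment itself becomes the list of what a service talks to.

```python
import os


def list_endpoints(environ=None, suffix="_ENDPOINT"):
    """Map lowercase service name -> endpoint for every *_ENDPOINT variable."""
    env = os.environ if environ is None else environ
    return {
        name[: -len(suffix)].lower(): value
        for name, value in sorted(env.items())
        # Skip empty values and a bare "_ENDPOINT" with no service name.
        if name.endswith(suffix) and len(name) > len(suffix) and value
    }
```

Dumping `list_endpoints()` at startup, or exposing it on a debug route, gives a per-service dependency list that cannot drift from what the process is actually configured to call.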
The breakdown is when you don't instrument the same tooling everywhere.
Documentation is generally out of date by the time you finish writing it so I don't really bother with much detail there.
kennethops|1 month ago
amadeuswoo|1 month ago
kennethops|1 month ago
linux4dummies|1 month ago
kennethops|1 month ago
canhdien_15|1 month ago
BOOSTERHIDROGEN|1 month ago
kennethops|1 month ago