FroshKiller | 3 months ago
The app should write records to a database? Fine. Here's where you configure the connection. The app in production is slow because the database server is weak? Not my problem, talk to your DBA.
The app should expose an HTTP endpoint for liveness probes? Fine. It's served from the path you specified. You reused it for an external outage check, and that's reporting the service is down because the route timed out due to your ops team screwing up the reverse proxy? Literally not my problem, I could not care less.
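For readers who have not built one, here is a minimal sketch of the kind of liveness endpoint being described, using only Python's standard library. The path `/healthz`, the JSON body, and the helper names `liveness_response` and `make_server` are all assumptions made for illustration; nothing in the thread specifies them.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical path; in a real app this would come from configuration.
LIVENESS_PATH = "/healthz"


def liveness_response(path, liveness_path=LIVENESS_PATH):
    """Return (status_code, body) for a GET request on `path`.

    A 200 here only means "the process is up and can answer HTTP".
    Nothing downstream (database, reverse proxy) is checked.
    """
    if path == liveness_path:
        return 200, json.dumps({"status": "alive"})
    return 404, json.dumps({"error": "not found"})


class LivenessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = liveness_response(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # Keep the sketch quiet; a real app would log requests.


def make_server(host="127.0.0.1", port=8080):
    """Build (but do not start) the server; call .serve_forever() to run it."""
    return ThreadingHTTPServer((host, port), LivenessHandler)
```

The probe answers from inside the process, so a timeout seen by an outside checker can just as easily come from whatever sits in front of the app as from the app itself.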
jiggawatts | 3 months ago
Okay, so, what is the DBA to do? Double the server capacity to "see if that helps"?
It didn't, and now the opex of the single most expensive cloud server is 2x what it was and is starting to dwarf everything else... combined.
Maybe it's "just" a bad query. Which one? Under what circumstances? Is it supposed to be doing that much work because that's what the app needs, or is it an error that it's sucking down a gigabyte of data every few minutes?
How is the DBA to know what the use cases are?
The best tools for solving these runtime performance problems are modern APM tools like Azure App Insights, OpenTelemetry, or the like.
Some of these products can be injected into precompiled apps using "codeless attach" methods, and this works... okay at best.
So SysOps takes your code, layers on an APM, sees a long list of potential issues... and the developers "don't care" because they think that this is a SysOps thing.
But if the developer takes an interest and is an involved party, then they can integrate the APM software development kit, "enrich" the logged data, log user names, internal business metadata, etc... They log on to the APM web portal and investigate how their app is running in production, with real-world users instead of synthetic tests, with real data, with "noisy neighbours", and all that.
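To make "enriching" concrete, here is a rough sketch using Python's standard `logging` module as a stand-in for an APM SDK. It is not the API of App Insights or OpenTelemetry; the field names (`user_name`, `tenant`, `report_id`, `duration_ms`) and the helper `log_report_run` are hypothetical and chosen only for the example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including enrichment fields."""

    # Hypothetical business fields; a real app would choose its own.
    ENRICHED_FIELDS = ("user_name", "tenant", "report_id", "duration_ms")

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.ENRICHED_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(stream=None):
    """Create a logger that writes enriched JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger("reports")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def log_report_run(logger, user_name, tenant, report_id, duration_ms):
    """Log one report execution together with who ran it and how long it took."""
    logger.info(
        "report executed",
        extra={
            "user_name": user_name,
            "tenant": tenant,
            "report_id": report_id,
            "duration_ms": duration_ms,
        },
    )
```

Only the developer knows which business fields are worth attaching, which is why a codeless attach cannot add them.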
Now if Bob's queries are slowing down the entire platform, it's a trivial matter to track this down and fix Bob's custom report, whose SQL query is running SELECT * FROM "MassiveReportView" and killing the entire server.
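Once the telemetry carries user names, "tracking down Bob" is basically a group-by. A minimal sketch, assuming the query samples have already been exported from the APM into Python objects; the `QuerySample` shape and both function names are hypothetical, not any product's export format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QuerySample:
    """One logged query execution (hypothetical shape for illustration)."""
    user_name: str
    statement: str
    duration_ms: float
    bytes_read: int


def rank_users_by_load(samples):
    """Return (user_name, total_duration_ms, total_bytes_read) tuples,
    sorted by total duration with the heaviest user first."""
    totals = defaultdict(lambda: [0.0, 0])
    for sample in samples:
        totals[sample.user_name][0] += sample.duration_ms
        totals[sample.user_name][1] += sample.bytes_read
    ranked = [(user, t[0], t[1]) for user, t in totals.items()]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


def heaviest_statement(samples, user_name):
    """Return the slowest statement run by `user_name`, or None if there is none."""
    own = [s for s in samples if s.user_name == user_name]
    if not own:
        return None
    return max(own, key=lambda s: s.duration_ms).statement
```

Call `rank_users_by_load(samples)` to see who dominates the load, then `heaviest_statement(samples, that_user)` to get the query to fix.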
Troubleshooting, performance, security, etc... are all end-to-end things. Nobody can work in isolation and expect a good end result.
FroshKiller | 3 months ago
If you put that responsibility on the developer--meaning you expect the dev to diagnose an issue that they introduced in the first place--what kind of result do you think you're going to get?
Layering these demands takes away from the overall quality of the application in my experience. You want an app developer to learn all about Prometheus so the app can have an endpoint with all these custom metrics, okay, and you want structured logging and expect the dev to learn how to use Kibana effectively? All that's a huge cognitive burden that eats a slice of the same pie (their brains) as domain knowledge, language & runtime knowledge, etc.
Get maybe one app developer to specialize, get maybe one app developer to cross-train with ops or monitoring even. But leave most of us out of it.
When you flip that expectation of developer involvement in operations, it exposes how unreasonable that arrangement is. Hey, DBA, the app is sucking up resources. Why don't you crack open an IDE and write a patch for it? What do you mean you don't know Go, what do you mean you don't use Git? Every DBA should know how to attach a debugger to a remote process, shouldn't they?
It's just exploitative. Or at least that's been my experience, so there's my bias.