
Things I want from Devs as SRE/DevOps

198 points| oschvr | 3 years ago |oschvr.com | reply

143 comments

[+] chrsig|3 years ago|reply
> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.

I've grown unfond of this attitude. I most certainly don't own it. I have no IP rights to it at all. We're both being paid to solve different facets of the same problem. Coming at me with "this is your problem" isn't going to foster a collaborative environment with me. Which is much more pleasant than an adversarial environment.

Also: I'm not the only one that knows how it works, it's been peer reviewed in no small part to reduce my bus factor. All documentation requested is perfectly reasonable, and should be part of the organization's standard operating procedure.

If it's not part of the SOP, then no, you won't have those things. You need to work at a cultural level to change that, and for that you're much better off making allies than anything else. Make it clear how those things help you, and what you'll do to make the developers' lives easier when you don't need to worry about the basics. If altruism fails you, you can usually count on people to act in their own best interests.

[+] klodolph|3 years ago|reply
> We're both being paid to solve different facets of the same problem. Coming at me with "this is your problem" isn't going to foster a collaborative environment with me.

SRE here.

My takeaway from this is: If you want SRE support running this service, then you need to provide SREs with knowledge of how the system works. As long as only the devs have this knowledge, it's a bit unfair to put the SREs on the hook for supporting it.

Maybe I'm reading between the lines too much--the wording in the article is sloppy at best, and at worst, it doesn't actually say what I'm saying.

It's nice that your code has been through peer review and other people on your team know how it works too. That's less helpful for the SREs running it. SREs bear the burden of the pager--sometimes getting woken up at odd hours of the night to fix problems that were, in a sense, created by developers.

The SOP for getting SRE support for new services should include things like runbooks and design reviews. SREs should be in the loop when you figure out what metrics to expose from your service, because SREs will be the ones using those metrics to figure out the alerting systems. Very few companies have decent "SOP" for SRE support--there are a few companies which are really good at it, like Google, and then a long tail of companies which dump services on SREs without including SREs in the process.

IMO--the right thing to do is to give SRE teams the power to say "no" and refuse to take the pager for any particular service, barring exceptional circumstances. There's a deeper discussion to be had about why this should be the case--basically, devs and SREs have different incentives, and neither team should be put in a subordinate position to the other, because both teams have goals that support the business.

[+] hayst4ck|3 years ago|reply
Ownership is important to be explicit about, less as a means of assigning blame, but more as a means of coordination and resource allocation.

The author is using ownership as a tool to avoid responsibility, and is thus creating an `us vs them` mindset rather than an `us vs the problem` mindset.

Having a strong definition of ownership (like committing your organizational structure to your monorepo as config file) is invaluable for building tooling.

If you have a strong definition of code ownership it allows for things like people less familiar with a particular piece of code being able to make changes with the approval of the owners, while simultaneously notifying them of the change.

Likewise, if you are working on a platform that multiple teams use, you can write tooling that automatically assigns bugs or tickets.
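
As a rough illustration of the ownership-as-config idea above, here is a minimal Python sketch. The paths, team names, and the `OWNERS` map are all hypothetical; a real monorepo would load this from a checked-in config file rather than a literal dict. It resolves the owner by longest matching path prefix and groups changed files by team so a tool could notify or assign them.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional

# Hypothetical ownership map: path prefix -> owning team.
# In practice this would be parsed from a config file committed to the repo.
OWNERS: Dict[str, str] = {
    "services/billing": "team-billing",
    "services/billing/invoices": "team-invoicing",
    "platform/queue": "team-platform",
}


def owner_for(path: str, owners: Dict[str, str] = OWNERS) -> Optional[str]:
    """Return the owner of the longest matching path prefix, or None."""
    parts = PurePosixPath(path).parts
    # Try the most specific prefix first, then walk up toward the root.
    for end in range(len(parts), 0, -1):
        prefix = "/".join(parts[:end])
        if prefix in owners:
            return owners[prefix]
    return None


def route_changes(changed_paths: List[str]) -> Dict[str, List[str]]:
    """Group changed files by owning team so each team can be notified."""
    routed: Dict[str, List[str]] = {}
    for path in changed_paths:
        team = owner_for(path) or "unowned"
        routed.setdefault(team, []).append(path)
    return routed
```

Anything that falls under no prefix is grouped as `"unowned"`, which is itself a useful signal for tooling.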

Ownership problems and "us vs them" is a clear sign of poor leadership. Most devs that experience it become cynical or hostile without being able to understand that it is leadership that failed them.

[+] chidog12|3 years ago|reply
"I've grown unfond of this attitude. I most certainly don't own it. I have no IP rights to it at all. We're both being paid to solve different facets of the same problem."

If you are a dev on the team that owns that service then it's you and your team's responsibility to answer all of these questions... Even the org's SOP would end up reaching back to the team who owns the service if problems arise...

[+] scarface74|3 years ago|reply
> We're both being paid to solve different facets of the same problem

Code that is not running correctly in production is worthless. If you write code and haven’t thought about all of the implications that it takes to make it run correctly you haven’t produced business value.

Yes, I have been a developer for 25+ years professionally. But for the last 10, I’ve also thought through all of the topics that the author has delineated in his article.

Yes, I consider myself to be a competent “DevOps” engineer as long as it is on AWS, and can go from an empty AWS account to a fully functional infrastructure using IAC, a CI/CD pipeline, monitoring, alerting, centralized logging etc.

Knowing that I will either be the person doing the “DevOps” or working with the person who is informs the design of my development.

[+] runamok|3 years ago|reply
It's a question of aligned incentives. If your team (not you necessarily) only has a priority to ship features, then your team needs to be alerted for performance and reliability issues, because you are thus incentivized to fix said issues. If these issues are thrown over the fence to some other team who doesn't know your codebase or have the domain knowledge to improve said code, then things (usually) never get fixed.

Likewise error budgets that say "your service has not been reliable enough this period so the next sprint will be dedicated to improving that and no new features will ship" is another way to make sure quality is not an afterthought.
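
To make the error budget rule concrete, here is a small Python sketch. The SLO value and request counts are made-up numbers, and the "next sprint" policy is just the one described above, not a standard formula.

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the period's error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_failures = (1.0 - slo) * total
    return 1.0 - failed / allowed_failures


def next_sprint_focus(slo: float, total: int, failed: int) -> str:
    """Apply the policy: an exhausted budget means a reliability sprint."""
    if error_budget_remaining(slo, total, failed) <= 0:
        return "reliability"
    return "features"
```

With a hypothetical 99.9% SLO and 1,000,000 requests, roughly 1,000 failures are allowed in the period; 500 failures leaves about half the budget, while 1,500 overspends it and flips the next sprint to reliability work.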

[+] badrabbit|3 years ago|reply
"Own" in this context means be responsible for. Peer review does not transfer ownership to peers but it does reduce the risk of your code breaking. Other people should not understand your own code (and intent) better than you do. So long as that holds, while everyone in dev/ops shares responsibility for delivering the service, you own and are responsible for the part of that service delivery that you authored, simply because you are the person most qualified to resolve any issues that arise from that code breaking. If an SRE is keeping your code running then they own the code's uptime in as much as you have been able to communicate the parameters and configuration of the app, but at the end of the day, when there is a bug in the code you are the best person to fix it, so you own it.
[+] disruptiveink|3 years ago|reply
Isn't this (you devs own your app, I don't give a shit about it, I just want it to not page me) the classic old school "Sysadmin" approach that supposedly DevOps was supposed to counter? How can anyone say this and claim to be doing anything remotely near DevOps?
[+] Galanwe|3 years ago|reply
Agree 100%

Although it's sane that developers keep contact with the reality of maintaining the live applications they wrote, it just doesn't scale to ask them to fully support them.

There is an infinite amount of maintenance for any live system. No service of any magnitude just "works" in production indefinitely, at the very least because this service interacts with others that will fail.

If developers are responsible for every live system they publish, they will get locked in after a finite number of published services, and leave because of maintenance boredom.

There needs to be a reasonable amount of documentation written, explanations given, level 2 support taken, but that's it, the maintenance is for ops.

[+] treeman79|3 years ago|reply
Dealt with this at a company.

Eventually the developers just gave up on all development work and became operations. The actual operations team kept the k8s cluster alive. Developers had to do everything else.

Eventually people would get the hang of the operations side but by that point they were burnt out and quit.

[+] bambax|3 years ago|reply
> If it's not part of the SOP, then no, you wont have those things

Isn't this "adversarial" as well? Why would you withhold that information just because the SOP doesn't make you provide it? What will happen then is that eventually the service will break, nobody will know how to fix it, and they will come and ask you.

If you're no longer there, the service will be decommissioned and all your work will have been in vain. I don't see a net benefit for any of this.

[+] hayst4ck|3 years ago|reply
This questionnaire is kind of foreign to me since I see an SRE's job, more or less, as defining interfaces and then forcing everything to adhere to them (politically or manually).

These are the questions I find useful:

  "How is capacity for the service allocated right now?"
  "How is software updated right now?"
  "How was the last outage handled in as much detail as possible?"

From there, just about everything answers itself with a couple days of reading code and poking at machines, particularly from the output of `lsof` (log files, config files, what the service talks to).

Half of these questions could be answered with grep and once you get proficient at grep, you can answer questions faster, and more importantly, more accurately than the people who work on the services themselves.

> that YOU wrote, only YOU know how it works, thus YOU own.

I find this attitude pretty toxic. If you are in an SRE vs Product Dev mindset, then you have bigger battles to fight than service manipulation.

[+] VMtest|3 years ago|reply
> From there, just about everything answers itself with a couple days of reading code and poking at machines, particularly from the output of `lsof` (log files, config files, what the service talks to).

> Half of these questions could be answered with grep and once you get proficient at grep, you can answer questions faster, and more importantly, more accurately than the people who work on the services themselves.

SREs can own the whole development process you mean?

edit: in HN you probably want to use IntelliJ for everything, don't even mention grep please, they don't know what that is

[+] tayo42|3 years ago|reply
I don't get why SRE is a job (and it was my title for years). The stuff listed is just good software engineering. If a SWE can't figure out that they need to monitor their application (or really anything on this list), they have no business being anything other than a junior programmer.

These kinds of responsibilities create this weird scenario now where the team SRE is the team's babysitter. Which just leads to the ops vs dev bullshit we've seen before. Toxic right off the bat.

[+] 0x457|3 years ago|reply
> I don't get why SRE is a job (and it was my title for years). The stuff listed is just good software engineering. If a SWE can't figure out that they need to monitor their application (or really anything on this list), they have no business being anything other than a junior programmer.

Someone has to enforce those good practices. Weak engineers hire more weak engineers and they suck at their job.

[+] chrismarlow9|3 years ago|reply
In my experience outside of enterprise oracle type places it's just a label. You still work on service level code, you also work on architecture, you also do infrastructure and monitoring and really all the stuff. From the places I've worked that aren't red tape bound the title really just means "we need someone who we can give a business problem to and they will solve it in an efficient way without needing a full team or re inventing wheels, and we need to be sure it's going to keep making money for us with stability". All the other rigor is just job description fluff to attract talent.
[+] kubectl_h|3 years ago|reply
Yeah sure that sounds great in principle but at the end of the day someone has to be in charge of tracking down where that new user_uuid label in the prometheus metrics came from and find the team responsible and explain why that is a bad idea.
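
For anyone wondering why a `user_uuid` label is a bad idea: each distinct combination of label values becomes its own time series. A back-of-the-envelope Python sketch, with made-up label counts, treating the product of distinct values per label as an upper bound (not every combination necessarily occurs in practice):

```python
from math import prod
from typing import Dict


def series_upper_bound(distinct_values: Dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Every combination of label values is a separate series, so the bound
    is the product of the number of distinct values per label.
    """
    return prod(distinct_values.values())


# Hypothetical request counter with two low-cardinality labels.
before = series_upper_bound({"method": 5, "status": 6})

# Same metric after someone adds a per-user label (100,000 users assumed).
after = series_upper_bound({"method": 5, "status": 6, "user_uuid": 100_000})
```

Under these assumed numbers the bound goes from 30 series to 3,000,000 for a single metric, which is the kind of jump someone has to notice and trace back to a team.
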
[+] hnarn|3 years ago|reply
> I don't get why SRE is a job

I've met CTOs that would agree with you. I no longer work with any of them.

[+] mberning|3 years ago|reply
People are complaining about the idea that the developer is ultimately the owner of any service they wrote.

I don’t see how this is even controversial. Consider the case where a SRE is responsible for 5 or 10 such systems. They could never be expected to know as much about those systems as the people that wrote them.

Now if there is a one to one relationship between SREs and systems then it might make sense to expect that level of understanding from the SRE.

In my experience it would be a great privilege to have a dedicated SRE to your application.

[+] dilyevsky|3 years ago|reply
You say that’s not controversial but irl I’ve worked with more than a handful of non-jr engineers and even managers who think developer job ends at seeing green build in ci (btw a ci which some other team is supposed to manage). Sometimes even green build locally
[+] dchftcs|3 years ago|reply
"Owner or not" is a false dichotomy. I get it that the author and many SREs are probably jaded from developers not taking ownership at all.

The right attitude is to figure out processes that let people draw a line when to go to DevOps, and when to escalate to developers. Developers need to understand the costs they impose on devops and organizations need to make sure developers are empowered to fix their own issues, rather than to be constantly chased around to business requirements.

Developers ultimately answer to business priorities, and they don't necessarily own the business processes that demand their support. If developers are given ample resources to keep bugs out of systems, document operational expectations and respond to incidents, then the developers can "own" the processes better. If not, it's a management problem that is just of the same nature as the usual SRE complaint that developers don't want to own anything at all.

[+] Adiqq|3 years ago|reply
For me SRE/DevOps is just support for developers. It's possible that such person has more knowledge/experience in development and operations, but in general their focus is on infrastructure, automation and general troubleshooting.

They might know how to build/test/deliver/monitor some solution, they might know to some degree how to configure solution (but developers should support them with it and describe it well), how to script some operations, however they definitely won't write bugfix themselves.

[+] lamontcg|3 years ago|reply
Can someone explain to me how this is any different of a mentality from system engineers that SRE replaced?

I haven't read the SRE book, but my understanding was that at Google the answer to all this would be that the SRE would act as a software developer and submit pull requests to the codebase in order to implement/fix all of this?

> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.

And my own take on this statement which is getting so much traction in the comments is that this seems largely indistinguishable from the wall between Dev and Ops that we had back in the late 90s.

[+] namdnay|3 years ago|reply
I don't know when exactly "DevOps" became a new role equivalent in every way to the old "Ops"? My original understanding of the term was that developers did their own operations
[+] devjab|3 years ago|reply
I don't think there is a difference. I also think this is part of the reason why so many companies now don't have an Ops department. My own is an example of this, we've outsourced the Operations part to another company because our Developers are doing most of the Ops that isn't related to networking anyway. I'm not saying this is good by the way. It's just that developers tend to get caught in the "company" part of "company vs IT" because developers want to make stuff work first, and work correctly second, whereas IT has always been the other way around.

I don't necessarily disagree with everything the author writes by the way. There are a lot of good points in there about building things to be operational, but at the same time, what good is an operations department if it can't actually operate its systems? I know the answer to this often becomes buying third-party software or "standard" systems, but as more and more businesses are realizing, that's often worse than simply using a lot of interconnected excel sheets (which doesn't really scale).

It'll be interesting to see what happens once the newer waves of project managers and it-business partners learn from the successes of companies that build in-house software instead of going to "standard" systems where they'd need the in-house developers anyway to make the un-Godly amount of APIs and data-transfers work. At least here in Denmark, companies like Lego and Vestas are doing some really groundbreaking money making at a much lower cost, by not going to "standard" systems for everything. Not that you should never use standard systems, there are some things that are shared across businesses after all, but there are typically also a lot of things that just won't fit into some internationally shared box well enough for it to work out as a net bonus.

[+] eyelidlessness|3 years ago|reply
Maybe I’m being overly pedantic, but… these are questions DevOps engineers should be able to answer themselves because they’ve contributed to the answers. I understand that DevOps has basically become a euphemism for ops + automation complexity that requires product-equivalent engineering talent + arcane knowledge of a zillion cloud vendors’ … everything. But can we go back to calling that ops?

I actually liked the DevOps-as-in-devs-also-ops as a forcing function to keep deployment relatively simple because it’s very low on the core competency/value proposition spectrums. It also has the benefit of rewarding companies for making that feasible at the expense of a tiny fraction of the cost of dedicated ops roles.

[+] hnarn|3 years ago|reply
I work as an SRE and while I agree with the "list of questions" as a general template for collaboration, I strongly disagree with the point that developers "own" the applications.

If you work in the same company, you all own the application. The customers don't care that you're "only" the SRE, or "only" the sales guy. This type of attitude is toxic and should be challenged categorically.

If you, the SRE, do not have the information needed (i.e. the "list of questions") then it's as much your responsibility to ask for it as it is the developers' job to help you answer it.

If you feel that the company culture makes it impossible for you to create these necessary processes so that everyone has the information they need, you need to either work towards changing that culture or get a new job.

[+] mianos|3 years ago|reply
This list is exactly what we try to deliver to operations in our firm. All very reasonable.

You know why you "rarely get an answer for straight away "? I assume because they are working on the next ticket/delivery. A lot of this stuff is not estimated properly. A way to get it estimated properly is to work with the devs, cooperatively.

This said, for some reason, this blog post seems adversarial and gives me a bad vibe. Instead of "List of questions I’d like to get an answer from devs", it should be "we should work together to get these things done".

[+] dsr_|3 years ago|reply
This is exactly the sort of requirements list that a dev group would receive from an ops group back when ops were systems administrators and network engineers.

And I am not objecting to it in the least; these are all good and vital questions.

I am objecting to anyone claiming that DevOps is anything other than "using the kinds of tools that help software development projects to help operations", and I present this as absolute evidence.

[+] spmurrayzzz|3 years ago|reply
This sentiment seems related to an observation I've been making more frequently as of late on this topic.

Before DevOps was en vogue (i.e. was a descriptive term more so than a buzz word), the whole premise was to collapse the bulwark between engineers and sys admins. All SWE's should care about how their application is deployed, monitored, and scaled in production. This leads to far better application engineering outcomes in most efforts in which I've been involved.

The end result of those efforts was often, but not always, engineers writing some amount of operations tooling themselves.

But now we've come full circle. There is a ton of operations tooling you can pull off the shelf, and those tools are generic/complex enough to require administration. So many DevOps roles now as a result, particularly in larger orgs, are mostly administration-focused and less so about building the tooling itself.

It feels like we've reinvented the bulwark we tried to escape previously. There's an open question as to whether, from a practical perspective, we still have gained a net win there irrespective of the logical separation between eng and ops. I'm not sure where I've landed yet on that question.

[+] mediascreen|3 years ago|reply
My answer to about half of those questions would probably be: "How would YOU like it to work? You are the expert on our systems and I would like to know what you consider best practices. Give me some guidance on how we run things here and I will do my best to set it up that way. If my application is very special and needs special considerations I will contact you to figure out a way that works for both of us."
[+] kubectl_h|3 years ago|reply
I moved from full stack eng to SRE/DevOps a couple of years ago but have the least enjoyable role of straddling the two. And while I think this post surfaces some good points I can tell you that deep in the heart of every SRE/DevOps engineer that didn't come from a software dev background -- all they truly want is to get paid 250K a year to administer a system that literally does nothing and thus never breaks and this desire is the subtext to and informs every interaction they have with the engineering team.
[+] davewritescode|3 years ago|reply
I personally think it’s very difficult to be an SRE who doesn’t come from a software engineering background. To me an SRE needs to come from a background that includes extensive coding and architecture experience.

To me, the worst SREs are the folks who come from the DevOps side whose experience is limited to pipelines and infrastructure as code type stuff. They invent solutions that just don’t work.

[+] slyall|3 years ago|reply
and every dev just wants to get paid $250k/year to write some code that solves the problem on their machine, hit deploy, close their laptop and go home.

Random stereotypes might be funny but they are not useful in getting stuff done.

[+] t-writescode|3 years ago|reply
I am very opinionated about what SRE and DevOps own vs what devs own; and, I didn't really have anything negative to say about my (admitted) skim of the article.

As an SWE, I want to and need to know how to provide metrics on my system to be able to understand its health, and I should have good safeguards in place, or at least have communicated with the SREs what I need to provide to them to help them have good safeguards in place, to make sure the application keeps running. If the application goes down, it's my responsibility to make sure it's not my fault (bug in application code) that caused the system to fail.

What I, an SWE, want out of an SRE, though, is infrastructure management. I want to be able to ask them for some queues, and for a redis instance with high availability. I want them to set up the Kafka cluster, the database. I want us to have a conversation about where the secrets are to be stored. I want to be able to ask them what I need to do in code to get a secret and use it. I want them to be able to give me a good template for k8s deployments - or maybe to pair with them, given the docker containers and sidecars I need for a deployment and the projected scaling I'll need and come out with a best-practices set of k8s deployments.

I would be grateful if they monitor the database for some horrible queries; and, use their knowledge of which deployments made that bad query, to file a ticket to the right team so they fix their code or add an index or whatever is necessary.

Infrastructure, be it k8s or nomad, configuring redis, making rabbitmq highly available, configuring and organizing (especially organizing) k8s deployments into something sane and logical, and so many other things related to infrastructure are as specialized of skills as writing high-performance or unusually architected, large systems. I've seen the systems that come up when SWE-on-assignment create infrastructure; and, I've seen the literal years of work SREs have in their backlog to fix it with best practices.

It's similar to front-end developers: it's an entirely different skill set; and, while each person in each tier can stumble around in the other tiers, it's way better if we are all there, working together toward a common goal, and especially focusing in the areas we have each specialized our craft.

addendum: of course there are exceptions; but I think those exceptions are 1 in 100 or 1 in 1000.

[+] deathanatos|3 years ago|reply
> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.

Like this is the single biggest truth in the article, and I'm glad to see it stated so clearly. Shout it from the rooftops, please. It's a direct logical consequence, too — and yet, so many people seem to make decisions that violate this truth.

I field so many questions about "why is service X doing Y?" Have you asked the service owners?

Unfortunately, I've found one more or less has to become proficient in rapidly understanding services you don't own, because getting other people to act logically is a fool's errand.

> Are you logging to stdout ?

Nooooo to stderr, that's literally what it is there for. (As C says, "for writing diagnostic output". Logs are that.) Also, it is sometimes buffered and you don't (IMO) really want that.

Any output producing program requires stdout for the output, and you can't co-mingle logs with that and have piping still work. While it is unlikely that your production service is producing output, there's no reason to do anything different with the logs. (I'd say a part of being a good production service is "don't be needlessly special".)

(But our tooling will just capture and mux the two streams together, too, so it doesn't matter, unless buffering means the error logs don't make it right before your service is killed.)
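
A minimal Python sketch of the stdout/stderr split described above, using only the standard library. The logger name and the `result=42` output line are placeholders: logs go to stderr through `logging.StreamHandler`, and the program's actual output goes to stdout so piping still works.

```python
import logging
import sys


def configure_logging() -> logging.Logger:
    """Send diagnostic logs to stderr, leaving stdout free for program output."""
    logger = logging.getLogger("service")  # placeholder logger name
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(sys.stderr)  # stderr is also the default
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.handlers = [handler]  # replace any earlier handlers
    logger.propagate = False     # avoid duplicate lines via the root logger
    return logger


def main() -> None:
    log = configure_logging()
    log.info("starting up")
    print("result=42")  # real program output: stdout only
    log.info("done")
```

Calling `main()` from a script named, say, `service.py` and running `python service.py > out.txt` would leave only `result=42` in the file while the log lines still show up on the terminal.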

Also, your infra team provides the metrics service, but you need to capture your own metrics. My metrics provider does not have a crystal ball, it cannot peer into your service's memory and pull out critical stats. You must push them yourself. Talk to your infra team, they can show you the API they use… (We collect common, machine level stats, like "CPU in use" or external things about your service that are easily visible, like per-container memory usage. But not your reqs/sec.)
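
To illustrate "capture your own metrics": a small Python sketch of a request counter that the service increments itself and periodically flushes as a requests/sec figure. The class and method names are hypothetical, and how the flushed value reaches the metrics service depends entirely on whatever API your infra team provides, so that part is deliberately left out.

```python
import threading
import time
from typing import Callable


class RequestCounter:
    """Thread-safe request counter owned and incremented by the service itself."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._lock = threading.Lock()
        self._count = 0
        self._window_start = clock()

    def record(self) -> None:
        """Call once per handled request."""
        with self._lock:
            self._count += 1

    def flush(self) -> float:
        """Return requests/sec since the last flush and start a new window."""
        with self._lock:
            now = self._clock()
            elapsed = now - self._window_start
            rate = self._count / elapsed if elapsed > 0 else 0.0
            self._count = 0
            self._window_start = now
            return rate
```

A request handler would call `record()` on every request, and a background timer would call `flush()` and hand the result to the metrics client.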

[+] rad_gruchalski|3 years ago|reply
Cool, turn it into a set of requirements and put up as part of the definition of done.

Questions in this form always seem condescending. Like “I‘m smarter than you, I thought about it, you didn’t”.

[+] xarthna|3 years ago|reply
This is exactly what I came to express.

If this isn't standardized in an organization it should be. Otherwise, it's the same repetitive questions, the same finger pointing, and the same miscommunication. If these are the requirements needed to put a service into production, then make it explicit. As the developer, of course I own the service, but (usually) don't have the access. Standardized as requirements, both teams can work together to produce, monitor, and troubleshoot production services smoothly. Then nobody is surprised when it is release day and they are asked these questions, with an impatient PM who has already publicly set expectations.

[+] mattpallissard|3 years ago|reply
This comment section was exactly what I expected. A mirror of how most folks in the trenches discuss these murky boundaries.

  * SRE/DevOps folks stating the person that wrote the application has the knowledge to debug it.

  * Devs saying that it's SRE/DevOps job to debug it

  * Lots of comments on culture and you should do X

I know most people like the whole grassroots thing, but the only shops I've seen that are actually killing it are the ones who dictate these boundaries and responsibilities from the top down. And I've seen a lot of shops.
[+] jamesrom|3 years ago|reply
This is completely backwards. As someone that has been an SRE and DevOps engineer.

Almost all of the questions can be simply answered with: "This is a NFR that was created by SRE".

The important thing is to collaborate with each team and be there when architectural and design decisions are being made in the first place!

All of these questions are post-hoc, coming after the thing has been built. You would never need to ask these questions, if you help drive initial design.

Embed yourself with your teams. Ask to be part of design discussions. Remember: 50% eng 50% ops. You have no excuse!

[+] KronisLV|3 years ago|reply
> Embed yourself with your teams. Ask to be part of design discussions.

I agree that this should happen, most successful projects have people with all sorts of knowledge contributing to it, without too many silos in place.

> You have no excuse!

However, the Ops people don't always get that power or a say in the matter. In many dysfunctional environments they'll simply be given an apparently finished service and will be told to put it in prod.

Please don't dismiss that these circumstances exist altogether and don't shift the "blame" exclusively on the people who already have their lives be needlessly hard, this isn't likely to encourage a positive outlook.

[+] RcouF1uZ4gsC|3 years ago|reply
It seems that these questions should basically be answered once per company.

All services should have common health endpoints and shutdown operations.

Logging should be standardized across all the services of a company.

Having bespoke answers to these questions for each service will rapidly devolve into chaos, when you have multiple services deployed.
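
As a sketch of what a company-wide convention could look like, here is a standard-library Python example of a health endpoint plus a shutdown path triggered by SIGTERM. The `/healthz` path, port 8080, and the JSON body are assumptions for illustration, not something the article or any standard prescribes.

```python
import json
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":  # assumed company-wide convention
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt: str, *args) -> None:
        pass  # silence per-request access logs in this sketch


def make_server(port: int = 0) -> ThreadingHTTPServer:
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)


def main() -> None:
    server = make_server(8080)  # assumed port
    stop = threading.Event()
    # The signal handler only sets a flag; shutdown happens in the main thread.
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())
    signal.signal(signal.SIGINT, lambda signum, frame: stop.set())
    worker = threading.Thread(target=server.serve_forever, daemon=True)
    worker.start()
    stop.wait()             # block until SIGTERM or Ctrl-C
    server.shutdown()       # stop the serve_forever loop
    server.server_close()   # release the listening socket
```

Calling `main()` runs the server until it receives SIGTERM or SIGINT. The server loop runs in a worker thread because `shutdown()` must be called from a different thread than `serve_forever()`.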

[+] blacklion|3 years ago|reply
Are you really DevOps if you need to write such rants? Are you really DevOps if your company has Devs?

I've thought, that DevOps by definition is developer and operations in one. You wrote service, you support service, and there is no boundary, and there is no such problem as described in this text, by definition.

DevOps complains about problem, proposed solution for which is to be DevOps...

[+] Joel_Mckay|3 years ago|reply
"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure." ( Melvin E. Conway, https://en.wikipedia.org/wiki/Conway%27s_law )

This is unfortunately the death knell for DevOps organizational teams on large projects. Primarily, the design specification usually ends up being hammered into the inherent dysfunction the project was intended to solve in the first place.

Best of luck =)

[+] mkl95|3 years ago|reply
I agree with some of the points, but on the other hand most organizations do not really empower SDEs to reason about architecture. Things like production budgets and production grade monitoring and observability are usually owned fully by SRE/devops, and if some enterprise architect type is involved devs won't even own the spec. At those places, devs can at best make a wild guess of what the expectations are. Responsibility should be proportional to power.