
The future of software engineering is SRE

265 points| Swizec | 1 month ago |swizec.com

138 comments


v_CodeSentinal|1 month ago

Hard agree. As LLMs drive the cost of writing code toward zero, the volume of code we produce is going to explode. But the cost of complexity doesn't go down—it actually might go up because we're generating code faster than we can mentally model it.

SRE becomes the most critical layer because it's the only discipline focused on 'does this actually run reliably?' rather than 'did we ship the feature?'. We're moving from a world of 'crafting logic' to 'managing logic flows'.

ottah|1 month ago

I dunno, I don't think in practice SRE or DevOps are even really different from the people we used to call sys admins (former sysadmin myself). I think the future of mediocre companies is SRE chasing after LLM fires, but I think a competitive business would have a much better strategy for building systems. Humans are still by far the most efficient and generalized reasoners, and putting the energy intensive, brittle AI model in charge of most implementation is setting yourself up to fail.

mupuff1234|1 month ago

> But the cost of complexity doesn't go down

But how much of current day software complexity is inherent in the problem space vs just bad design and too many (human) chefs in the kitchen? I'm guessing most of it is the latter category.

We might get more software but with less complexity overall, assuming LLMs become good enough.

wavemode|1 month ago

By "SRE", are people actually talking about "QA"?

SREs usually don't know the first thing about whether particular logic within the product is working according to a particular set of business requirements. That's just not their role.

belter|1 month ago

>> As LLMs drive the cost of writing code toward zero

And they drive the cost of validating the correctness of such code towards infinity...

storystarling|1 month ago

I see it less as SRE and more about defensive backend architecture. When you are dealing with non-deterministic outputs, you can't just monitor for uptime, you have to architect for containment. I've been relying heavily on LangGraph and Celery to manage state, basically treating the LLM as a fuzzy component that needs a rigid wrapper. It feels like we are building state machines where the transitions are probabilistic, so the infrastructure (Redis, queues) has to be much more robust than the code generating the content.
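A minimal sketch of the "rigid wrapper around a fuzzy component" idea described above, using only the standard library rather than LangGraph or Celery. The action names, the `run_step` helper, and the `scripted` stand-in for a model call are all hypothetical; the point is only that the wrapper, not the generator, decides which state transitions are legal.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Hypothetical whitelist: the only transitions the wrapper will accept.
ALLOWED_ACTIONS = {"approve", "reject", "escalate"}


@dataclass
class StepResult:
    state: str             # "done" or "failed"
    action: Optional[str]  # validated action, or None on failure
    attempts: int          # how many generations were consumed


def validate(raw: str) -> Optional[str]:
    """Return the action if raw is a JSON object with an allowed action, else None."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    action = payload.get("action")
    if isinstance(action, str) and action in ALLOWED_ACTIONS:
        return action
    return None


def run_step(generate: Callable[[], str], max_attempts: int = 3) -> StepResult:
    """Call the non-deterministic generator until its output validates or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        action = validate(generate())
        if action is not None:
            return StepResult("done", action, attempt)
    # Containment: nothing unvalidated ever leaves this function.
    return StepResult("failed", None, max_attempts)


def scripted(outputs: Iterable[str]) -> Callable[[], str]:
    """Stand-in for a model call that replays canned outputs in order."""
    it = iter(outputs)
    return lambda: next(it)


if __name__ == "__main__":
    fake_model = scripted(["not json", '{"action": "delete_everything"}', '{"action": "approve"}'])
    print(run_step(fake_model))
```

In a real system the "failed" state would presumably route to a queue or a human rather than being returned inline; that part is left out here.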

franktankbank|1 month ago

This sounds like the most min maxed drivel. What if I took every concept and dialed it to either zero or 11 and then picked a random conclusion!!!??

solatic|1 month ago

I think there's two kinds of software-producing-organizations:

There's the small shops where you're running some kind of monolith generally open to the Internet, maybe you have a database hooked up to it. These shops do not need dedicated DevOps/SRE. Throw it into a container platform (e.g. AWS ECS/Fargate, GCP Cloud Run, fly.io, the market is broad enough that it's basically getting commoditized), hook up observability/alerting, maybe pay a consultant to review it and make sure you didn't do anything stupid. Then just pay the bill every month, and don't over-think it.

Then you have large shops: the ones where you're running at the scale where the cost premium of container platforms is higher than the salary of an engineer to move you off it, the ones where you have to figure out how to get the systems from different companies pre-M&A to talk to each other, where you have N development teams organizationally far away from the sales and legal teams signing SLAs yet need to be constrained by said SLAs, where you have some system that was architected to handle X scale and the business has now sold 100X and you have to figure out what band-aids to throw at the failing system while telling the devs they need to re-architect, where you need to build your Alertmanager routing tree configuration dynamically because YAML is garbage and the routing rules change based on whether or not SRE decided to return the pager, plus ensuring that devs have the ability to self-service create new services, plus progressive rollout of new alerts across the organization, etc., so even Alertmanager config needs to be owned by an engineer.

I really can't imagine LLMs replacing SREs in large shops. SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.
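As a rough illustration of building an Alertmanager routing tree from code rather than hand-written YAML, the sketch below generates the tree as a plain dict. The `receiver`, `group_by`, `routes`, and `matchers` keys follow Alertmanager's route configuration; the team names, receiver names, and the `sre_holds_pager` mapping are made up for the example, and serialization to the format a given deployment expects is left out.

```python
import json
from typing import Dict, Iterable


def build_route(
    teams: Iterable[str],
    sre_holds_pager: Dict[str, bool],
    default_receiver: str = "sre-pager",
) -> dict:
    """Build an Alertmanager-style routing tree as a dict.

    sre_holds_pager maps a team name to whether SRE is currently on call for it.
    Teams where SRE has handed the pager back are routed to the team's own receiver.
    """
    routes = []
    for team in sorted(teams):
        if sre_holds_pager.get(team, False):
            receiver = "sre-pager"
        else:
            receiver = f"{team}-pager"  # hypothetical per-team receiver name
        routes.append({"matchers": [f'team="{team}"'], "receiver": receiver})
    return {
        "receiver": default_receiver,
        "group_by": ["alertname", "team"],
        "routes": routes,
    }


if __name__ == "__main__":
    tree = build_route(["search", "payments"], {"payments": True, "search": False})
    print(json.dumps({"route": tree}, indent=2))
```

Because the tree is data, self-service onboarding of a new team becomes an entry in a list rather than a hand edit of a shared config file.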

ffsm8|1 month ago

> SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.

According to the specified goals of SRE, this is actually not just a small fraction - but something that shouldn't happen. To be clear, I'm fully aware that this will always be necessary - but whenever it happened - it's because the site reliability engineer (SRE) overlooked something.

Hence if that's considered a large part of the job.. then you're just not a SRE as Google defined that role

https://sre.google/sre-book/table-of-contents/

Very little connection to the blog post we're commenting on though - at least as far as I can tell.

At least I didn't find any focus on debugging. It put forward that the capability to produce reliable software is what will distinguish in the future, and I think this holds up and is in line with the official definition of SRE

tryauuum|1 month ago

those alertmanager descriptions feel scary. I'm stuck in the zabbix era.

what do you mean "progressive rollout of new alerts across the organization"? what kind of alerts?

weitendorf|1 month ago

Having worked on Cloud Run/Cloud Functions, I think almost every company that isn't itself a cloud provider could be in category 1, with moderately more featureful implementations that actually competed with K8s.

Kubernetes is a huge problem, it's IMO a shitty prototype that industry ran away with (because Google tried to throw a wrench at Docker/AWS when Containers and Cloud were the hot new things, pretending Kubernetes is basically the same as Borg), then the community calcified around the prototype state and bought all this SAAS/structured their production environments around it, and now all these SAAS providers and Platform Engineers/Devops people who make a living off of milking money out of Kubernetes users are guarding their gold mines.

Part of the K8s marketing push was rebranding Infrastructure Engineering = building atop Kubernetes (vs operating at the layers at and beneath it), and K8s leaks abstractions/exposes an enormous configuration surface area, so you just get K8s But More Configuration/Leaks. Also, You Need A Platform, so do Platform Engineering too, for your totally unique use case of connecting git to CI to slackbot/email/2FA to our release scripts.

At my new company we're working on fixing this but it'll probably be 1-2 more years until we can open source it (mostly because it's not generalized enough yet and I don't want to make the same mistake as Kubernetes. But we will open source it). The problem is mostly multitenancy, better primitives, modeling the whole user story in the platform itself, and getting rid of false dichotomies/bad abstractions regarding scaling and state (including the entire control plane). Also, more official tooling and you have to put on a dunce cap if YAML gets within 2 network hops of any zone.

In your example, I think

1. you shouldn't have to think about scaling and provisioning at this level of granularity, it should always be at the multitenant zonal level, this is one of the cardinal sins Kubernetes made that Borg handled much better

2. YAML is indeed garbage but availability reporting and alerting need better official support, it doesn't make sense for every ecommerce shop and bank to be building this stuff

3. a huge amount of alerts and configs could actually be expressed in business logic if cloud platforms exposed synchronous/real-time billing with the scaling speed of Cloud Run.

If you think about it, so so so many problems devops teams deal with are literally just

1. We need to be able to handle scaling events

2. We need to control costs

3. Sometimes these conflict and we struggle to translate between the two.

4. Nobody lets me set hard billing limits/enforcement at the platform level.

(I implemented enforcement for something close to this for Run/Appengine/Functions, it truly is a very difficult problem, but I do think it's possible. Real time usage->billing->balance debits was one of the first things we implemented on our platform).

5. For some reason scaling and provisioning are different things (partly because the cloud provider is slow, partly because Kubernetes is single-tenant)

6. Our ops team's job is to translate between business logic and resource logic, and half our alerts are basically asking a human to manually make some cost/scaling analysis or tradeoff, because we can't automate that, because the underlying resource model/platform makes it impossible.

You gotta go under the hood to fix this stuff.
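A toy sketch of the usage -> billing -> balance-debit idea with a hard limit, to make the cost/scaling tradeoff above concrete. This is not the platform described in the comment; `Account`, `debit_usage`, and `can_scale_up` are hypothetical names, prices are invented, and a real implementation would have to deal with concurrency and metering delay, which is where the difficulty lies.

```python
from dataclasses import dataclass
from decimal import Decimal


class LimitExceeded(Exception):
    """Raised when a debit would push the balance below the hard limit."""


@dataclass
class Account:
    balance: Decimal
    hard_limit: Decimal = Decimal("0")  # balance may never drop below this


def debit_usage(account: Account, units: int, unit_price: Decimal) -> Decimal:
    """Debit metered usage immediately; refuse the debit if it breaches the limit."""
    cost = Decimal(units) * unit_price
    if account.balance - cost < account.hard_limit:
        raise LimitExceeded(f"debit of {cost} would breach limit {account.hard_limit}")
    account.balance -= cost
    return account.balance


def can_scale_up(
    account: Account,
    extra_instances: int,
    price_per_instance_hour: Decimal,
    horizon_hours: int = 1,
) -> bool:
    """Scaling decision expressed as business logic: can the balance cover the projected cost?"""
    projected = Decimal(extra_instances) * price_per_instance_hour * Decimal(horizon_hours)
    return account.balance - projected >= account.hard_limit


if __name__ == "__main__":
    acct = Account(balance=Decimal("10.00"))
    debit_usage(acct, 3, Decimal("2.50"))
    print(acct.balance, can_scale_up(acct, 1, Decimal("2.00")))
```

With something like this in place, an alert that today asks a human "should we scale or save money?" could in principle become a function call.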

augusteo|1 month ago

stackskipton makes a good point about authority. SRE works at Google because SREs can block launches and demand fixes. Without that organizational power, you're just an on-call engineer who also writes tooling.

The article's premise (AI makes code cheap, so operations becomes the differentiator) has some truth to it. But I'd frame it differently: the bottleneck was never really "writing code." It was understanding what to build and keeping it running. AI helps with one of those. Maybe.

nasretdinov|1 month ago

> because SREs can block launches and demand fixes

I didn't find that particularly true during my tenure, but obviously Google is huge, so there probably exist teams that actually can afford to behave this way...

pcj-github|1 month ago

If the agent swarm is collectively smarter and better than the SRE, they'll be replaced just like other types of workers. There is no domain that has special protection.

ottah|1 month ago

The models are not smarter than us by far. Have you not run into issues with reasoning and comprehension with them? They get confused, they miss big details, build complicated code that's ineffective. They don't work well at tasks that require a larger holistic understanding of the problem. The models are weak, brittle reasoners, because they have an indirect and contradictory understanding of the world. We're several breakthroughs away and several hardware generations from having models that are robust reasoners for grounded, non-kind problems.

measurablefunc|1 month ago

What about C-suite executives & shareholders? Are they safe from automation?

bronlund|1 month ago

My thoughts exactly. This is just some guy grasping at straws before he understands that he will have to bow to our new overlords sooner or later.

Edit: Or maybe he is fully aware and just needs to push some books before it's too late.

whoamii|1 month ago

There absolutely is. Sports.

joshuaisaact|1 month ago

Couldn't disagree with this article more. I think the future of software engineering is more T-shaped.

Look at the 'Product Engineer' roles we are seeing spreading in forward-thinking startups and scaleups.

That's the future of SWE I think. SWEs take on more PM and design responsibilities as part of the existing role.

reeredfdfdf|1 month ago

I agree. In many cases it's probably easier for a developer to become more of a product person, than for a product person to become a dev. Even with LLM's you still need to have some technical skills & be able to read code to handle technical tasks effectively.

Of course things might look different when the product is something that requires really deep domain knowledge.

jzig|1 month ago

I don't think the two are mutually exclusive! e.g. a T-shaped product engineer on one side and a T-shaped SRE on the other. Both will kind of compact what used to be multiple roles/responsibilities together. The good news (and my prediction) IMO is the engineering won't be going away as much as the other roles.

pjmlp|1 month ago

Or architects, someone has to draw the nice diagrams and spec files for the robots.

However, like in automated factories, only a small percentage is required to stay around.

silisili|1 month ago

I was an old school SRE before the days of containerization and such. Today, we have one who is a YAML wizard and I won't even pretend to begin to understand the entire architecture between all the moving pieces(kube, flux, helm, etc).

That said, Claude has absolutely no problem not only answering questions, but finding bugs and adding new features to it.

In short, I feel they're just as screwed as us devs.

adelmotsjr|1 month ago

For those who were oblivious to what SRE means, just like me: SRE is _site reliability engineering_

F7F7F7|1 month ago

I knew what an SRE was and found the article somewhat interesting with a slightly novel (throwaway), more realistic take, on the "why need Salesforce when you can vibe your own Salesforce convo."

But not defining what an SRE is feels like a glaring, almost suffocating, omission.

ares623|1 month ago

Seemingly Random Engineering

Sparkyte|1 month ago

As an SRE I can tell you AI can't do everything. I have done a little software development, even AI can't do everything. What we are likely to see is operational engineering become the consolidated role between the two. Knows enough about software development and knows enough about site reliability... blamo operational engineer.

mellosouls|1 month ago

"As an SRE I can tell you AI can't do everything."

That's what they used to say about software engineering and yet this is becoming less and less obvious as capabilities increase.

There are no hiding places for any of us.

squidbeak|1 month ago

Paraphrase: "As an SRE I can tell you that the undetermined and unknowable potential of AI definitely won't involve my job being replaced."

stared|1 month ago

Yet, AI is not there yet. Even the top models struggle at simplest SRE tasks.

We just created a benchmark on adding distributed logs (OpenTelemetry instrumentation) to small services, around 300 lines of code.

Claude Opus 4.5 succeeded at 29%, GPT 5.2 at 26%, Gemini 3 Pro at 16%.

https://quesma.com/blog/introducing-otel-bench/

chubot|1 month ago

Yeah, I think that when writing code becomes cheap, then all the COMPLEMENTS become more valuable:

    - testing
    - reviewing, and reading/understanding/explaining
    - operations / SRE

mon_|1 month ago

But what if those complementary skills also become cheap?

northfield27|1 month ago

Agreed. I believe this is going to be the trend.

I don’t think LLM context will be able to digest large codebases, and their algorithms are not going to reason like SREs in the coming years. And given the current hype and market, investors are gonna pull out with recessions all over the world and we will see another AI winter.

Code has become a commodity. Corporate engineering hierarchy will be much flatter in coming years both horizontally and vertically - one staff will command two senior engineers with two juniors each, orchestrating N agents each.

I think that’s it - this is the end of bootcamp devs. This will act as a great filter and probably decrease the mass influx of bootcamp devs.

deadbabe|1 month ago

Bootcamp devs were always going to be doomed in the job market. They were a symptom of not having enough true classically trained computer science degree holding engineers to hire, so you compromised by looking for anyone that knew how to code well enough. But this problem eventually corrects.

Now, there are way too many computer science grads in a time when code is easy and cheap. Not much to gain from hiring a bootcamp dev over the real deal.

But I would say if you truly enjoy coding and you didn’t get to study CS in a university, a bootcamp is probably a fun experience to go through just for your own enjoyment, not for job seeking purposes. Just don’t pay too much.

stackskipton|1 month ago

As someone who works in an Ops role (SRE/DevOps/Sysadmin), SREs are something that only works at Google, mainly because for Devs to do SRE, they need the ability to reject or demand code fixes, which means you need someone being a prompt engineer who needs to understand the code, and now they're back to being a developer.

As for more dedicated to Ops side, it's garbage in, garbage out. I've already had too many outages caused by AI Slop being fed into production, calling all Developers = SRE won't change the fact that AI can't program now without massive experienced people controlling it.

bionsystem|1 month ago

Most devs can't do SRE, in fact the best devs I've met know they can't do SRE (and vice versa). If I may get a bit philosophical, SRE must be conservative by nature and I feel that devs are often innovative by nature. Another argument is that they simply focus on different problems. One sets up an IDE and clicks play, has some ephemeral devcontainer environment that "just works", and the hard part is to craft the software. The other has the software ready and sometimes very few instructions on how to run it, + your typical production issues, security, scaling, etc. The brain of each gets wired differently over time to solve those very different issues effectively.

austin-cheney|1 month ago

I manage a team of developers in a low code environment without AI. The junior developer positions require 8 years of experience, which I think is absurd. Everybody has to program on their own, though pair programming for knowledge transfer is super frequent, but the primary skills of concern are operational excellence (including some project management tasks), transmission, and reliability.

From a people perspective that means excellence when working with outside teams and gathering requirements on your own. It also means always knowing the status of your work in all environments, even in production after deployment. If your soft skills are strong and you can independently program work streams that touch multiple external parties you are golden. It seems this is the future.

mxuribe|1 month ago

I'm sorry, nothing personal...but any place that requires 8 years of experience but only gives a title of "junior" is pretty dang close to a sweat shop.

On a different note, i do see what you mention about some op excellence skills (e.g. project management, requirements gathering, etc.) being areas of concern at my $dayjob. But, i kinda always saw them as skills that are valuable in any era, and need not only be in this AI era....but everyone's mileage and environment certainly can vary that expectation. Also, at my $dayjob, the business lacks so much funding to pay software vendors fairly, properly that we get what we pay for....so it's often low quality output. It's not low *code* because we employ and contract regular, full code devs....but it certainly often is poor quality...and i wonder as low code offerings and opportunities - paired with more solid AI development assistance - continue to emerge, i suppose something like a SRE role can become that much more important - regardless if one works in low code or low cost arena.

mexicocitinluez|1 month ago

> All he wanted was to make his job easier and now he's shackled to this stupid system.

What people failed to grasp about low-code/no-code tools (and what I believe the author ultimately says) is that it was never about technical ability. It was about time.

The people who were "supposed" to be the targets of these tools didn't have the time to begin with, let alone the technical experience to round out the rough edges. It's a chore maintaining these types of things.

These tools don't change that equation. I truly believe that we'll see a new golden age of targeted, bespoke software that can now be developed cheaper instead of small/medium businesses utilizing off-the-shelf, one-size-fits-all solutions.

ivan_gammel|1 month ago

Operational excellence was always part of the job, regardless of what fancy term described it, be it DevOps, SRE or something else. The future of software engineering is software engineering, with emphasis on engineering.

zahlman|1 month ago

> And you definitely don't care how a payments network point of sale terminal and your bank talk to each other... Good software is invisible.

> ...

> Are you keeping up with security updates? Will you leak all my data? Do I trust you? Can I rely on you?

IMO, if the answers to those questions matter to you, then you damn well should care how it works. Because even if you aren't sufficiently technically minded to audit the system, having someone be able to describe it to you coherently is an important starting point in building that trust and having reason to believe that security and privacy will work as advertised.

stosssik|1 month ago

Totally agree. Vibe coding will generate lots of internal AI apps, but turning them into reliable, secure, governed services still requires real engineering, which is exactly why we’re building https://manifest.build. It lets non-technical teams build Agentic apps fast through an AI powered workflow builder while giving engineering and IT a single platform to add governance, security, data access, and keep everything production-ready at scale.

coffeefirst|1 month ago

In other words, the apps will be trash, and an operations team that doesn't have the time, capability, or mandate to fix them will be constantly scrambling to keep the fires out?

Sounds... reliable.

deadbabe|1 month ago

CRE - Code Reliability Engineering

AI will not get much better than what we have today, and what we have today is not enough to totally transform software engineering. It is a little easier to be a software engineer now, but that’s it. You can still fuck everything up.

falcor84|1 month ago

> AI will not get much better than what we have today

Wow, where did this come from?

From what just comes to my mind based on recent research, I'd expect at least the following this or next year:

* Continuous learning via an architectural change like Titans or TTT-E2E.

* Advancement in World Models (many labs focusing on them now)

* Longer-running agentic systems, with Gas Town being a recent proof of concept.

* Advances in computer and browser usage - tons of money being poured into this, and RL with self-play is straightforward

* AI integration into robotics, especially when coupled with world models

alexgotoi|1 month ago

There were several cheaper-than-programmers options to automate things, Robotic Process Automation being probably the best known, but it never got the expected traction.

Why (imo)? Senior leaders still like to say: I run a 500 headcount finance EMEA organization for Siemens, I am the Chief People Officer of Meta and I lead an org of 1000 smart HR pros. Most of their status is still tied to the org headcount.

petetnt|1 month ago

Again there's a cognitive dissonance in play here where the future of coding is somehow LLMs, but at the same time the LLMs would not evolve to handle the operations as well, even if we disregard pipedreams about AGIs being just around the corner. Especially when markdown files for AI are essentially glorified runbooks.

sylvainkalache|1 month ago

If the future of software engineering is SRE, because GenAI is taking care of coding, a similar trend is coming for SRE-type work.

It's called AI SRE, and for now, it's mostly targeted at helping on-call engineers investigate and solve incidents. But of course, these agents can also be used proactively to improve reliability.

mg794613|1 month ago

Euh, our job is hard enough as it is, don't start leaning on us to clean up the AI mess too.

joe_91|1 month ago

True, but also need to know the basics well of what constitutes good code and how it should scale vs just working code. Too many people relying on LLMs to produce stuff which just about works but give users a terrible experience as it barely works.

johndoh42|1 month ago

“People don’t buy software, they hire a service” is a bullshit straw man.

That OS on your laptop? Software. The terminal your SSH runs in? Software. The browser you’re reading this take in? Software. The editor you wrote your last 10k LOC in? Software.

The only “service” I buy is email — and even that I run myself. It’s still just software, plus ops.

Yes, running things is hard. Nobody serious disputes that. But pretending this is some new revelation is ahistorical. We used to call this systems engineering, operations, reliability, or just doing your job before SRE needed a brand deck.

And let’s be clear about the direction of value:

Software without SRE still has value. SRE without software has none.

A binary I can run, copy, fork, and understand beats a perfectly monitored nothing. A CLI tool with zero uptime guarantees still solves problems. A library still ships value. A game still runs. A compiler still compiles.

Ops exists to serve software, not replace it. Reliability amplifies value — it does not create it.

If “writing code is easy,” why is the world drowning in unreliable, unmaintainable, over-engineered trash with immaculate dashboards and flawless incident postmortems?

People buy software. They appreciate service when the software becomes infrastructure. Confusing the two is how you end up worshipping uptime graphs while shipping nothing worth running.

arbirk|1 month ago

I have a lot of work: Make the agents work at warp speed. Prepare specs for next iteration Hopefully exhaust resources.. for free time. <rest as much as possible>

Every 5 hours 24/7. Rinse repeat

nbevans|1 month ago

Surely SRE is just a .md file like everything else? :upside-down-face:

outside2344|1 month ago

And the other part of the future is that we are all going to become "editors" (in the publishing sense) instead of "writers"

Artoooooor|1 month ago

The only thing lacking in this article was explanation of the abbreviation from the title. SRE = Site Reliability Engineer(ing).

didip|1 month ago

Real SRE? or low skilled sysadmin drowned in pagers calling themselves as SRE? Because the future is bleak if it’s the latter.

eschneider|1 month ago

Who wants to be on-call for someone else's buggy vibe-coded app? Sign me right up for that...

willtemperley|1 month ago

This may be true about SaaS. Not all software is SaaS, thankfully.

pjmlp|1 month ago

Except the small detail that as proven by all the people that lost their jobs to factory robots, the number of required SRE is relatively small in proportion to existing demographics of SWEs.

Also this doesn't cover most of the jobs, which are actually in consulting, and not product development.

siliconc0w|1 month ago

IMO SRE works mostly because they exist outside the product engineering organization. They want to help you succeed but if you want to YOLO your launch and move fast and break things they have the option to hand back the pager and find other work. That option is rarely used but the option alone seems to create better than usual incentives.

With Vibecoding I imagine the LLM will get a MCP that allows them to schedule the jobs on Kubernetes or whatever IaaS and a fleet of agents will do the basic troubleshooting or whackamole type activities, leaving only the hard problems for human SRE. Before and after AI, the corporate incentives will always be to ship slop unless there is a counterbalancing force keeping the shipping team accountable to higher standards.

trkabv|1 month ago

We have another person without any respect for the actual stack that powers his fantasies writing LLM propaganda.

Who probably has never written anything of value in his life and therefore approves the theft of other people's valuable work.

ks2048|1 month ago

This says nothing about how if AI can write software, AI cannot do these other things.

almosthere|1 month ago

Until you find out there are 40 - 80 startups writing agents in the SRE space :/

Nextgrid|1 month ago

It only matters if any of those can promise reliability and either put their own money where their mouth is or convince (and actually get them to pay up) a bigger player to insure them.

Ultimately hardware, software, QA, etc is all about delivering a system that produces certain outputs for certain inputs, with certain penalties if it doesn’t. If you can, great, if you can’t, good luck. Whether you achieve the “can” with human development or LLM is of little concern as long as you can pay out the penalties of “can’t”.

cl0ckt0wer|1 month ago

Reliable ai agents would make you a trillionaire.

ozim|1 month ago

Basically that’s what people are doing with YOLO mode letting Claude do everything in the system.

ikiris|1 month ago

And I wish them luck, because the thought of current ai bots doing SRE work effectively is laughable.

metasim|1 month ago

What’s an “SRE”?

netdevphoenix|1 month ago

Site Reliability Engineering. It is the role that, among other things, ensures that a service uptime is optimal. It's the closest thing we have nowadays to the system admin role

giancarlostoro|1 month ago

What? Maybe OPs future. SWE is just going to replace QA and maybe architects if the industry adopts AI more, but there's a lot of hold outs. There's plenty of projects out there that are 'boring' and will not bother.

hahahahhaah|1 month ago

Operational excellence will always be needed but part of that is writing good code. If the slop machine has made bad decisions it could be more efficient to rewrite using human expertise and deploy that.

dionian|1 month ago

But there is bad code and good code and SREs can't tell you which is which, nor fix it.

bionsystem|1 month ago

My take (I'm an SRE) is that SRE should work pre-emptively to provide reproducible prod-like environments so that QA can test DEV code closer to real-life conditions. Most prod platforms I've seen are nowhere near that level of automation, which makes it really hard to detect or even reproduce production issues.

And no, as an SRE I won't read DEV code, but I can help my team test it.

VirusNewbie|1 month ago

Why not? I'm a SWE SRE and I'm arguably better at telling good code from bad code than many of the pure devs I've worked with.

tasuki|1 month ago

> Writing code was always the easy part of this job. The hard part was keeping your code running for the long time.

Spoken like a true SRE. I'm mostly writing code, rather than working on keeping it in production, but I've had websites up since 2006 (hope that counts as long time in this corner of the internet) with very little down time and frankly not much effort.

My experience with SREs was largely that they're glorified SSH: they tell me I'm the programmer and I should know what to type into their shell to debug the problem (despite them SREing those services for years, while I joined two months ago and haven't even seen the particular service). But no I can't have shell access, and yes I should be the one spelling out what needs to be typed in.