top | item 47323017

After outages, Amazon to make senior engineers sign off on AI-assisted changes

659 points | ndr42 | 23 days ago | arstechnica.com | reply

https://www.ft.com/content/7cab4ec7-4712-4137-b602-119a44f77... (https://archive.ph/wXvF3)

https://twitter.com/lukolejnik/status/2031257644724342957 (https://xcancel.com/lukolejnik/status/2031257644724342957)

486 comments

[+] cobolcomesback|23 days ago|reply
This “mandatory meeting” is just the usual weekly company-wide meeting where recent operational issues are discussed. There was a big operational issue last week, so of course this week will have more attendance and discussion.

This meeting happens literally every week, and has for years. Feels like the media is making a mountain out of a molehill here.

[+] davidclark|23 days ago|reply
The article claims:

>He asked staff to attend the meeting, which is normally optional.

Is that false? It also discusses a new policy:

>Junior and mid-level engineers will now require more senior engineers to sign off any AI-assisted changes, Treadwell added.

Is that inaccurate? It is good context that this is a regularly scheduled meeting. But, regularly scheduled meetings can have newsworthy things happen at them.

[+] CoolGuySteve|23 days ago|reply
It didn't seem to make the news but at least in NYC the entire Amazon storefront was broken all afternoon on Friday.

Items weren't displaying prices and it was impossible to add anything to your cart. It lasted from about 2pm to 5pm.

It's especially strange because if a computer glitch brought down a large retail competitor like Walmart I probably would have seen something even though their sales volume is lower.

[+] belval|23 days ago|reply
I am not in that specific meeting, but it made me chuckle that a weekly ops meeting would somehow get media attention. It's been an Amazon thing forever. Wait until the public learns about CoEs!
[+] otterley|23 days ago|reply
> Feels like the media is making a mountain out of a molehill here.

That's been their job ever since cable news was invented.

[+] groundzeros2015|22 days ago|reply
It’s always sobering to see a news story about something you have insider perspective on.
[+] embedding-shape|23 days ago|reply
> This meeting happens literally every week, and has for years. Feels like the media is making a mountain out of a molehill here.

Are you completely missing the point of the submission? It's not about "Amazon has a mandatory weekly meeting" but about the contents of that specific meeting, about AI-assisted tooling leading to "trends of incidents", having a "large blast radius" and "best practices and safeguards are not yet fully established".

No one cares how often the meeting in general is held, or if it's mandatory or not.

[+] furyofantares|23 days ago|reply
This reply chain is confusing but I'm guessing got merged from another thread that had a different title?

Must have been, as the comments are hours older than the OP.

[+] cmiles8|23 days ago|reply
The core message of the article is that Amazon has been having issues with AI slop causing operational reliability concerns, and that seems to be 100% accurate.
[+] rahbert|22 days ago|reply
This is correct. We ran them on Wednesdays in Alexa. Jassy actually used to come and sit in ours once a quarter or so when he was running AWS.
[+] Clent|23 days ago|reply
Who is the media you're accusing here? This is a Twitter post. As far as I can tell they do not work at a media company.

What is worth being pointed out is how quickly people blame "The Media" for how people use, consume and spread information on social networks.

[+] niwtsol|23 days ago|reply
I believe it is by group - AWS started the weekly operations meeting; effectively, every service's on-call from the last week had to attend. Then it grew massive, so they made it optional. Alexa had a similar meeting that tried to replicate what AWS did. A lot of time was spent reviewing load tests getting ready for the holiday season, Prime Day, and the Super Bowl (Super Bowl ads used to cause crazy TPS spikes for Alexa), and there was a lot of finger-pointing if there was an outage from one team. While it probably did help raise the operational bar, so much engineer time was wasted on busywork/paperwork documenting an error or fix vs improving the actual service.
[+] happytoexplain|23 days ago|reply
>Junior and mid-level engineers can no longer push AI-assisted code without a senior signing off

Review by a senior is one of the biggest "silver bullet" illusions managers suffer from. For a person (senior or otherwise) to examine code or configuration with the granularity required to verify that it even approximates the result of their own level of experience, even only in terms of security/stability/correctness, requires an amount of time approaching the time spent if they had just done it themselves.

I.e. senior review is valuable, but it does not make bad code good.

This is one major facet of probably the single biggest problem of the last couple decades in system management: The misunderstanding by management that making something idiot proof means you can now hire idiots (not intended as an insult, just using the terminology of the phrase "idiot proof").

[+] ardeaver|23 days ago|reply
When I was really early in my career, a mentor told me that code review is not about catching bugs but spreading context (i.e. increasing bus factor.) Catching bugs is a side effect, but unless you have a lot of people review each pull request, it's basically just gambling.

The more expensive and less sexy option is to actually make testing easier (both programmatically and manually), write more tests and more levels of tests, and spend time reducing code complexity. The problem, I think, is people don't get promoted for preventing issues.

[+] marginalia_nu|23 days ago|reply
Expert reviews are just about the only thing that makes AI-generated code viable, though doing them after the fact is a bit sketchy; to be efficient you kinda need to keep an eye on what the model is doing as it's working.

Unchecked, AI models output code that is as buggy as it is inefficient. In smaller greenfield contexts it's not so bad, but in a large code base it performs much worse, as it will not have access to the bigger picture.

In my experience, you should be spending something like 5-15x the time the model takes to implement a feature on reviewing it and making it fix its errors and inefficiencies. If you do that (with an expert's eye), the changes will usually be high quality and correct.

If you do not do that due diligence, the model will produce a staggering amount of low-quality code, at a rate that is probably something like 100x what a human could output in a similar timespan. Unchecked, it's like having a small army of the most eager junior devs you can find going completely fucking ape in the codebase.

[+] js8|23 days ago|reply
> requires an amount of time approaching the time spent if they had just done it themselves

It's actually often harder to fix something sloppy than to write it from scratch. To fix it, you need to hold both the original and the new solution in your head and compute the difference, which can be very confusing. The original solution can also anchor your thinking to some approach to the problem, which you wouldn't have if you solved it from scratch.

[+] unshavedyak|23 days ago|reply
> For a person (senior or otherwise) to examine code or configuration with the granularity required to verify that it even approximates the result of their own level of experience, even only in terms of security/stability/correctness, requires an amount of time approaching the time spent if they had just done it themselves.

Hell, often it feels slower/worse. Foreign code is easily confusing at first, which slows you down - and bad code quickly gets bewildering and sends you down paths of clarifications that waste time.

[+] steveBK123|23 days ago|reply
Right, code reviews should already have been happening with human written junior code.

If AI is a productivity boost and juniors are going to generate 10x the PRs, do you need 10x the seniors (expensive) or 1/10th the juniors (cost save)?

A reminder that in many situations, pure code velocity was never the limiting factor.

Re: idiot-proofing, I think this is a natural evolution: as companies get larger they try to limit their downside and manage for the median, rather than having a growth mindset in hiring/firing/performance.

[+] AgentOrange1234|23 days ago|reply
Seniors are going to need to hold Juniors to a high bar for understanding and explaining what they are committing. Otherwise it will become totally soul destroying to have a bunch of juniors submitting piles of nonsense and claiming they are blocked on you all the time.
[+] jetrink|23 days ago|reply
It could create the right sort of incentives though. If I'm a junior and I suddenly have to take my work to a senior every time I use AI, I'm going to be much more selective about how I use it and much more careful when I do use it. AI is dangerous because it is so frictionless and this is a way to add friction.

Maybe I don't have the correct mental model for how the typical junior engineer thinks though. I never wanted to bug senior people and make demands on their time if I could help it.

[+] onion2k|23 days ago|reply
> I.e. senior review is valuable, but it does not make bad code good.

I suspect that isn't the goal.

Review by more senior people shifts accountability from the Junior to a Senior, and reframes the problem from "Oh dear, the junior broke everything because they didn't know any better" to "Ah, that Senior is underperforming because they approved code that broke everything."

[+] bs7280|23 days ago|reply
This is also why I think we will enter a world without juniors. The time it takes for a senior to review a junior's AI code is more expensive than if the senior produced their own AI code from scratch. Factor in the lack of meetings on a senior-only team, and the productivity gains will appear to be massive.

Whether or not these productivity gains are realized is another question, but spreadsheet based decision makers are going to try.

[+] hintymad|23 days ago|reply
> Review by a senior is one of the biggest "silver bullet" illusions managers suffer from

Especially in a big co like Amazon, most senior engineers are box drawers, meeting goers, gatekeepers, vision setters, org lubricants, VPs' trustees, glorified product managers, etc. They don't necessarily know more context than the more junior engineers, and they will most likely review slowly while uncovering fewer issues.

[+] raw_anon_1111|23 days ago|reply
Why only AI-generated code? I wouldn't let a junior or mid-level developer's code go into production without at least verifying the known hotspots - concurrency, security, database schema, and the various other non-functional requirements that only bite you in production.

I’m probably not going to review a random website built by someone except for usability, requirements and security.

[+] belval|23 days ago|reply
The unwritten thing is that if you need seniors to review every single change from junior and mid-level engineers, and those engineers are mostly using Kiro to write their CRs, then what stops the senior from just writing the CRs with Kiro themselves?
[+] qnleigh|23 days ago|reply
I seriously doubt that they think senior reviewers will meticulously hunt down and fix all the AI bugs. Even if they could, they surely don't have the time. But it offers other benefits here:

1. They can assess whether the use of AI is appropriate without looking in detail. E.g. if the AI changed 1000 lines of code to fix a minor bug, or changed code that is essential for security.

2. To discourage AI use, because of the added friction.

[+] zamalek|23 days ago|reply
> Review by a senior is one of the biggest "silver bullet" illusions managers suffer from.

My manager has been urging us to truly vibe code, just yesterday saying that "language is irrelevant because we've reached the point where it works - so you don't need to see it." This article is a godsend; I'll take this flawed silver bullet any day of the week.

[+] mrothroc|23 days ago|reply
Senior review can definitely help, regardless of whether the code comes from a junior or an LLM. We've done this since the dawn of time. However, it doesn't scale, and since LLM volume far exceeds what juniors can do, you end up overwhelming the seniors, who are normally overbooked anyway.

The other problem is that the types of errors LLMs make are different from those juniors make. There are huge sections of genuinely good code, so the senior gets "review fatigue": so much looks good that they just start rubber-stamping.

I use an automated pipeline to generate code (including terraform, risking infrastructure nukes), and I am the senior reviewer. But I have gates that do a whole range of checks, both deterministic and stochastic, before it ever gets to me. Easy things are pushed back to the LLM for it to autofix. I only see things where my eyes can actually make a difference.

Amazon's instinct is right (add a gate), but the implementation is wrong (make it human). Automated checks first, humans for what's left.

[+] yifanl|23 days ago|reply
Senior reviews are useful, but as I understand it, Amazon has a fairly high turnover rate, so I wonder just how many seniors with deep knowledge of the codebase they could possibly have.
[+] grvdrm|23 days ago|reply
What a statement at the end. You are absolutely right.

I hear “x tool doesn’t really work well” and then I immediately ask: “does someone know how to use it well?” The answer “yes” is infrequent. Even a yes is often a maybe.

The problem is pervasive in my world (insurance). Number-producing features need to work in a UX and product sense but also produce the right numbers, and within range of expectations. Just checking the UX does what it’s supposed to do is one job, and checking the numbers an entirely separate task.

I don't know many folks who do both well.

[+] 33MHz-i486|22 days ago|reply
In case it isn't completely obvious from this, it is indeed hellish to work there. Most of AWS has a 2-reviewer requirement. If AI is writing most of the code (and it is, because most Amazon code is copypasta boilerplate), you need 3 developers to sign off to ship anything. But of course, due to headcount attrition, managers have ~1.5 developers to a project. Meanwhile the L8 manager is doing nothing except stack-ranking each level of engineers according to number of commits merged and customer-facing features shipped, and firing the bottom 15% at the end of each year. There is no notion of subject matter expertise or technical depth; they're happy to replace whoever with fresh grads (they're all just cogs anyway, right?). Between that and voluntary departures, teams having 80-100% turnover every 5 years is basically par for the course.

Also, while this is happening, most developers are getting constantly hammered by operational issues and critical security tasks because 1) the legacy toolchain imports 6 different language package ecosystems and 2) no one ever pays down tech debt in legacy code until it's a high-severity ticket count in a KPI dashboard visible to senior management.

[+] prakhar897|23 days ago|reply
From the Amazon I know, people only care about a) not getting fired and b) promotions. For devs, the matrix looks like this:

1. Shipping: deliver tickets or be pipped.

2. Having fewer comments on their PRs: for some drastically dumb reason, having a PR thoroughly reviewed is a sign of bad quality. L7 and above use this metric to PIP folks.

3. Docs: write docs, get them reviewed to show you're high level.

Without AI, an employee is worse off in all of the above compared to folks who will cheat to get ahead.

I can't see how "requesting" folks to forego their own self-preservation will work, especially when you've spent years pitting people against each other.

[+] sdevonoes|23 days ago|reply
Reviewing AI generated code at PR time is a bottleneck. It cancels most of the benefits senior leadership thinks AI offers (delivery speed).

There’s also this implicit imbalance engineers typically don’t like: it takes me 10 min to submit a complete feature thanks to Claude… but for the human reviewing my PR in a manual way it will take them 10-20 times that.

Edit: at the end, real engineers know that what takes effort is a) knowing what to build and why, and b) verifying that what was built is correct. Currently AI doesn't help much with either of these two points.

The in-betweens are needed, but they are a byproduct. Senior leadership doesn't know this, though.

[+] cmiles8|23 days ago|reply
The optics here are really bad for Amazon. The continuing mass departures of long-tenured folks, second-rate AI products, and a string of bad outages paint a picture of current leadership overseeing a once-respected engineering train flying off the tracks.

News from the inside makes it sound like things are getting pretty bad.

[+] philip1209|23 days ago|reply
I think the deeper need is a "self-review" flow.

People push AI-generated code like they wrote it. In the past, "wrote it" implied "reviewed it." With AI, that's no longer true.

I advocate for GitHub and other code review systems to add a "Require self-review" option, where people must attest that they reviewed and approved their own code. This change might seem symbolic, but it clearly sets workflows and expectations.
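As a rough illustration of what such an attestation check could look like (GitHub has no built-in "Require self-review" option today; the checkbox text and the idea of enforcing it from CI are invented for this sketch):

```python
import re

# Hypothetical attestation line a PR template would include as a checkbox.
ATTESTATION = re.compile(
    r"-\s*\[x\]\s*I reviewed this code myself before requesting review",
    re.IGNORECASE,
)

def self_review_attested(pr_body: str) -> bool:
    """Pass the CI gate only if the author checked the attestation box."""
    return ATTESTATION.search(pr_body) is not None
```

A CI job would fetch the PR description and fail the build until the box is ticked - symbolic, as the parent says, but it makes the expectation explicit.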

[+] ritlo|23 days ago|reply
The only way to see the kinds of speed-up companies want from these things, right now, is to do way too little review. I think we're going to see a lot of failures in a lot of sectors where companies set goals for reduced hours on various things they do, based on what they expected from LLM speed-ups, and it will have turned out the only way to hit those goals was by spending way too little time reviewing LLM output.

They're torn between "we want to fire 80% of you" and "... but if we don't give up quality/reliability, LLMs only save a little time, not a ton, so we can only fire like 5% of you max".

(It's the same in writing, these things are only a huge speed-up if it's OK for the output to be low-quality, but good output using LLMs only saves a little time versus writing entirely by-hand—so far, anyway, of course these systems are changing by the day, but this specific limitation has remained true for about four years now, without much improvement)

[+] petterroea|22 days ago|reply
I feel bad for the seniors who have to take on this workload. The general pattern I am seeing is that seniors at "AI-first" companies are held back from their own work by reviewing PRs from juniors, who can now ship far more code without understanding how bad it is.

Mentoring juniors is an important part of the job and a crucial service to the industry, but juniors equipped with LLMs make the deal a bit more sour. Anecdotally, they don't really remember the feedback as well, because they weren't involved in writing the code. It's burnout-inducing to see your hard work and feedback go in one ear and out the other.

I personally know people looking to jump ship because they waste too much time at their current employer on this.

[+] lokar|23 days ago|reply
If this is true, it misunderstands the primary goals of code review.

Code review should not be (primarily) about catching serious errors. If there are always a lot of errors, you can’t catch most of them with review. If there are few it’s not the best use of time.

The goal is to ensure the team is in sync on design, standards, etc. To train and educate Jr engineers, to spread understanding of the system. To bring more points of view to complex and important decisions.

These goals help you reduce the number of errors going into the review process; that should be the actual goal.

[+] 827a|22 days ago|reply
> Company that lays off 20% of its staff every year in an attempt to "reduce inefficiency" and "remain agile in the adoption of new technologies and workflows" finds they cannot run a stable service, have more inefficiency than ever, and have also failed to establish leadership in the adoption of any new technologies or workflows. They plan to solve these problems by introducing more inefficiency (making your most expensive employees review the work of others).

We love this for Amazon, they're a very strong company making bold decisions.

[+] znpy|23 days ago|reply
"Make senior engineers sign off on AI-assisted changes" sounds incredibly weird.

The first thing that comes to mind: it reminds me of those movies where some dictatorship starts to crumble and the dictator starts being tougher and tougher on the generals, not realizing the whole endeavor is doomed, not just the current implementation.

Then again, as a former Amazon (AWS) engineer: this is just not going to work. Depending on how you define "senior engineer" (L5? L6? L7?), this is less and less feasible.

L5 engineers are already supposed to work pretty much autonomously, maybe with L6 sign-off when changes are a bit large in scope.

L6 engineers already have their own load of work, and a fairly large number of engineers "under" them (anywhere from 5 to 8). Properly reviewing changes from all of them, and taking responsibility for that, is going to be very taxing on such people.

L7 engineers work across teams and they might have anywhere from 12 to 30 engineers (L4/5/6) "under" them (or more). They are already scarce in number and they already pretty much mostly do reviews (which is proving not sufficient, it seems). Mandating sign-off and mandating assumption of responsibility for breaking changes means these people basically only do reviews and will be stricter and stricter[1] with engineers under them.

L8 engineers, they barely do any engineering at all, from what I remember. They mostly review design documents, in my experience not always expressing sound opinions or having proper understanding of the issues being handled.

In all this, considering the low morale (layoffs), the reduced headcount (layoffs) and the rise in expectations (engineers trying harder to stay afloat[2] due to... layoffs)... It's a dire situation.

I'm going to tell you, this stinks A LOT like rotting day 2 mindset.

----

1. keep in mind you can't, in general, determine the absence of bugs

2. Also cranking out WAY more code due to having gen-AI tools at their fingertips...

[+] paxys|23 days ago|reply
Someone should teach the decision makers how pipelines work. If AI-created diffs are being churned out at 10x the previous rate but manual reviews are the bottleneck then the overall system is producing at the exact same rate as before. The only thing you have added is cost, uncertainty and engineers being less familiar with the system.
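The bottleneck argument above is just the min-rule for serial pipelines; the numbers below are hypothetical, only the shape of the calculation matters:

```python
def pipeline_throughput(stage_rates: dict[str, float]) -> float:
    """A serial pipeline moves only as fast as its slowest stage."""
    return min(stage_rates.values())

# Hypothetical diffs-per-day rates before and after AI-assisted authoring.
before = {"authoring": 5, "human review": 5}
after = {"authoring": 50, "human review": 5}  # 10x faster authoring, same reviewers

# Overall output is unchanged; the extra diffs just pile up in the review queue.
```

Speeding up the non-bottleneck stage leaves end-to-end throughput where it was, which is exactly the parent's point.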
[+] Lalabadie|23 days ago|reply
I'm not sure the sustainable solution is to treat an excess of lower-quality code output as the fixed thing to work with, and operationalize around that, but sure.
[+] sethops1|23 days ago|reply
> The response for now? Junior and mid-level engineers can no longer push AI-assisted code without a senior signing off.

So basically, kill the productivity of senior engineers, kill the ability for junior engineers to learn anything, and ensure those senior engineers hate their jobs.

Bold move, we'll see how that goes.

[+] rglover|23 days ago|reply
The amount of time and money being wasted chasing this dragon is unreal.
[+] ndr42|23 days ago|reply
I think the problem of responsibility will come for many more companies sooner rather than later. It is possible that some of the alleged efficiency gains from using AI are not so big anymore once someone has to be accountable for the results.
[+] AlotOfReading|23 days ago|reply
I'm not surprised by the outages, but I am surprised that they're leaning into human code review as a solution rather than a neverending succession of LLM PR reviewers.

I wonder if it's an early step towards an apprenticeship system.

[+] mentos|23 days ago|reply
What are we going to do about software for critical infrastructure in the coming decade?

Feels inevitable that code for aviation will slowly rot from the same forces at play but with lethal results.

[+] AlexeyBrin|23 days ago|reply
I wonder how this will work in practice. Say I'm a senior engineer and I myself produce thousands of lines of code per day with the help of LLMs, as mandated by the company. I presumably still need to read and test the code that I push to production. When will I have time to read and evaluate similar amounts of code produced by a junior or mid-level engineer?