Watching AI drive Microsoft employees insane

[+] diggan|10 months ago|reply

Interesting that every comment has "Help improve Copilot by leaving feedback using the or buttons" suffix, yet none of the comments received any feedback, either positive or negative.

> This seems like it's fixing the symptom rather than the underlying issue?

This is also my experience when you haven't setup a proper system prompt to address this for everything an LLM does. Funniest PRs are the ones that "resolves" test failures by removing/commenting out the test cases, or change the assertions. Googles and Microsofts models seems more likely to do this than OpenAIs and Anthropics models, I wonder if there is some difference in their internal processes that are leaking through here?

The same PR as the quote above continues with 3 more messages before the human seemingly gives up:

> please take a look

> Your new tests aren't being run because the new file wasn't added to the csproj

> Your added tests are failing.

I can't imagine how the people who have to deal with this are feeling. It's like you have a junior developer except they don't even read what you're telling them, and have 0 agency to understand what they're actually doing.

Another PR: https://github.com/dotnet/runtime/pull/115732/files

How are people reviewing that? 90% of the page height is taken up by "Check failure", can hardly see the code/diff at all. And as a cherry on top, the unit test has a comment that say "Test expressions mentioned in the issue". This whole thing would be fucking hilarious if I didn't feel so bad for the humans who are on the other side of this.

[+] surgical_fire|10 months ago|reply

> I can't imagine how the people who have to deal with this are feeling. It's like you have a junior developer except they don't even read what you're telling them, and have 0 agency to understand what they're actually doing.

That comparison is awful. I work with quite a few Junior developers and they can be competent. Certainly don't make the silly mistakes that LLMs do, don't need nearly as much handholding, and tend to learn pretty quickly so I don't have to keep repeating myself.

LLMs are decent code assistants when used with care, and can do a lot of heavy lifting, they certainly speed me up when I have a clear picture of what I want to do, and they are good to bounce off ideas when I am planning for something. That said, I really don't see how it could meaningfully replace an intern however, much less an actual developer.

[+] yubblegum|10 months ago|reply

This field (SE - when I started out back in late 80s) was enjoyable. Now it has become toxic, from the interview process, to imitating "big tech" songs and dances by small fry companies, and now this. Is there any joy left in being a professional software developer?

[+] mrweasel|10 months ago|reply

At least we can tell the junior developers to not submit a pull-request before they have the tests running locally.

At what point does the human developers just give up and close the PRs as "AI garbage". Keep the ones that works, then just junk the rest. I feel that at some point entertaining the machine becomes unbearable and people just stops doing it or rage close the PRs.

[+] throwup238|10 months ago|reply

> Interesting that every comment has "Help improve Copilot by leaving feedback using the or buttons" suffix, yet none of the comments received any feedback, either positive or negative.

The feedback buttons open a feedback form modal, they don’t reflect the number of feedback given like the emoji button. If you leave feedback, it will reflect your thumbs up/down (hiding the other button), it doesn’t say anything about whether anyone else has left feedback (I’ve tried it on my own repos).

[+] belter|10 months ago|reply

This whole thread from yesterday take a whole different meaning: https://news.ycombinator.com/item?id=44031432

Comment in the GitHub discussion:

"...You and I and every programmer who hasn't been living under a rock knows that AI isn't ready to be adopted at this scale yet, on the premier; 100M-user code-hosting platform. It doesn't make any sense except in brain-washed corporate-talk like "we are testing today what it can do tomorrow".

I'm not saying that this couldn't be an adequate change some day, perhaps even in a few years but we all know this isn't it today. It's 100% financial-driven hype with a pinch of we're too big to fail mentality..."

[+] vasco|10 months ago|reply

> improve Copilot by leaving feedback using the or buttons" suffix, yet none of the comments received any feedback, either positive or negative

Why do they even need it? Success is code getting merged 1st shot, failure gets worse the more requests for changes the agent gets. Asking for manual feedback seems like a waste of time. Measure cycle time and rate of approvals and change failure rate like you would for any developer.

[+] dfxm12|10 months ago|reply

It's like you have a junior developer except they don't even read what you're telling them, and have 0 agency to understand what they're actually doing.

Anyone who has dealt with Microsoft support knows this feeling well. Even talking to the higher level customer success folks feels like talking to a brick wall. After dozens of support cases, I can count on zero hands the number of issues that were closed satisfactorily.

I appreciate Microsoft eating their dogfood here, but please don't make me eat it too! If anyone from MS is reading this, please release finished products that you are prepared to support!

[+] xnorswap|10 months ago|reply

> How are people reviewing that? 90% of the page height is taken up by "Check failure",

Typically, you wouldn't bother manually reviewing something until the automated checks have passed.

[+] prossercj|10 months ago|reply

This comment on that PR is pure gold. The bots are talking to each other:

https://github.com/dotnet/runtime/pull/115732#issuecomment-2...

[+] spacecadet|10 months ago|reply

"I wonder if there is some difference in their internal processes that are leaking through here?"

Maybe, but likely it is reality and their true company culture leaking through. Eventually some higher eq execs might come to the very late realization that they cant actually lead or build a worthwhile and productive company culture and all that remains is an insane reflection of that.

[+] worldsayshi|10 months ago|reply

> How are people reviewing that?

I agree that not auto-collapsing repeated annotations is an annoying bug in the github interface.

But just pointing out that annotations can be hidden in the ... menu to the right (which I just learned).

[+] marmakoide|10 months ago|reply

Hot take : the whole LLM craze is fed by a delusion. LLM are good at mimicking human language, capturing some semantics on the way. With a large enough training set, the amount of semantic captured covers a large fraction of what the average human knows. This gives the illusion of intelligence, and the humans extrapolates on LLM capabilities, like actual coding. Because large amounts of code from textbooks and what not is on the training set, the illusion is convincing for people with shallow coding abilities.

And then, while the tech is not mature, running on delusion and sunken costs, it's actually used for production stuffs. Butlerian Jihad when

[+] unknown|10 months ago|reply

[deleted]

[+] ta1243|10 months ago|reply

> @copilot please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

[+] cruffle_duffle|10 months ago|reply

It’s probably the junior devs that get to review these PRs. That and interns.

[+] unknown|10 months ago|reply

[deleted]

[+] TheNewsIsHere|10 months ago|reply

> This whole thing would be fucking hilarious if I didn't feel so bad for the humans who are on the other side of this.

Which will soon be anyone who directly or indirectly relies on Microsoft technologies. Some of these PRs, including at least one that I saw reworked certificate validation logic with not much more than a perfunctory “LGTM”, have been merged into main.

Coincidentally, I wonder if issues orthogonal to this slop is why I’ve been getting so many HTTP 500 errors when using GitHub lately.

[+] kruuuder|10 months ago|reply

A comment on the first pull request provides some context:

> The stream of PRs is coming from requests from the maintainers of the repo. We're experimenting to understand the limits of what the tools can do today and preparing for what they'll be able to do tomorrow. Anything that gets merged is the responsibility of the maintainers, as is the case for any PR submitted by anyone to this open source and welcoming repo. Nothing gets merged without it meeting all the same quality bars and with us signing up for all the same maintenance requirements.

[+] abxyz|10 months ago|reply

The author of that comment, an employee of Microsoft, goes on to say:

> It is my opinion that anyone not at least thinking about benefiting from such tools will be left behind.

The read here is: Microsoft is so abuzz with excitement/panic about AI taking all software engineering jobs that Microsoft employees are jumping on board with Microsoft's AI push out of a fear of "being left behind". That's not the confidence inspiring the statement they intended it to be, it's the opposite, it underscores that this isn't the .net team "experimenting to understand the limits of what the tools" but rather the .net team trying to keep their jobs.

[+] lcnPylGDnU4H9OF|10 months ago|reply

This is important context given that it would be absurd for the managers to have already drawn a definitive conclusion about the models’ capabilities. An explicit understanding that the purpose of the exercise is to get a better idea of the current strengths and weaknesses of the models in a “real world” context makes this actually very reasonable.

[+] rsynnott|10 months ago|reply

Beyond every other absurdity here, well, maybe Microsoft is different, but I would never assign a PR that was _failing CI_ to somebody. That that's happening feels like an admission that the thing doesn't _really_ work at all; if it worked even slightly, it would at least only assign passing PRs, but presumably it's bad enough that if they put in that requirement there would be no PRs.

[+] sbarre|10 months ago|reply

I feel like everyone is applying a worse-case narrative to what's going on here..

I see this as a work in progress.. I am almost certain the humans in the loop on these PRs are well aware of what's going on and have their expectations in check, and this isn't just "business as usual" like any other PR or work assignment.

This is a test. You can't improve a system without testing it on real world conditions.

How do we know they're not tweaking the Copilot system prompts and settings behind the scenes while they're doing this work?

Can no one see the possibility that what is happening in those PRs is exactly what all the people involved expected to have happen, and they're just going through the process of seeing what happens when you try to refine and coach the system to either success or failure?

When we adopted AI coding assist tools internally over a year ago we did almost exactly this (not directly in GitHub though).

We asked a bunch of senior engineers to see how far they could get by coaching the AI to write code rather than writing it themselves. We wanted to calibrate our expectations and better understand the limits, strengths and weaknesses of these new tools we wanted to adopt.

In most of those early cases we ended up with worse code than if it had been written by humans, but we learned a ton. We can also clearly see how much better things have gotten over time, since we have that benchmark to look back on.

[+] Dlanv|10 months ago|reply

They said in the comments that currently the firewall is blocking it from checking tests for passing, and they need to fix that.

Otherwise it would check the tests are passing.

[+] robotcapital|10 months ago|reply

Replace the AI agent with any other new technology and this is an example of a company:

1. Working out in the open

2. Dogfooding their own product

3. Pushing the state of the art

Given that the negative impact here falls mostly (completely?) on the Microsoft team which opted into this, is there any reason why we shouldn't be supporting progress here?

[+] JB_Dev|10 months ago|reply

100% agree. i’m not sure why everyone is clowning on them here. This process is a win. Do people want this all being hidden instead in a forked private repo?

It’s showing the actual capabilities in practice. That’s much better and way more illuminating than what normally happens with sales and marketing hype.

[+] constantcrying|10 months ago|reply

Who is "we" and how and why would "we" "support" or not "support" anything.

Personally I just think it is funny that MS is soft launching a product into total failure.

[+] throwaway844498|10 months ago|reply

"Pushing the state of the art" and experimenting on a critical software development framework is probably not the best idea.

[+] mrguyorama|10 months ago|reply

>supporting progress

This presupposes AI IS progress.

Nevermind that what this actually shows is an executive or engineering team that so buys their own hype that they didn't even try to run this locally and internally before blasting to the world that their system can't even ensure tests are passing before submitting a PR. They are having a problem with firewall rules blocking the system from seeing CI outcomes and that's part of why it's doing so badly, so why wasn't that verified BEFORE doing this on stage?

"Working out in the open" here is a bad thing. These are issues that SHOULD have been caught by an internal POC FIRST. You don't publicly do bullshit.

"Dogfooding" doesn't require throwing this at important infrastructure code. Does VS code not have small bugs that need fixing? Infrastructure should expect high standards.

"Pushing the state of the art" is comedy. This is the state of the art? This is pushing the state of the art? How much money has been thrown into the fire for this result? How much did each of those PRs cost anyway?

[+] lawn|10 months ago|reply

Because they're using it on an extremely popular repository that many people depend on?

And given the absolute garbage the AI is putting out the quality of the repo will drop. Either slop code will get committed or the bots will suck away time from people who could've done something productive instead.

[+] globalise83|10 months ago|reply

Malicious compliance should be the order of the day. Just approve the requests without reviewing them and wait until management blinks when Microsoft's entire tech stack is on fire. Then quit your job and become a troubleshooter on x3 the pay.

[+] balazstorok|10 months ago|reply

At least opening PRs is a safe option, you can just dump the whole thing if it doesn't turn out to be useful.

Also, trying something new out will most likely have hiccups. Ultimately it may fail. But that doesn't mean it's not worth the effort.

The thing may rapidly evolve if it's being hard-tested on actual code and actual issues. For example it will be probably changed so that it will iterate until tests are actually running (and maybe some static checking can help it, like not deleting tests).

Waiting to see what happens. I expect it will find its niche in development and become actually useful, taking off menial tasks from developers.

[+] petetnt|10 months ago|reply

GitHub has spent billions of dollars building an AI that struggles with things like whitespace related linting errors on one of the most mature repositories available. This would be probably okay for a hobbyist experiment, but they are selling this as a groundbreaking product that costs real money.

[+] Philpax|10 months ago|reply

Stephen Toub, a Partner Software Engineer at MS, explaining that the maintainers are intentionally requesting these PRs to test Copilot: https://github.com/dotnet/runtime/pull/115762#issuecomment-2...

[+] Quarrelsome|10 months ago|reply

rah, we might be in trouble here. The primary issue at play is that we don't have a reliable means of measuring developer performance, outside of subjective judgement like end of year reviews.

This means its probably quite hard to measure the gain or the drag of using these agents. On one side, its a lot cheaper than a junior, but on the other side it pulls time from seniors and doesn't necessarily follow instruction well (i.e. "errr your new tests are failing").

This combined with the "cult of the CEO" sets the stage for organisational dissonance where developer complaints can be dismissed as "not wanting to be replaced" and the benefits can be overstated. There will be ways of measuring this, to project it as huge net benefit (which the cult of the CEO will leap upon) and there will be ways of measuring this to project it as a net loss (rabble rousing developers). All because there is no industry standard measure accepted by both parts of the org that can be pointed at which yields the actual truth (whatever that may be).

If I might add absurd conjecture: We might see interesting knock-on effects like orgs demanding a lowering of review standards in order to get more AI PRs into the source.

[+] Crosseye_Jack|10 months ago|reply

I do love one bot asking another bot to sign a CLA! - https://github.com/dotnet/runtime/pull/115732#issuecomment-2...

[+] margorczynski|10 months ago|reply

With how stochastic the process is it makes it basically unusable for any large scale task. What's the plan? To roll the dice until the answer pops up? That would be maybe viable if there was a way to automatically evaluate it 100% but with a human in the loop required it becomes untenable.

[+] diggan|10 months ago|reply

> What's the plan?

Call me old school, but I find the workflow of "divide and conquer" to be as helpful when working with LLMs, as without them. Although what is needed to be considered a "large scale task" varies by LLMs and implementation. Some models/implementations (seemingly Copilot) struggles with even the smallest change, while others breeze through them. Lots of trial and error is needed to find that line for each model/implementation :/

[+] rsynnott|10 months ago|reply

I suspect that the plan is that MS has spent a lot, really a LOT, of money on this nonsense, and there is now significant pressure to put, something, anything, out even if it is worse than useless.

[+] Traubenfuchs|10 months ago|reply

> to roll the dice

This was discussed here

https://news.ycombinator.com/item?id=43988913

[+] le-mark|10 months ago|reply

The real tragedy is the management mandating this have their eyes clearly set on replacing the very same software engineers with this technology. I don’t know what’s more Kafka than Kafka but this situation certainly is!

[+] automatic6131|10 months ago|reply

Satya said "nearly 30% of code written at microsoft is now written by AI" in an interview with Zuckerberg, so underlings had to hurry to make it true. This is the result. Sad!

[+] rchaud|10 months ago|reply

It's remarkable how similar this feels to the offshoring craze of 20 years ago, where the complaints were that experienced developers were essentially having to train "low-skilled, cheap foreign labour" that were replacing them, eating up time and productivity.

Considering the ire that H1B related topics attract on HN, I wonder if the same outrage will apply to these multi-billion dollar boondoggles.

[+] cebert|10 months ago|reply

Do we know for a fact there are Microsoft employees who were told they have to use CoPilot and review its change suggestions on projects?

We have the option to use GitHub CoPilot on code reviews and it’s comically bad and unhelpful. There isn’t a single member of my team who find it useful for anything other than identifying typos.

[+] einrealist|10 months ago|reply

This is one good example of the Sunk Cost Fallacy: generative AI has cost so much money, acknowledging its shortcomings is now becoming more and more impossible.

This AI bubble is far worse than the Blockchain hype.

Its not yet clear whether productivity gains are real and whether the gains are eaten by a decline in overall quality.

[+] is_true|10 months ago|reply

Today I received the 2nd email about an endpoint in an API we run that doesn't exist but some AI tool told the client it does.

[+] Frost1x|10 months ago|reply

Sounds like the client has a feature request they want to pay for.

[+] bossyTeacher|10 months ago|reply

Every week, one of Google/OpenAI/Anthropic releases a new model, feature or product and it gets posted here with 3 figure comments mostly praising LLMs as the next best thing since the internet. I see a lot of hype on HN about LLMs for software development and how it is going to revolutionize everything. And then, reality looks like this.

I can't help but think that this LLM bubble can't keep growing much longer. The investment to results ratio doesn't look great so far and there is only so many dreams you can sell before institutional investors pull the plug.

[+] vachina|10 months ago|reply

> This seems like it's fixing the symptom rather than the underlying issue?

Exactly. LLM does not know how to use a debugger. LLM does not have runtime contexts.

For all we know, the LLM could’ve fixed the issue simply by commenting out the assertions or sanity checks and everything seemed fine and dandy until every client’s device catches on fire.

[+] aiono|10 months ago|reply

While I am AI skeptic especially for use cases like "writing fixes" I am happy to see this because it will be a great evidence whether it's really providing increase in productivity. And it's all out in the open.

552 comments