WingNews

bandoti|8 months ago

Here’s a few problems I foresee:

1. People get lazy when presented with four choices they had no hand in creating, and they don’t look over the four and just click one, ignoring the others. Why? Because they have ten more of these on the go at once, diminishing their overall focus.

2. Automated tests, end-to-end sim., linting, etc—tools already exist and work at scale. They should be robust and THOROUGHLY reviewed by both AI and humans ideally.

3. AI is good for code reviews and “another set of eyes” but man it makes serious mistakes sometimes.

An anecdote for (1), when ChatGPT tries to A/B test me with two answers, it’s incredibly burdensome for me to read twice virtually the same thing with minimal differences.

Code reviewing four things that do almost the same thing is more of a burden than writing the same thing once myself.

abdullin|8 months ago

A simple rule applies: "No matter what tool created the code, you are still responsible for what you merge into main".

As such, task of verification, still falls on hands of engineers.

Given that and proper processes, modern tooling works nicely with codebases ranging from 10k LOC (mixed embedded device code with golang backends and python DS/ML) to 700k LOC (legacy enterprise applications from the mainframe era)

eddd-ddde|8 months ago

With lazy people the same applies for everything, code they do write, or code they review from peers. The issue is not the tooling, but the hands.

OvbiousError|8 months ago

I don't think the human is the problem here, but the time it takes to run the full testing suite.

tlb|8 months ago

Yes, and (some near-future) AI is also more patient and better at multitasking than a reasonable human. It can make a change, submit for full fuzzing, and if there's a problem it can continue with the saved context it had when making the change. It can work on 100s of such changes in parallel, while a human trying to do this would mix up the reasons for the change with all the other changes they'd done by the time the fuzzing result came back.

LLMs are worse at many things than human programmers, so you have to try to compensate by leveraging the things they're better at. Don't give up with "they're bad at such and such" until you've tried using their strengths.

abdullin|8 months ago

Humans tend to lack inhumane patience.

diggan|8 months ago

It is kind of a human problem too, although that the full testing suite takes X hours to run is also not fun, but it makes the human problem larger.

Say you're Human A, working on a feature. Running the full testing suite takes 2 hours from start to finish. Every change you do to existing code needs to be confirmed to not break existing stuff with the full testing suite, so some changes it takes 2 hours before you have 100% understanding that it doesn't break other things. How quickly do you lose interest, and at what point do you give up to either improve the testing suite, or just skip that feature/implement it some other way?

Now say you're Robot A working on the same task. The robot doesn't care if each change takes 2 hours to appear on their screen, the context is exactly the same, and they're still "a helpful assistant" 48 hours later when they still try to get the feature put together without breaking anything.

If you're feeling brave, you start Robot B and C at the same time.

londons_explore|8 months ago

The full test suite is probably tens of thousands of tests.

But AI will do a pretty decent job of telling you which tests are most likely to fail on a given PR. Just run those ones, then commit. Cuts your test time from hours down to seconds.

Then run the full test suite only periodically and automatically bisect to find out the cause of any regressions.

Dramatically cuts the compute costs of tests too, which in big codebase can easily become whole-engineers worth of costs.

Byamarro|8 months ago

I work in web dev, so people sometimes hook code formatting as a git commit hook or sometimes even upon file save. The tests are problematic tho. If you work at huge project it's a no go idea at all. If you work at medium then the tests are long enough to block you, but short enough for you not to be able to focus on anything else in the meantime.

9rx|8 months ago

Unless you are doing something crazy like letting the fuzzer run on every change (cache that shit), the full test suite taking a long time suggests that either your isolation points are way too large or you are letting the LLM cross isolated boundaries and "full testing suite" here actually means "multiple full testing suites". The latter is an easy fix: Don't let it. Force it stay within a single isolation zone just like you'd expect of a human. The former is a lot harder to fix, but I suppose ending up there is a strong indicator that you can't trust the human picking the best LLM result in the first place and that maybe this whole thing isn't a good idea for the people in your organization.

yahoozoo|8 months ago

The problem is that every time you run your full automation with linting and tests, you’re filling up the context window more and more. I don’t know how people using Claude do it with its <300k context window. I get the “your message will exceed the length of this chat” message so many times.

diggan|8 months ago

I don't know exactly how Claude works, but the way I work around this with my own stuff is prompting it to not display full outputs ever, and instead temporary redirect the output somewhere then grep from the log-file what it's looking for. So a test run outputting 10K lines of test output and one failure is easily found without polluting the context with 10K lines.

abdullin|8 months ago

Claude's approach is currently a bit dated.

Cursor.sh agents or especially OpenAI Codex illustrate that a tool doesn't need to keep on stuffing context window with irrelevant information in order to make progress on a task.

And if really needed, engineers report that Gemini Pro 2.5 keeps on working fine within 200k-500k token context. Above that - it is better to reset the context.

the_mitsuhiko|8 months ago

I started to use sub agents for that. That does not pollute the context as much

elif|8 months ago

In my experience with Jules and (worse) Codex, juggling multiple pull requests at once is not advised.

Even if you tell the git-aware Jules to handle a merge conflict within the context window the patch was generated, it is like sorry bro I have no idea what's wrong can you send me a diff with the conflict?

I find i have to be in the iteration loop at every stage or else the agent will forget what it's doing or why rapidly. for instance don't trust Jules to run your full test suite after every change without handholding and asking for specific run results every time.

It feels like to an LLM, gaslighting you with code that nominally addresses the core of what you just asked while completely breaking unrelated code or disregarding previously discussed parameters is an unmitigated success.

layer8|8 months ago

> Tight feedback loops are the key in working productively with software. […] even more tight loops than what a sane human would tolerate.

Why would a sane human be averse to things happening instantaneously?

latexr|8 months ago

> Or generating 4 versions of PR for the same task so that the human could just pick the best one.

That sounds awful. A truly terrible and demotivating way to work and produce anything of real quality. Why are we doing this to ourselves and embracing it?

A few years ago, it would have been seen as a joke to say “the future of software development will be to have a million monkey interns banging on one million keyboards and submit a million PRs, then choose one”. Today, it’s lauded as a brilliant business and cost-saving idea.

We’re beyond doomed. The first major catastrophe caused by sloppy AI code can’t come soon enough. The sooner it happens, the better chance we have to self-correct.

chamomeal|8 months ago

I say this all the time!

Does anybody really want to be an assembly line QA reviewer for an automated code factory? Sounds like shit.

Also I can’t really imagine that in the first place. At my current job, each task is like 95% understanding all the little bits, and then 5% writing the code. If you’re reviewing PRs from a bot all day, you’ll still need to understand all the bits before you accept it. So how much time is that really gonna save?

ponector|8 months ago

>That sounds awful.

Not for the cloud provider. AWS bill to the moon!

osigurdson|8 months ago

I'm not sure that AI code has to be sloppy. I've had some success with hand coding some examples and then asking codex to rigorously adhere to prior conventions. This can end up with very self consistent code.

Agree though on the "pick the best PR" workflow. This is pure model training work and you should be compensated for it.

bonoboTP|8 months ago

If it's monkeylike quality and you need a million tries, it's shit. It you need four tries and one of those is top-tier professional programmer quality, then it's good.

diggan|8 months ago

> A truly terrible and demotivating way to work and produce anything of real quality

You clearly have strong feelings about it, which is fine, but it would be much more interesting to know exactly why it would terrible and demotivating, and why it cannot produce anything of quality? And what is "real quality" and does that mean "fake quality" exists?

> million monkey interns banging on one million keyboards and submit a million PRs

I'm not sure if you misunderstand LLMs, or the famous "monkeys writing Shakespeare" part, but that example is more about randomness and infinity than about probabilistic machines somewhat working towards a goal with some non-determinism.

> We’re beyond doomed

The good news is that we've been doomed for a long time, yet we persist. If you take a look at how the internet is basically held up by duct-tape at this point, I think you'd feel slightly more comfortable with how crap absolutely everything is. Like 1% of software is actually Good Software while the rest barely works on a good day.

koakuma-chan|8 months ago

> That sounds awful. A truly terrible and demotivating way to work and produce anything of real quality

This is the right way to work with generative AI, and it already is an extremely common and established practice when working with image generation.

unknown|8 months ago

[deleted]