abdullin|8 months ago
The best part is that AI-driven systems are fine with running even tighter loops than a sane human would tolerate.
E.g. running the full linting, testing and E2E/simulation suite after any minor change. Or generating 4 versions of a PR for the same task so that the human could just pick the best one.
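A minimal sketch of that kind of verification loop, with the concrete checks left as callables since the actual lint/test/E2E commands depend on the project (everything named here is a placeholder):

```python
def verify(checks):
    """Run every check after each change; stop at the first failure.

    checks: list of (name, callable) pairs, e.g. lint, unit tests, E2E suite.
    Returns (True, None) if all pass, else (False, failed_check_name).
    """
    for name, check in checks:
        if not check():
            return (False, name)
    return (True, None)

# Hypothetical wiring, e.g. with subprocess:
# checks = [
#     ("lint", lambda: subprocess.run(["make", "lint"]).returncode == 0),
#     ("tests", lambda: subprocess.run(["make", "test"]).returncode == 0),
# ]
```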
bandoti|8 months ago
1. People get lazy when presented with four choices they had no hand in creating; they don't look over all four, they just click one and ignore the others. Why? Because they have ten more of these on the go at once, diminishing their overall focus.
2. Automated tests, end-to-end simulation, linting, etc.: these tools already exist and work at scale. Ideally they should be robust and THOROUGHLY reviewed by both AI and humans.
3. AI is good for code reviews and “another set of eyes” but man it makes serious mistakes sometimes.
An anecdote for (1): when ChatGPT tries to A/B test me with two answers, it's incredibly burdensome to read virtually the same thing twice with minimal differences.
Code reviewing four things that do almost the same thing is more of a burden than writing the same thing once myself.
abdullin|8 months ago
As such, the task of verification still falls on the shoulders of engineers.
Given that and proper processes, modern tooling works nicely with codebases ranging from 10k LOC (mixed embedded device code with golang backends and python DS/ML) to 700k LOC (legacy enterprise applications from the mainframe era).
tlb|8 months ago
LLMs are worse at many things than human programmers, so you have to try to compensate by leveraging the things they're better at. Don't give up with "they're bad at such and such" until you've tried using their strengths.
diggan|8 months ago
Say you're Human A, working on a feature. Running the full testing suite takes 2 hours from start to finish. Every change you make to existing code needs to be confirmed not to break existing stuff with the full testing suite, so for some changes it takes 2 hours before you know with 100% certainty that nothing else broke. How quickly do you lose interest, and at what point do you give up and either improve the testing suite, or just skip that feature/implement it some other way?
Now say you're Robot A working on the same task. The robot doesn't care if each change takes 2 hours to appear on their screen; the context is exactly the same, and they're still "a helpful assistant" 48 hours later, still trying to get the feature put together without breaking anything.
If you're feeling brave, you start Robot B and C at the same time.
londons_explore|8 months ago
But AI will do a pretty decent job of telling you which tests are most likely to fail on a given PR. Just run those ones, then commit. Cuts your test time from hours down to seconds.
Then run the full test suite only periodically and automatically bisect to find out the cause of any regressions.
Dramatically cuts the compute costs of tests too, which in a big codebase can easily add up to whole engineers' worth of cost.
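One way to sketch the selection step, approximating the "AI predicts likely failures" part with a precomputed coverage map (both the map and the helper are assumptions for illustration, not any real tool's API):

```python
def likely_failing_tests(changed_files, coverage_map):
    """Select tests whose covered source files overlap the PR's changed files.

    coverage_map: test name -> set of source files that test exercises.
    A model-based predictor would rank these; static overlap is the baseline.
    """
    changed = set(changed_files)
    return sorted(test for test, files in coverage_map.items() if files & changed)
```

The periodic full run then catches anything the subset missed, and `git bisect` pinpoints the offending commit.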
abdullin|8 months ago
Cursor.sh agents or especially OpenAI Codex illustrate that a tool doesn't need to keep stuffing the context window with irrelevant information in order to make progress on a task.
And if it's really needed, engineers report that Gemini Pro 2.5 keeps working fine within a 200k-500k token context. Above that, it is better to reset the context.
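A naive version of that reset, keeping the task description and the latest exchanges (a real agent would summarize the dropped middle; the threshold is the reported figure above, and the helper is purely illustrative):

```python
MAX_CONTEXT_TOKENS = 500_000  # reported practical ceiling for long contexts

def reset_context(messages, keep_first=1, keep_last=4):
    """Drop the middle of the conversation, keeping the task description
    (first message) and the most recent exchanges."""
    if len(messages) <= keep_first + keep_last:
        return messages
    return messages[:keep_first] + messages[-keep_last:]
```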
elif|8 months ago
Even if you tell the git-aware Jules to handle a merge conflict within the same context window where the patch was generated, it's like: sorry bro, I have no idea what's wrong, can you send me a diff with the conflict?
I find I have to be in the iteration loop at every stage or else the agent rapidly forgets what it's doing or why. For instance, don't trust Jules to run your full test suite after every change without handholding and asking for specific run results every time.
It feels like, to an LLM, gaslighting you with code that nominally addresses the core of what you just asked, while completely breaking unrelated code or disregarding previously discussed parameters, counts as an unmitigated success.
layer8|8 months ago
Why would a sane human be averse to things happening instantaneously?
latexr|8 months ago
That sounds awful. A truly terrible and demotivating way to work and produce anything of real quality. Why are we doing this to ourselves and embracing it?
A few years ago, it would have been seen as a joke to say "the future of software development will be to have a million monkey interns banging on a million keyboards and submitting a million PRs, then choosing one". Today, it's lauded as a brilliant business and cost-saving idea.
We’re beyond doomed. The first major catastrophe caused by sloppy AI code can’t come soon enough. The sooner it happens, the better chance we have to self-correct.
chamomeal|8 months ago
Does anybody really want to be an assembly line QA reviewer for an automated code factory? Sounds like shit.
Also I can’t really imagine that in the first place. At my current job, each task is like 95% understanding all the little bits, and then 5% writing the code. If you’re reviewing PRs from a bot all day, you’ll still need to understand all the bits before you accept it. So how much time is that really gonna save?
ponector|8 months ago
Not for the cloud provider. AWS bill to the moon!
osigurdson|8 months ago
Agree though on the "pick the best PR" workflow. This is pure model training work and you should be compensated for it.
diggan|8 months ago
You clearly have strong feelings about it, which is fine, but it would be much more interesting to know exactly why it would be terrible and demotivating, and why it cannot produce anything of quality. And what is "real quality", and does that mean "fake quality" exists?
> million monkey interns banging on one million keyboards and submit a million PRs
I'm not sure if you misunderstand LLMs, or the famous "monkeys writing Shakespeare" part, but that example is more about randomness and infinity than about probabilistic machines somewhat working towards a goal with some non-determinism.
> We’re beyond doomed
The good news is that we've been doomed for a long time, yet we persist. If you take a look at how the internet is basically held up by duct-tape at this point, I think you'd feel slightly more comfortable with how crap absolutely everything is. Like 1% of software is actually Good Software while the rest barely works on a good day.
koakuma-chan|8 months ago
This is the right way to work with generative AI, and it already is an extremely common and established practice when working with image generation.