There's a broad misunderstanding here. Context could be infinite, but the real bottleneck is understanding intent late in a multi-step operation. A human can effectively discard or disregard prior information as the narrow window of focus moves to a new task; LLMs seem incredibly bad at this.
Having more context while remaining unable to focus effectively on the latest task is the real problem.
I think that's the real issue. If the LLM spends a lot of context investigating a bad solution and you redirect it, I notice it has trouble ignoring the maybe 10K tokens of bad exploration context in favor of my ten-line "No, don't do X, explore Y" instead.
Asking, not arguing, but: why can't they? You can give an agent access to its own context and ask it to lobotomize itself like Eternal Sunshine. I just did that with a log ingestion agent (broad search to get the lay of the land, which eats a huge chunk of the context window, then narrow searches for weird stuff it spots, then go back and zap the big log search). I assume this is a normal approach, since someone else suggested it to me.
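The "zap the big log search" idea above can be sketched minimally. This assumes a simple list-of-messages context; `prune_context` and the message shape are hypothetical, not any specific agent framework's API.

```python
# Minimal sketch of self-pruning context: the agent's context is a list of
# messages, and a pruning step replaces the bulky exploratory tool output
# with a stub once the narrow follow-up searches are done. All names here
# are illustrative, not a real agent framework's API.

def prune_context(messages, min_tokens=5000, keep_roles=("system", "user")):
    """Drop large tool outputs while keeping instructions and user turns."""
    pruned = []
    for msg in messages:
        if msg["role"] in keep_roles or msg["tokens"] < min_tokens:
            pruned.append(msg)
        else:
            # Replace the bulk with a one-line stub so the agent still
            # knows the step happened.
            pruned.append({"role": msg["role"], "tokens": 10,
                           "content": f"[pruned {msg['tokens']}-token tool output]"})
    return pruned

context = [
    {"role": "system", "content": "You are a log-analysis agent.", "tokens": 20},
    {"role": "tool", "content": "<80k tokens of broad log search>", "tokens": 80000},
    {"role": "tool", "content": "narrow search: 3 suspicious lines", "tokens": 120},
]
context = prune_context(context)
```

The broad search's 80k tokens collapse to a stub while the narrow findings survive, which is exactly the Eternal Sunshine move described above.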
> A human can effectively discard or disregard prior information as the narrow window of focus moves to a new task, LLMs seem incredibly bad at this.
This is how I designed my LLM chat app (https://github.com/gitsense/chat). I think agents have their place, but I really think if you want to solve complex problems without needlessly burning tokens, you will need a human in the loop to curate the context. I will get to it, but I believe in the same way that we developed different flows for working with Git, we will have different 'Chat Flows' for working with LLMs.
I have an interactive demo at https://chat.gitsense.com which shows how you can narrow the focus of the context for the LLM. Click "Start GitSense Chat Demos" then "Context Engineering & Management" to go through the 30-second demo.
You don't want to discard prior information though. That's the problem with small context windows. Humans don't forget the original request as they ask for more information or go about a long task. Humans may forget parts of information along the way, but not the original goal and important parts. Not unless they have comprehension issues or ADHD, etc.
This isn't a misconception. Context is a limitation. You can effectively have an AI agent build an entire application with a single prompt if it has enough (and the proper) context. The models with 1m context windows do better. Models with small context windows can't even do the task in many cases. I've tested this many, many, many times. It's tedious, but you can find the right model and the right prompts for success.
Humans have a very strong tendency (and have made tremendous collective efforts) to compress context. I'm not a neuroscientist, but I believe it's called "chunking."
Language itself is a highly compressed form of context. When you read "hoist with one's own petard" you don't just think about a literal petard but the context behind the phrase.
I think that's really just a misunderstanding of what "bottleneck" means. A bottleneck isn't an obstacle where overcoming it allows you to realize unlimited potential; a bottleneck is just an obstacle to finding the next constraint.
On actual bottles, without any metaphors, the bottleneck is narrower because human mouths are narrower.
> It needs to understand product and business requirements
Yeah this is the really big one - kind of buried the lede a little there :)
Understanding product and business requirements traditionally means communicating (either via docs and specs or directly with humans) with a bunch of people. One of the differences between a junior and a senior is being able to read between the lines of a GitHub or Jira issue and know that more information needs to be teased out from… somewhere (most likely someone).
I’ve noticed that when working with AI lately I often explicitly tell them “if you need more information or context ask me before writing code”, or variations thereof. Because LLMs, like less experienced engineers, tend to think the only task is to start writing code immediately.
It will get solved though, there’s no magic in it, and LLMs are well equipped by design to communicate!
We stopped hiring a while ago because we were adjusting to "AI". We're planning to start hiring next year, as upper management finally saw the writing on the wall: LLMs won't evolve past junior engineers, and we need to train junior engineers to become mid-level and senior engineers to keep the engine moving.
We're now using LLMs as mere tools (which is what they were meant to be from the get-go) to help us with different tasks, but not to replace us. They understand you need experienced and knowledgeable people who know what they're doing, since LLMs won't learn everything there is to know to manage, improve, and maintain the tech used in our products and services. That sentiment will be the same for doctors, lawyers, etc. Personally, I won't put my life in the hands of any LLM when it comes to finances, health, or personal well-being, for that matter.
If we get AGI, or the more sci-fi one, ASI, then all things will radically change (I'm thinking humanity reaching ASI will be akin to the episode from Love, Death & Robots: "When the Yogurt Took Over"). In the meantime, the hype cycle continues...
> That sentiment will be the same for doctors, lawyers, etc., and personally, I won't put my life in the hands of any LLMs when it comes to finances, health, or personal well-being, for that matter.
I mean, did you try it for those purposes?
I have personally submitted an appeal to court for an issue I was having, for which I would otherwise have had to search almost indefinitely for a lawyer even interested in it.
I also debugged health issues from different angles using AI and was quite successful at it.
I also experimented with the well-being topic and it gave me pretty convincing and mind opening suggestions.
So, all I can say is that it worked out pretty well in my case. I believe it's already transformative in ways we wouldn't even have been able to envision a couple of years ago.
I don't think intelligence is increasing. Arbitrary benchmarks don't reflect real world usage. Even with all the context it could possibly have, these models still miss/hallucinate things. Doesn't make them useless, but saying context is the bottleneck is incorrect.
Agreed. I feel like, in the case of GPT models, 4o was better in most ways than 5 has been. I'm not seeing increases in the quality of anything between the two; 5 feels like a major letdown, honestly. I am constantly reminding it what we're doing lol
I agree, I often see Opus 4.1 and GPT-5 (Thinking) make astoundingly stupid decisions with full confidence, even on trivial tasks requiring minimal context. Assuming they would make better decisions "if only they had more context" is a fallacy.
Gemini 2.5 Pro is okay if you ask it to work on a very tiny problem. That's about it for me, the other models don't even create a convincing facsimile of reasoning.
Context is also a bottleneck in many human-to-human interactions, so this is not surprising. Juniors especially often start by talking about their problems without providing adequate context about what they're trying to accomplish or why they're doing it.
Mind you, I was exactly like that when I started my career and it took quite a while and being on both sides of the conversation to improve. One difference is that it is not so easy to put oneself in the shoes of an LLM. Maybe I will improve with time. So far assuming the LLM is knowledgeable but not very smart has been the most effective strategy for my LLM interactions.
The ICPC is a short (5 hours) timed contest with multiple problems, in which contestants are not allowed to use the internet.
The reason most don't get a perfect score isn't that the tasks themselves are unreasonably difficult, but that they're difficult enough that 5 hours isn't a lot of time to solve so many problems. Additionally, they often require a decent amount of math / comp-sci knowledge, so if you don't have the necessary knowledge you probably won't be able to complete it.
So to get a good score you need lots of math & comp-sci knowledge + you need to be a really quick coder.
Basically the contest is perfect for LLMs because they have a ton of math and comp-sci knowledge, they can spit out code at superhuman speeds, and the problems themselves are fairly small (they take a human maybe 15 minutes to an hour to complete).
Who knows, maybe OP is right and LLMs are smart enough to be super human coders if they just had the right context, but I don't think this example proves their point well at all. These are exactly the types of problems you would expect a supercharged auto-complete would excel at.
If not now, soon, the bottleneck will be responsibility. Where errors in code have real-world impacts, "the agentic system wrote a bug" won't cut it for those with damages.
As these tools make it possible for a single person to do more, it will become increasingly likely that society will be exposed to greater risks than that single person's (or small company's) assets can cover.
These tools already accelerate development enough that those people who direct the tools can no longer state with credibility that they've personally reviewed the code/behavior with reasonable coverage.
It'll take over-extensions of the capability of these tools, of course, before society really notices, but it remains my belief that until the tools themselves can be held liable for the quality of their output, responsibility will become the ultimate bottleneck for their development.
I agree. My speed at reviewing tokens <<<< the LLM's speed at generating them. Perhaps an output -> compile -> test loop will slow things down, but will we ever get to a "no review needed" point?
IMHO, jumping from Level 2 to Level 5 is a matter of:
- Better structured codebases - we need hierarchical codebases with minimal depth, maximal orthogonality and reasonable width. Think microservices.
- Better documentation - most code documentations are not built to handle updates. We need a proper graph structure with few sources of truth that get propagated downstream. Again, some optimal sort of hierarchy is crucial here.
At this point, I really don't think that we necessarily need better agents.
Set up your codebase optimally, spin up 5-10 instances of gpt-5-codex-high for each issue/feature/refactor (pick the best according to some criteria), and your life will go smoothly.
Microservices should already be a last resort when you’ve either:
a) hit technical scale that necessitates it
b) hit organizational complexity that necessitates it
Opting to introduce them sooner will almost certainly increase the complexity of your codebase prematurely (already a hallmark of LLM development).
> Better documentation
If this means reasoning as to why decisions are made then yes. If this means explaining the code then no - code is the best documentation. English is nowhere near as good at describing how to interface with computers.
Given how long gpt-5-codex has been out, there's no way you've followed these practices for a reasonable enough time to consider them definitive (2 years at the least, likely much longer).
I've been using claude on two codebases, one with good layering and clean examples, the other not so much. I get better output from the LLM with good context and clean examples and documentation. Not surprising that clarity in code benefits both humans and machines.
> Level 2 - One commit - Cursor and Claude Code work well for tasks in this size range.
I'll stop ya right there. I've spent the past few weeks fixing bugs in a big multi-tier app (which is what any production software is these days). My output per bug is always one commit, often one line.
Claude is an occasional help, nothing more. Certainly not generating the commit for me!
I'll stop you right there. I've been using Claude Code for almost a year on production software with pretty large codebases. Both multi-repo and monorepo.
Claude is able to create entire PRs for me that are clean, well written, and maintainable.
Can it fail spectacularly? Yes, and it does sometimes. Can it be given good instructions and produce results that feel like magic? Also yes.
This is interesting, and I'd say you're not the target audience. If you want the code Claude writes to be line-by-line what you think is most appropriate as a human, you're not going to get it.
You have to be willing to accept "close-ish and good enough" to what you'd write yourself. I would say that most of the time I spend with Claude is to get from its initial try to "close-ish and good enough". If I was working on tiny changes of just a few lines, it would definitely be faster just to write them myself. It's the hundreds of lines of boilerplate, logging, error handling, etc. that makes the trade-off close to worth it.
While this is sort of true, remember: it's not the size of the context window that matters, it's how you use it.
You need to have the right things in the context, irrelevant stuff is not just wasteful, it is increasingly likely to cause errors. It has been shown a few times that as the context window grows, performance drops.
Heretical I know, but I find that thinking like a human goes a long way to working with AI.
Let's take the example of large migrations. You're not going to load the whole codebase in your brain and figure out what changes to make and then vomit them out into a huge PR. You're going to do it bit by bit, looking up relevant files, making changes to logically-related bits of code, and putting out a PR for each changelist.
This exactly what tools should do as well. At $PAST_JOB my team built a tool based on OpenRewrite (LLMs were just coming up) for large-scale multi-repo migrations and the centerpiece was our internal codesearch tool. Migrations were expressed as a codesearch query + codemod "recipe"; you can imagine how that worked.
That would be the best way to use AI for large-scale changes as well. Find the right snippets of code (and documentation!), load each one into the context of an agent in multiple independent tasks.
Caveat: as I understand it, this was the premise of SourceGraph's earliest forays into AI-assisted coding, but I recall one of their engineers mentioning that this turned out to be much trickier than expected. (This was a year+ back, so eons ago in LLM progress time.)
Just hypothesizing here, but it may have been that the LSIF format does not provide sufficient context. Another company in this space is Moderne (the creators of OpenRewrite) that have a much more comprehensive view of the codebase, and I hear they're having better success with large LLM-based migrations.
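The "codesearch query + codemod recipe" flow described above can be sketched roughly: find each match, then hand each one to an independent, narrowly scoped task rather than loading the whole repo. `search`, `migrate`, and the task shape are toy stand-ins, not a real codesearch tool or LLM API.

```python
# Sketch of large-scale migration as codesearch + per-snippet tasks: each
# match gets its own small context instead of one huge whole-repo prompt.
# Everything here is illustrative pseudocode made runnable, not a real tool.

def search(pattern, files):
    """Toy codesearch: return (path, line_no, line) for every match."""
    hits = []
    for path, text in files.items():
        for i, line in enumerate(text.splitlines(), 1):
            if pattern in line:
                hits.append((path, i, line))
    return hits

def migrate(pattern, recipe, files):
    """One independent, narrowly scoped task per match."""
    tasks = []
    for path, line_no, line in search(pattern, files):
        tasks.append({
            "file": path,
            "line": line_no,
            "context": line,          # only the relevant snippet, not the repo
            "instruction": recipe,
        })
    return tasks

files = {
    "a.py": "import old_http\nold_http.get(url)\n",
    "b.py": "print('no matches here')\n",
}
tasks = migrate("old_http", "Replace old_http calls with new_http.", files)
```

In a real setup each task would be dispatched to its own agent run, which keeps every context small and the changes reviewable as separate PRs.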
I'm making a pretty complex project using claude. I tried claude flow and some other orchestrators but they produced garbage.
I have found that using GitHub issues to track progress as comments works fairly well. The PRs can get large comment-wise (especially if you have Gemini Code Assist, recommended as another code-review judge), so be mindful of that (it will blow the context window). Using a fairly lean CLAUDE.md and a few MCPs (context7, and consult7 with Gemini for longer lookups) works well too. Although be prepared to tell it to reread CLAUDE.md a few conversations deep, as it loses it.
It's working fairly well so far. It feels a bit akin to herding cats sometimes, and be prepared to actually read the code it's making, or at least the important bits.
Your comment reminds me of another one I saw on Reddit. Someone said they found that using GitHub diffs as a way to manage context and reference chat history worked best for their AI agent. I think they're on to something here.
It is pretty clear that long-horizon tasks are difficult for coding agents, and that is a fundamental limitation of how probabilistic word generation works, whether with transformers or any other architecture. The errors propagate, multiply, and become open-ended.
However, the limitation can be masked using layering techniques where the output of one agent is fed as input to another, using consensus for verification or other techniques to the nth degree to minimize errors. But this is a bit like the story of the boy with his finger in the dike. Yes, you can spawn as many boys as you like, but there is an associated cost that keeps growing and won't narrow down.
It has nothing to do with contexts or window of focus or any other human centric metric. This is what the architecture is supposed to do and it does so perfectly.
I gave up building agents as soon as I figured they would never scale beyond context constraint. Increase in memory and compute costs to grow the context size of these things isn't linear.
Replace “coding agent” with “new developer on the team” and this article could be from anytime in the last 50 years. The thing is, a coding agent acts like a newly-arrived developer every time you start it.
Context is a bottleneck for humans as well. We don’t have full context when going through the code because we can’t hold full context.
We summarize context and remember summarizations of it.
Maybe we need to do this with the LLM. Chain of thought sort of does this, but it's not deliberate. The system prompt needs to mark this as a deliberate task: building summaries and notes of the entire code base, so that this summarized context, with its gotchas, can be part of permanent context the same way ChatGPT remembers aspects of you.
The summaries can even be sectioned off and have different levels of access. So if the LLM wants to drill down to a subfolder, it looks at the general summary and then at another summary for the subfolder. It doesn't need to access the full summary for context.
Imagine a hierarchy of system notes and summaries. The LLM decides where to go and what code to read while having specific access to notes it left previously when going through the code. Like the code itself it never reads it all it just access sections of summaries that go along with the code. It’s sort of like code comments.
We also need to program it to change the notes every time it changes the program. And when you change the program without consulting AI, every commit you do the AI also needs to update the notes based off of your changes.
The LLM needs a system prompt that tells it to act like us and remember things like us. We do not memorize and examine full context of anything when we dive into code.
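The hierarchy of notes described above can be sketched as a summary store keyed by path, where the agent only ever loads the summaries on the path from the root to the folder it's working in. The structure, paths, and note text are all made up for illustration.

```python
# Sketch of hierarchical code-base notes: summaries keyed by folder path,
# drilled into top-down instead of loaded wholesale. All content here is
# hypothetical example data.

summaries = {
    "/":             "Web app: api/ serves HTTP, core/ holds business logic.",
    "/api":          "Route handlers. Gotcha: auth middleware order matters.",
    "/core":         "Pure functions only; no I/O in this folder.",
    "/core/billing": "Important: amounts are integer cents, never floats.",
}

def notes_for(path):
    """Return only the summaries on the path from root to `path`."""
    parts = [p for p in path.split("/") if p]
    keys = ["/"] + ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]
    return [summaries[k] for k in keys if k in summaries]

def update_note(path, note):
    """Called on every change so the notes stay in sync with the code."""
    summaries[path] = note
```

Working in `/core/billing`, the agent sees three short notes (root, `/core`, `/core/billing`) rather than the whole map, and `update_note` is the "change the notes every time it changes the program" rule from the comment above.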
They need a proper memory. Imagine you're a very smart, skilled programmer but your memory resets every hour. You could probably get something done by making extensive notes as you go along, but you'll still be smoked by someone who can actually remember what they were doing in the morning. That's the situation these coding agents are in. The fact that they do as well as they do is remarkable, considering.
Agreed. As engineers we build context every time we interact with the codebase. LLMs don't do that.
A good senior engineer has a ton in their head after 6+ months in a codebase. You can spend a lot of time trying to equip Claude Code with the equivalent in the form of CLAUDE.MD, references to docs, etc., but it's a lot of work, and it's not clear that the agents even use it well (yet).
You're projecting a deficiency of the human brain onto computers. Computers have advantages that our brains don't (perfect and large memory); there's no reason to think that we should try to recreate how humans do things.
Why bother with all these summaries if you can just read and remember the code perfectly?
I've noticed that ChatGPT doesn't seem to be very good at understanding elapsed time. I have some long-running threads, and unless I prompt it with elapsed time ("it's now 7 days later") the responses act like it was 1 second after the last message.
I think this might be a good leap for agents: the ability not just to review a doc in its current state, but to keep in context/understanding the full evolution of a document.
I've noticed the same thing with Grok. One time it predicted an X% chance that something would happen by July 31. On August 1, it was still predicting the thing would happen by July 31, just with lower (but non-zero) odds. Their grasp of time is tenuous at best.
This is one cause, but another is that agents are mostly trained on the same sets of problems. There are only so many open-source projects that can be used for training (i.e. benchmarks). There's huge oversampling of a subset of projects like pandas and nothing at all for proprietary datasets. This is a huge problem!
If you want your agent to be really good at working with dates in a functional way or know how to deal with the metric system (as examples), then you need to train on those problems, probably using RFT. The other challenge is that even if you have this problem set in testable fashion running at scale is hard. Some benchmarks have 20k+ test cases and can take well over an hour to run. If you ran each test case sequentially it would take over 2 years to complete.
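The back-of-envelope math above checks out; a quick sanity check on the sequential-run claim:

```python
# 20k test cases at roughly an hour each, run one after another.
cases = 20_000
hours_each = 1
years = cases * hours_each / (24 * 365)  # about 2.3 years
```

which is why running such benchmark suites requires massive parallelism.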
Right now the only company I'm aware of that lets you do that at scale is runloop (disclaimer, I work there).
This has been the case for a while. Attempting to code API connections via Vibe-Coding will leave you pulling your hair out if you don't take the time to scrape all relevant documentation and include said documentation in the prompt. This is the case whether it's major APIs like Shopify, or more niche ones like warehousing software (Cin7 or something similar).
The context pipeline is a major problem in other fields as well, not just programming. In healthcare, the next billion-dollar startup will likely be the one that cracks the personal health pipeline, enabling people to chat with GPT-6 PRO while seamlessly bringing their entire lifetime of health context into every conversation.
These are such silly arguments. It sounds like people looking at a graph of a linear function crossing an exponential one at x=2, y=2 and wondering why the curves don't fit at x=3, y=40.
"Its not the x value that's the problem, its the y value".
You're right, it's not "raw intelligence" that's the bottleneck, because there's none of that in there. The truth is no tweak to any parameter is ever going to make the LLM capable of programming. Just like an exponential curve is always going to outgrow a linear one. You can't tweak the parameters out of that fundamental truth.
I agree, and I think intent behind the code is the most important part in missing context. You can sometimes infer intent from code, but usually code is a snapshot of an expression of an evolving intent.
I've started making sure my codebase is "LLM compatible". This means everything has documentation, and the reasons for doing things a certain way and not another are documented in code. Funnily enough, I do this documentation work with LLMs.
E.g. "Refactor this large file into meaningful smaller components where appropriate, and add code documentation on what each small component is intended to achieve." The LLM can usually handle this well (with some oversight, of course). I also have instructions in the LLM's instructions.md to document each change, and why, in code.
If the LLM does create a regression i also ask the LLM to add code documentation in the code to avoid future regressions, "Important: do not do X here as it will break Y" which again seems to help since the LLM will see that next time right there in the portion of code where it's important.
None of this verbosity in the code itself is harmful to human readers either which is nice. The end result is the codebase becomes much easier for LLMs to work with.
I suspect LLM compatibility may become a metric we measure codebases by in the future, as we learn more and more about how to work with them. Right now LLMs themselves often create code that is poorly LLM-compatible, but with some more documentation in the code itself they can do much better.
In my opinion human beings also do not have unlimited cognitive context. When a person sits down to modify a codebase, they do not read every file in the codebase. Instead they rely on a combination of working memory and documentation to build the high-level and detailed context required to understand the particular components they are modifying or extending, and they make use of abstraction to simplify the context they need to build. The correct design of a coding LLM would require a similar approach to be effective.
I'm working on a project that has now outgrown the context window of even gpt-5 pro. I use code2prompt, and ChatGPT Pro will reject the prompt as too large.
I’ve been trying to use shorter variable names. Maybe I should move unit tests into their own file and ignore them? It’s not idiomatic in Rust though and breaks visibility rules for the modules.
What we really need is for the agent to assemble the required context for the problem space. I suspect this is what coding agents will do if they don’t already.
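That "assemble the required context" step can be sketched as retrieval under a token budget: rank candidate files by relevance to the task and pack the best ones until the budget is spent. The scoring here is a deliberately crude stand-in for whatever retrieval a real agent uses, and all names are hypothetical.

```python
# Sketch of context assembly: instead of dumping the whole repo (which
# overflows even large windows), greedily pack the most task-relevant
# files under a budget. The relevance score is a toy keyword count.

def score(task, text):
    """Toy relevance: count task keywords appearing in the file."""
    words = set(task.lower().split())
    return sum(text.lower().count(w) for w in words)

def assemble_context(task, files, budget_tokens=8000):
    """Greedily pack the most relevant files until the budget is spent."""
    ranked = sorted(files.items(),
                    key=lambda kv: score(task, kv[1]),
                    reverse=True)
    picked, used = [], 0
    for path, text in ranked:
        cost = len(text.split())  # crude token estimate
        if used + cost <= budget_tokens and score(task, text) > 0:
            picked.append(path)
            used += cost
    return picked

files = {
    "parser.rs": "fn parse tokenizer grammar parse parse",
    "render.rs": "fn draw pixels",
    "tests.rs":  "test parse roundtrip",
}
chosen = assemble_context("fix the parse bug", files, budget_tokens=10)
```

The parser and its tests make the cut while the unrelated renderer is left out; a real agent would do the same thing with embeddings or codesearch instead of keyword counts.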
It's both context and memory. If an LLM could keep the entire git history in memory, and each of those git commits had enough context, it could take a new feature and understand the context in which it should live by looking up the history of the feature area in its memory.
I'm really wondering why so many advertising posts disguised as discourse make it to the frontpage, and I assume it's a new Silicon Valley trick, because there is no way the HN community values these so much.
Let me tell you, I'm scared of these tools. With Aider I have the most human-in-the-loop workflow possible: each AI action is easy to undo, readable, and manageable.
However, even here, most of the time I have the AI write a bulk of code, I regret it later.
Most codebase challenges I have are infrastructural problems, where I need to reduce complexity to be able to safely add new functionality or reduce error likelihood. I’m talking solid well named abstractions.
This in the best case is not a lot of code. In general I would always rather try to have less code than more. Well named abstraction layers with good domain driven design is my goal.
When I think of switching to an AI first editor I get physical anxiety because it feels like it will destroy so many coders by leading to massive frustration.
I still think the best way of using AI is literally just chatting with it about your codebase to make sure you follow good practice.
You're on a site that exists to advertise job postings from YC companies, and does not stop people from spamming their personal or professional projects/companies, even when they have no activity here other than self promotion. This is an advertising site.
Here I think the problem with context is that it lives in the minds of business and dev people; not everything is written down, and even translating it into something understandable (prompting) will sometimes be more work than building it as you go with a modern IDE and type-safe languages.
Notably, all of this information would be very helpful if written down as documentation in the first place. Maybe this will encourage people to do that?
I downloaded the app and it failed at the first screen when I set up the models. I agree with the spirit of the blog post but the execution seems lacking.
I know it isn't your question exactly, and you probably know this, but the models for coding-assist tools are generally fine-tunes of models for coding-specific purposes. Example: in OpenAI Codex they use GPT-5-codex.
"And yet, coding agents are nowhere near capable of replacing software developers. Why is that?"
Because you will always need a specialist to drive these tools. You need someone who understands the landscape of software - what's possible, what's not possible, how to select and evaluate the right approach to solve a problem, how to turn messy human needs into unambiguous requirements, how to verify that the produced software actually works.
Provided software developers can grow their field of experience to cover QA and aspects of product management - and learn to effectively use this new breed of coding agents - they'll be just fine.
No, it's not. The limitation is believing a human can define how the agent should recall things. Instead, build tools for the agent to store and retrieve context and then give it a tool to refine and use that recall in the way it sees best fits the objective.
Humans gatekeep, especially in the tech industry, and that is exactly what will limit us improving AI over time. It will only be when we turn over its choices to it that we move beyond all this bullshit.
Here's a project I've been working on for the past 2 weeks. Only yesterday did I unify everything entirely while in Cursor's Claude-4-Sonnet-1M MAX mode, and I am pretty astounded by the results. Cursor's usage dashboard tells me many of my prompts are 700k-1M context, at around $0.60-$0.90 USD each. It adds up fast, but wow, it's extraordinary.
Coding agents choke on our big C++ code-base pretty spectacularly if asked to reference large files.
sdesol|5 months ago
This is how I designed my LLM chat app (https://github.com/gitsense/chat). I think agents have their place, but I really think if you want to solve complex problems without needlessly burning tokens, you will need a human in the loop to curate the context. I will get to it, but I believe in the same way that we developed different flows for working with Git, we will have different 'Chat Flows' for working with LLMs.
I have an interactive demo at https://chat.gitsense.com which shows how you can narrow the focus of the context for the LLM. Click "Start GitSense Chat Demos" then "Context Engineering & Management" to go through the 30 second demo.
tom_m|5 months ago
This isn't a misconception. Context is a limitation. You can effectively have an AI agent build an entire application with a single prompt if it has enough (and the proper) context. The models with 1m context windows do better. Models with small context windows can't even do the task in many cases. I've tested this many, many, many times. It's tedious, but you can find the right model and the right prompts for success.
raincole|5 months ago
Language itself is a highly compressed form of compressed context. Like when you read "hoist with one's own petard" you don't just think about literal petard but the context behind this phrase.
ray__|5 months ago
notatoad|5 months ago
on actual bottles without any metaphors, the bottle neck is narrower because humans mouths are narrower.
sheerun|5 months ago
suninsight|5 months ago
[deleted]
davedx|5 months ago
Yeah this is the really big one - kind of buried the lede a little there :)
Understanding product and business requirements traditionally means communicating (either via docs and specs or directly with humans) with a bunch of people. One of the differences between a junior and senior is being able to read between the lines of a github or jira issue and know that more information needs to be teased out from… somewhere (most likely someone).
I’ve noticed that when working with AI lately I often explicitly tell them “if you need more information or context ask me before writing code”, or variations thereof. Because LLMs, like less experienced engineers, tend to think the only task is to start writing code immediately.
It will get solved though, there’s no magic in it, and LLMs are well equipped by design to communicate!
throwacct|5 months ago
We're now using LLMs as mere tools (which is what it was meant to be from the get-go) to help us with different tasks, etc., but not to replace us, since they understand you need experienced and knowledgeable people to know what they're doing, since they won't learn everything there's to know to manage, improve and maintain tech used in our products and services. That sentiment will be the same for doctors, lawyers, etc., and personally, I won't put my life in the hands of any LLMs when it comes to finances, health, or personal well-being, for that matter.
If we get AGI, or the more sci-fi one, ASI, then all things will radically change (I'm thinking humanity reaching ASI will be akin to the episode from Love, Death & Robots: "When the Yogurt Took Over"). In the meantime, the hype cycle continues...
menaerus|5 months ago
I mean, did you try it for those purposes?
I have personally submitted an appeal to a court for an issue I was having, for which I would otherwise have had to search almost indefinitely for a lawyer even interested in it.
I also investigated health issues from different angles using AI and was quite successful at it.
I also experimented with the well-being topic and it gave me pretty convincing and mind opening suggestions.
So, all I can say is that it worked out pretty well in my case. I believe it's already transformative in ways we wouldn't even have been able to envision a couple of years ago.
asdev|5 months ago
chankstein38|5 months ago
reclusive-sky|5 months ago
Jweb_Guru|5 months ago
musebox35|5 months ago
Mind you, I was exactly like that when I started my career and it took quite a while and being on both sides of the conversation to improve. One difference is that it is not so easy to put oneself in the shoes of an LLM. Maybe I will improve with time. So far assuming the LLM is knowledgeable but not very smart has been the most effective strategy for my LLM interactions.
kypro|5 months ago
The ICPC is a short (5 hours) timed contest with multiple problems, in which contestants are not allowed to use the internet.
The reason most don't get a perfect score isn't that the tasks themselves are unreasonably difficult, but that they're difficult enough that 5 hours isn't a lot of time to solve so many problems. Additionally, they often require a decent amount of math / comp-sci knowledge, so if you don't have the necessary knowledge you probably won't be able to complete them.
So to get a good score you need lots of math & comp-sci knowledge + you need to be a really quick coder.
Basically the contest is perfect for LLMs because they have a ton of math and comp-sci knowledge, they can spit out code at superhuman speeds, and the problems themselves are fairly small (they take a human maybe 15 minutes to an hour to complete).
Who knows, maybe OP is right and LLMs are smart enough to be super human coders if they just had the right context, but I don't think this example proves their point well at all. These are exactly the types of problems you would expect a supercharged auto-complete would excel at.
ISL|5 months ago
As these tools make it possible for a single person to do more, it will become increasingly likely that society will be exposed to greater risks than that single person's (or small company's) assets can cover.
These tools already accelerate development enough that those people who direct the tools can no longer state with credibility that they've personally reviewed the code/behavior with reasonable coverage.
It'll take over-extensions of the capability of these tools, of course, before society really notices, but it remains my belief that until the tools themselves can be held liable for the quality of their output, responsibility will become the ultimate bottleneck for their development.
jimbohn|5 months ago
And who writes the tests?
bhu8|5 months ago
- Better structured codebases - we need hierarchical codebases with minimal depth, maximal orthogonality and reasonable width. Think microservices.
- Better documentation - most code documentations are not built to handle updates. We need a proper graph structure with few sources of truth that get propagated downstream. Again, some optimal sort of hierarchy is crucial here.
At this point, I really don't think that we necessarily need better agents.
Set up your codebase optimally, spin up 5-10 instances of gpt-5-codex-high for each issue/feature/refactor (pick the best result according to some criteria), and your life will go smoothly.
skaosobab|5 months ago
Microservices should already be a last resort when you’ve either: a) hit technical scale that necessitates it b) hit organizational complexity that necessitates it
Opting to introduce them sooner will almost certainly increase the complexity of your codebase prematurely (already a hallmark of LLM development).
> Better documentation
If this means reasoning as to why decisions are made then yes. If this means explaining the code then no - code is the best documentation. English is nowhere near as good at describing how to interface with computers.
Given how long gpt-5-codex has been out, there's no way you've followed these practices for long enough to consider them definitive (2 years at the least, likely much longer).
lomase|5 months ago
perplex|5 months ago
marstall|5 months ago
I'll stop ya right there. I've spent the past few weeks fixing bugs in a big multi-tier app (which is what any production software is these days). My output per bug is always one commit, often one line.
Claude is an occasional help, nothing more. Certainly not generating the commit for me!
SparkyMcUnicorn|5 months ago
Claude is able to create entire PRs for me that are clean, well written, and maintainable.
Can it fail spectacularly? Yes, and it does sometimes. Can it be given good instructions and produce results that feel like magic? Also yes.
agf|5 months ago
You have to be willing to accept "close-ish and good enough" to what you'd write yourself. I would say that most of the time I spend with Claude is to get from its initial try to "close-ish and good enough". If I was working on tiny changes of just a few lines, it would definitely be faster just to write them myself. It's the hundreds of lines of boilerplate, logging, error handling, etc. that makes the trade-off close to worth it.
unknown|5 months ago
[deleted]
keeda|5 months ago
You need to have the right things in the context, irrelevant stuff is not just wasteful, it is increasingly likely to cause errors. It has been shown a few times that as the context window grows, performance drops.
Heretical, I know, but I find that thinking like a human goes a long way toward working with AI.
Let's take the example of large migrations. You're not going to load the whole codebase in your brain and figure out what changes to make and then vomit them out into a huge PR. You're going to do it bit by bit, looking up relevant files, making changes to logically-related bits of code, and putting out a PR for each changelist.
This is exactly what tools should do as well. At $PAST_JOB my team built a tool based on OpenRewrite (LLMs were just coming up) for large-scale multi-repo migrations, and the centerpiece was our internal codesearch tool. Migrations were expressed as a codesearch query + codemod "recipe"; you can imagine how that worked.
That would be the best way to use AI for large-scale changes as well. Find the right snippets of code (and documentation!), load each one into the context of an agent in multiple independent tasks.
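That flow could be sketched roughly as follows. This is a minimal illustration of the "codesearch query + per-snippet recipe" idea, not a real API: `search_code` and `run_agent` are stubs standing in for an actual codesearch tool and an LLM agent call.

```python
# Sketch: one small, independent agent task per code-search match,
# so each task gets a fresh, narrow context. All names are hypothetical.

def search_code(query, repos):
    # Stub: pretend every repo has exactly one file matching the query.
    return [f"{repo}/src/legacy.py" for repo in repos]

def run_agent(prompt):
    # Stub: a real implementation would call an LLM with a fresh context.
    return f"patch for: {prompt.splitlines()[-1]}"

def migrate(query, recipe, repos):
    """Run one focused agent task per matching snippet, collecting patches."""
    patches = []
    for hit in search_code(query, repos):
        prompt = f"Apply recipe: {recipe}\nFile: {hit}"
        patches.append(run_agent(prompt))  # fresh context each time
    return patches

patches = migrate("old_logging_api",
                  "replace log.warn with log.warning",
                  ["repo-a", "repo-b"])
print(len(patches))
```

The point of the structure is that no single context ever holds the whole codebase; each task sees only the snippet (and documentation) relevant to it.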
Caveat: as I understand it, this was the premise of SourceGraph's earliest forays into AI-assisted coding, but I recall one of their engineers mentioning that this turned out to be much trickier than expected. (This was a year+ back, so eons ago in LLM progress time.)
Just hypothesizing here, but it may have been that the LSIF format does not provide sufficient context. Another company in this space is Moderne (the creators of OpenRewrite) that have a much more comprehensive view of the codebase, and I hear they're having better success with large LLM-based migrations.
_joel|5 months ago
nowittyusername|5 months ago
cuttothechase|5 months ago
However, the limitation can be masked using layering techniques, where the output of one agent is fed as input to another, using consensus for verification or other techniques, to the nth degree to minimize errors. But this is a bit like the story of the boy with his finger in the dike. Yes, you can spawn as many boys as you like, but there is an associated cost that keeps growing and won't narrow down.
It has nothing to do with contexts or window of focus or any other human centric metric. This is what the architecture is supposed to do and it does so perfectly.
hirako2000|5 months ago
I gave up building agents as soon as I figured they would never scale beyond the context constraint. The increase in memory and compute cost needed to grow the context size of these things isn't linear.
wrs|5 months ago
ninetyninenine|5 months ago
We summarize context and remember summarizations of it.
Maybe we need to do this with the LLM. Chain of thought sort of does this, but it's not deliberate. The system prompt needs to make this a deliberate task: building summaries and notes of the entire code base, with gotchas, and keeping that summarized context permanent, the same way ChatGPT remembers aspects of you.
The summaries can even be sectioned off and given different levels of access. So if the LLM wants to drill down into a subfolder, it looks at the general summary and then at another summary for that subfolder. It doesn't need the full summary in context.
Imagine a hierarchy of system notes and summaries. The LLM decides where to go and what code to read while having access to the notes it left previously when going through the code. Like the code itself, it never reads it all; it just accesses the sections of summaries that go along with the code. It's sort of like code comments.
We also need to program it to update the notes every time it changes the program. And when you change the program without consulting the AI, on every commit the AI needs to update the notes based on your changes.
The LLM needs a system prompt that tells it to act like us and remember things like us. We do not memorize and examine full context of anything when we dive into code.
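That hierarchy of notes could look something like this. A minimal sketch using a nested dict; the folder names and "gotcha" notes are invented, and a real system would persist these and refresh them on every commit.

```python
# Sketch: hierarchical code-base notes. Drilling down one path loads only
# the summaries along that path, never the whole tree. All content invented.

NOTES = {
    "summary": "Payment service: API layer, billing core, shared utils.",
    "children": {
        "billing": {
            "summary": "Invoice math. Gotcha: amounts are integer cents.",
            "children": {},
        },
        "api": {
            "summary": "HTTP handlers. Gotcha: auth middleware order matters.",
            "children": {},
        },
    },
}

def load_context(path):
    """Walk the hierarchy, collecting only the summaries along one path."""
    node = NOTES
    context = [node["summary"]]
    for part in path:
        node = node["children"][part]
        context.append(node["summary"])
    return context

print(load_context(["billing"]))
```

An agent working on a billing change would load two short summaries here instead of every note in the tree, which is the whole point of the hierarchy.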
hirako2000|5 months ago
We do take notes, we summarize our writings, that's a process. But the brain does not follow that primitive process to "scale".
wat10000|5 months ago
nsedlet|5 months ago
A good senior engineer has a ton in their head after 6+ months in a codebase. You can spend a lot of time trying to equip Claude Code with the equivalent in the form of CLAUDE.MD, references to docs, etc., but it's a lot of work, and it's not clear that the agents even use it well (yet).
maerF0x0|5 months ago
yes, and if you're an engineering manager you retain _out of date_ summarizations, often materially out of date.
anthonypasq|5 months ago
Why would you bother with all these summaries if you can just read and remember the code perfectly?
maerF0x0|5 months ago
I think this might be a good leap for agents: the ability to not just review a doc in its current state, but to keep in context/understanding the full evolution of a document.
wat10000|5 months ago
HankStallone|5 months ago
mpalmer|5 months ago
revel|5 months ago
If you want your agent to be really good at working with dates in a functional way, or to know how to deal with the metric system (as examples), then you need to train on those problems, probably using RFT. The other challenge is that even if you have the problem set in testable form, running it at scale is hard. Some benchmarks have 20k+ test cases, and a single case can take well over an hour to run; run sequentially, that's over 2 years to complete.
Right now the only company I'm aware of that lets you do that at scale is runloop (disclaimer, I work there).
EcommerceFlow|5 months ago
The context pipeline is a major problem in other fields as well, not just programming. In healthcare, the next billion-dollar startup will likely be the one that cracks the personal health pipeline, enabling people to chat with GPT-6 PRO while seamlessly bringing their entire lifetime of health context into every conversation.
delusional|5 months ago
"Its not the x value that's the problem, its the y value".
You're right, it's not "raw intelligence" that's the bottleneck, because there's none of that in there. The truth is no tweak to any parameter is ever going to make the LLM capable of programming. Just like an exponential curve is always going to outgrow a linear one. You can't tweak the parameters out of that fundamental truth.
alastairr|5 months ago
AnotherGoodName|5 months ago
Eg. "Refactor this large file into meaningful smaller components where appropriate and add code documentation on what each small component is intended to achieve." The LLM can usually handle this well (with some oversight of course). I also have instructions to document each change and why in code in the LLMs instructions.md
If the LLM does create a regression i also ask the LLM to add code documentation in the code to avoid future regressions, "Important: do not do X here as it will break Y" which again seems to help since the LLM will see that next time right there in the portion of code where it's important.
None of this verbosity in the code itself is harmful to human readers either which is nice. The end result is the codebase becomes much easier for LLMs to work with.
I suspect LLM compatibility may be a metric we measure codebases in the future as we learn more and more how to work with them. Right now LLMs themselves often create very poor LLM compatible code but by adding some more documentation in the code itself they can do much better.
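The "leave the warning right where it matters" pattern described above is easy to illustrate. This is a toy example; the function, the ordering constraint, and the bug number are all invented.

```python
# Sketch: an in-code regression note placed at the fragile spot, so any
# LLM (or human) editing this function sees the constraint immediately.

def normalize_ids(ids):
    # Important: do not sort here. Downstream batching relies on the
    # original insertion order; sorting reintroduced a past regression.
    return [str(i).strip().lower() for i in ids]

print(normalize_ids([" A1", "b2 "]))
```

Because the warning lives in the code rather than in a separate doc, it lands in the model's context exactly when the relevant lines do.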
binary132|5 months ago
J_Shelby_J|5 months ago
I’ve been trying to use shorter variable names. Maybe I should move unit tests into their own file and ignore them? It’s not idiomatic in Rust though and breaks visibility rules for the modules.
What we really need is for the agent to assemble the required context for the problem space. I suspect this is what coding agents will do if they don’t already.
999900000999|5 months ago
I started writing a solution, but to be honest I probably need the help of someone who's more experienced.
Though I'm sure someone with VC money is already working on this.
maherbeg|5 months ago
_pdp_|5 months ago
heyrhett|5 months ago
MCP can use 10k tokens. Everything good happens in the first 100k tokens.
It's more context efficient to code a custom binary and prompt the LLM how to use the binary when needed.
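One way to picture the trade-off: instead of loading a multi-thousand-token MCP tool schema up front, give the model a few lines of usage text for a small purpose-built CLI. The tool name and help text here are invented for illustration.

```python
# Sketch: a ~40-token CLI usage blurb in the prompt, in place of a full
# MCP tool definition. "loggrep" is a hypothetical custom binary.

TOOL_HELP = """\
loggrep: search ingested logs.
usage: loggrep --since 1h --level error [PATTERN]
"""

def build_prompt(task):
    # Only the short help text enters the context, not a whole schema.
    return (f"{TOOL_HELP}\n"
            f"When you need logs, emit a loggrep command.\n"
            f"Task: {task}")

print(build_prompt("find failed payment webhooks"))
```

The model learns the tool from a few lines and emits shell commands when needed, keeping the valuable early portion of the context window free for the actual task.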
jwpapi|5 months ago
Let me tell you, I'm scared of these tools. With Aider I have the most human-in-the-loop possible: each AI action is easy to undo, readable, and manageable.
However, even here, most of the time when I have AI write the bulk of the code, I regret it later.
Most codebase challenges I have are infrastructural problems, where I need to reduce complexity to be able to safely add new functionality or reduce error likelihood. I’m talking solid well named abstractions.
This in the best case is not a lot of code. In general I would always rather try to have less code than more. Well named abstraction layers with good domain driven design is my goal.
When I think of switching to an AI first editor I get physical anxiety because it feels like it will destroy so many coders by leading to massive frustration.
I still think the best way of using AI is literally just chatting with it about your codebase to make sure you follow good practice.
add-sub-mul-div|5 months ago
jwpapi|5 months ago
ravenical|5 months ago
luckydata|5 months ago
lowbloodsugar|5 months ago
koakuma-chan|5 months ago
CardenB|5 months ago
__alexs|5 months ago
anthonypasq|5 months ago
gtsop|5 months ago
Are we still calling it intelligence?
hatefulmoron|5 months ago
fortyseven|5 months ago
qaq|5 months ago
lxe|5 months ago
simonw|5 months ago
Because you will always need a specialist to drive these tools. You need someone who understands the landscape of software - what's possible, what's not possible, how to select and evaluate the right approach to solve a problem, how to turn messy human needs into unambiguous requirements, how to verify that the produced software actually works.
Provided software developers can grow their field of experience to cover QA and aspects of product management - and learn to effectively use this new breed of coding agents - they'll be just fine.
kordlessagain|5 months ago
Humans gatekeep, especially in the tech industry, and that is exactly what will limit us improving AI over time. It will only be when we turn its choices over to it that we move beyond all this bullshit.
lerp-io|5 months ago
aiviewz|5 months ago
bilbo-b-baggins|5 months ago
KeatonDunsford|5 months ago
https://github.com/foolsgoldtoshi-star/foolsgoldtoshi-star-p...
_ _ kae3g