I’m biased [0], but I think we should be scripting around LLM-agnostic open source agents. This technology is changing software development at its foundations; we need to ensure we continue to control how we work.
This looks like a good resource. There are some pretty powerful models that will run on an NVIDIA 4090 with 24 GB of VRAM, like Devstral and Qwen 3. Ollama makes it simple to run them on your own hardware, but the cost of the GPU is a significant investment. Then again, if you are paying $250 a month for a proprietary tool, it would pay for itself pretty quickly.
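For anyone wanting to try this, a minimal Ollama setup looks roughly like the following. The model tags are examples and may not match what the Ollama library currently publishes, so check before pulling:

```shell
# Install Ollama (Linux; see ollama.com for macOS/Windows installers),
# then pull and run a local coding model entirely on your own hardware.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull devstral   # Mistral's agentic coding model (~24B)
ollama pull qwen3      # Qwen 3; pick a tag that fits in 24 GB of VRAM
ollama run devstral "Summarize what this function does: fn add(a: i32, b: i32) -> i32 { a + b }"
```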
This article is a bit all over the place. First, a slide deck to describe a codebase is not that useful. There's a reason why no one ever uses a slide deck for anything besides supporting an oral presentation.
Most of these things in the post aren't new capabilities. The automation of workflows is indeed valuable and cool. Not sure what AGI has anything to do with it.
Also I don't trust it. They touched on that I think (I only skimmed).
Plus you shouldn't need an LLM to understand a codebase. Just make it more understandable! Of course capital likes shortcuts and hacks to get the next feature out in Q3.
The number one thing I have found LLMs useful for is producing mermaidjs diagrams of code. They are not always perfect, but the results have been "good enough" many times, and I have never seen hallucinations here, only omissions. If I notice something missing, it's super easy to tell it to amend the diagram.
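For anyone who hasn't tried this, the model emits plain Mermaid source that any renderer (GitHub, mermaid.live, etc.) can display. A hand-written, illustrative example of the kind of diagram it produces for a small codebase:

```mermaid
flowchart TD
    CLI[cli.py] --> Parser[parser.py]
    Parser --> Core[core.py]
    Core --> Store[(storage)]
    Core --> HTTP[api client]
```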
Judging from the tone of the article, they’re using the term AGI in a jokey way and not taking themselves too seriously, which is refreshing.
I mean like, it wouldn’t be refreshing if the article didn’t also have useful information, but I do actually think a slide deck could be a useful way to understand a codebase. It’s exactly the kind of nice-to-have that I’d never want a junior wasting time on, but if it costs like $5 and gets me something minorly useful, that’s pretty cool.
Part of the mind-expanding transition to using LLMs involves recognizing that there are some things we used to dislike because of how much effort they took relative to their worth. But if you don’t need to do the thing yourself or burn through a team member’s time/sanity doing it, it can make you start to go “yeah fuck it, trawl the codebase and try to write a markdown document describing all of the features and requirements in a tabular format. Maybe it’ll go better than I expect, and if it doesn’t then on to something else.”
Great article! I have similar observations and techniques, and Claude Code is exceptionally good. Most days I'm working on multiple things at once (thanks to git worktrees), each going faster than ever. That's really crazy.
For the "sub agents" thing, I must admit that Claude Code calling o3 via sigoden/aichat has saved me countless times!
There are just issues that o3 excels at (race conditions, bug hunting, anything that requires a lot of context and really high reasoning ability).
But I'm using it less since Opus 4 came out. And of course it's not really the sub-agent thing at all.
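For reference, the wiring for that pattern can be as simple as having Claude Code shell out to aichat. The flags follow aichat's `-m client:model` convention, but treat the exact model name as an assumption:

```shell
# Delegate a hard problem to a stronger reasoning model by shelling
# out to sigoden/aichat. The model name is illustrative.
cat src/scheduler.rs | aichat -m openai:o3 \
  "Find the race condition in this code and explain the fix."
```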
Exactly. It has access to literally everything including any MCP server. It's so awesome having claude code check my database using a read-only user, or have it open a puppeteer browser and check whether its CSS changes look weird or not.
It's the perfect interface and anthropic nailed it.
It can even debug my k8s cluster using kubectl commands and check prometheus over the API, how awesome is this?
Sort of, except I think the future of LLMs will be to have the LLM make five separate attempts at a fix in parallel, since LLM time is cheaper than human time... and once you introduce this aspect into the workflow, you'll want to spin up multiple containers, and the benefits of the terminal aren't as strong anymore.
Asking it to explain the Rust borrow checker is one of the worst examples to demonstrate its ability to read code. There are piles of that in its training data.
Agreed, ask it to explain how exceptions are handled in python asyncio tasks, even given all the code, and it will vacillate like the worst intern in the world. What's more, there's no way to "teach" it, and even if there was, it would not last beyond the current context.
A complete waste of time for important but relatively simple tasks.
Such a weird complaint. If you were to explain the rust borrow checker to me, should I complain that it doesn't count because you had read explanations of the borrow checker? That it was "in your training data"? I mean, do you think you just understand the borrow checker without being taught about it in some form?
I mean, I get what you're kind of saying: there isn't much evidence that these tools can generate new ideas, and the sheer amount of knowledge they have obscures the detection of that phenomenon. But practically speaking I don't care, because it is useful and helpful (within its hallucinatory framework).
Assuming attention to detail is one of the best signs people give a fuck about craftsmanship, isn't the fact that Anthropic's legal terms are logically impossible to satisfy a bad sign for their ability to be trusted as careful stewards of ASI?
Not exactly “three laws safe” if we can’t use the thing for work without violating their competitive use prohibition
I can’t speak for their legal department, but their product, Claude Code, bears signs of lavish attention to detail. Right down to running Haiku on the context to come up with cute appropriate verbs for the “working…” indicators.
> Claude Code feels more powerful than Cursor, but why? One of the reasons seems to be its ability to be scripted. At the end of the day, Cursor is an editor, while Claude Code is a Swiss Army knife (on steroids).
Agreed, and I find that I use Claude Code on more than traditional code bases. I run it in my Obsidian vault for all kinds of things. I run it to build local custom keyboard bindings with scripts that publish screenshots to my CDN and give me a markdown link, or to build a program that talks to Ollama to summarize my terminal commands for the last day.
I remember the old days of needing to figure out if the formatting changes I wanted to make to a file were sufficient to build a script or just do them manually - now I just run Claude in the directory and have it done for me. It's useful for so many things.
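For the curious, the non-interactive version of this workflow is Claude Code's print mode (`claude -p`), which runs one prompt against the current directory and exits. The prompts below are just illustrations:

```shell
# One-off tasks without opening an interactive session.
claude -p "Convert every date in notes.md to ISO 8601 format"

# Pipe context in, get a summary out (e.g. a weekly changelog).
git log --since="1 week ago" --oneline | claude -p "Summarize this week's commits as terse bullet points"
```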
The thing is, Claude Code only makes sense if you have the subscription plan. It's prohibitively expensive on the API, and it makes me wonder if $100/month is truly enough. I use it all day every day now, and I must be consuming a whole lot more than my $100 is worth.
Having tried everything I settled on a $100/month Anthropic "Max" plan to use Claude Code. Then I learned how Claude Opus 4 is currently their best but most expensive model for my situation (math code and research). I limited out of a five hour session, switched to their API, and burned $20 in an hour. So I upgraded to $200/month "Max" and haven't hit limits yet.
Models matter. All these stories are like "I met a person who wasn't that smart." Duh!
This article is inspiring. I haven’t had the moment to get my head out of the Cursor + biz logic water until now. Very cool to think about LLMs automagically creating changelogs, testing packaging when dependencies are bumped, forcing unit tests on features.
Is anyone aware of something like this? Maybe in the GitHub actions or pre-commit world?
Gonna be a bit blunt here and ask why hooking up an agentic CLI tool to one or more other software tools is the top post on HN right now... sure, some of these ideas are interesting, but at the end of the day literally all of them have been explored / revisited by various MCP tools (or can be done in more or less scripted / hacked ways, as the author shows here).
I don't know, it just feels like a weird community response to something that is, to me, the equivalent of bash piping...
Not trying to be rude here, but that `last_week.md` is horrible to me. I can't imagine having to read that let alone listen to the computer say it to me. It's so much blah blah and fluff that reads like a bad PR piece. I'd much rather scan through commits of the last week.
I've found this generally with AI summaries...usually their writing style is terrible, and I feel like I cannot really trust them to get the facts right, and reading the original text is often faster and better.
## Instructions
* Be concise
* Use simple sentences. But feel free to use technical jargon.
* Do NOT overexplain basic concepts. Assume the user is technically proficient.
* AVOID flattering, corporate-ish or marketing language. Maintain a neutral viewpoint.
* AVOID vague and / or generic claims which may seem correct but are not substantiated by the context.
You can't completely avoid hallucinations, and it's best to avoid AI for text that's used for human-to-human communication. But this makes AI answers to coding and technical questions easier to read.
I felt the same thing about the onboarding. Like what future are we trying to build for ourselves here, exactly? The kind where instead of sitting down with a coworker to learn about a codebase, instead we get an ai generated PowerPoint to read alone????
Yup, you can always tell LLMs just from the ridiculous output most of the time. Like 8-20 sentences minimum, for the most basic thing.
Even Gemini/gpt4o/etc are all guilty of this. Maybe they'll tighten things up at some point - if I ask an assistant a simple question like "is it possible to put apples into a pie?" what I want is "Yes, it is possible to put apples into a pie. Would you like to know more?"
But not "Yes, absolutely — putting apples into a pie is not only possible, it's classic! Apple pie is one of the most well-known and traditional fruit pies. Typically, sliced apples are mixed with sugar, cinnamon, nutmeg, and sometimes lemon juice or flour, then baked inside a buttery crust. You can use various types of apples depending on the flavor and texture you want (like Granny Smith for tartness or Honeycrisp for sweetness). Would you like a recipe or tips on which apples work best?" (from gpt4).
> Python, a journey that began with an initial commit and evolved through a series of careful refinements to establish a robust foundation for the project.
Wow yeah what a waste. That is exactly the opposite of saving time.
If this was meant to be read, I might've agreed, but:
1) This was supposed to be piped through TTS and listened to in the background, and...
2) People like podcasts.
Your typical podcast is much worse than this. It's "blah blah" and "hahaha <interaction>" and "ooh <emoting>" and "<irrelevant anecdote>" and "<turning facts upside down and injecting a lie for humorous effect>", and maybe some of the actual topic mixed in between, and yet for some reason, people love it.
I honestly doubt this specific thing would be useful for me, but I'm not going to assume it's plain dumb, because again, podcasts are worse, and people love it.
Remember the sycophancy bug? Maybe making the user FEEL GOOD is part of what makes it feel smart or like a good experience. Is the reward function being smart? Is it maximizing interaction? Does that conflict with being accurate?
The vilification of juniors and the abandonment of the idea that teaching and mentoring are worthwhile are single-handedly making me speedrun burnout. May a hundred years of Microsoft Visio befall anybody who thinks that way.
A constant reminder: you can't have wizards without having noobs.
Every wizard was once a noob. No one is born that way; they were forged. It's in everybody's interest to train them. If they leave, you still benefit from the other companies who trained theirs, so the costs even out. And if they keep leaving, there are probably better retention levers you haven't considered (e.g., have you considered not paying new juniors more than a junior who has been with the company for a few years? They should be able to get a pay bump without leaving).
I spent a lot of time in my career, honestly some of the most impactful stuff I've done, mentoring college students and junior developers. I think you are dead on about the skills being very similar. Being verbose, not making assumptions about existing context, and generalized warnings against pitfalls when doing the sort of thing you're asking it to do goes a long long way.
Just make sure you talk to Claude in addition to the humans and not instead of.
As it has been since that was originally published, over three years ago.
I'm continuously surprised both by how fast the models themselves evolve, and how slow their use patterns are. We're still barely playing with the patterns that were obvious and thoroughly discussed back before GPT-4 was a thing.
Right now, the whole industry is obsessed with "agents", a.k.a. giving LLMs function calls and limited control over the loop they're running under. How many years before the industry gets to the point of giving LLMs proper control over the top-level loop and managing the context, plus an ability to "shell out" to "subagents" as a matter of course?
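A toy sketch of that inversion, with the model's decision-making stubbed out as `fake_llm` (all names here are illustrative; no real model API is involved):

```python
# Sketch of an LLM-owned top-level loop: the model (stubbed as
# `fake_llm`) decides each step whether to call a tool, spawn a
# subagent with fresh context, or stop.
def fake_llm(history: list[str]) -> str:
    # A real implementation would send `history` to a model and parse
    # its reply; this stub just walks through a fixed plan.
    plan = ["tool:list_files", "subagent:summarize", "done"]
    return plan[len(history)]

def run_subagent(task: str) -> str:
    # A subagent would be another loop with its own context window.
    return f"subagent finished: {task}"

TOOLS = {"list_files": lambda: "main.py util.py"}

def agent_loop() -> list[str]:
    history: list[str] = []
    while True:
        action = fake_llm(history)
        if action == "done":
            return history
        kind, _, arg = action.partition(":")
        if kind == "tool":
            history.append(TOOLS[arg]())
        elif kind == "subagent":
            history.append(run_subagent(arg))
```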
In general, "reader mode". I don't use Chrome but Google suggests that it's in a menu <https://support.google.com/chrome/answer/14218344?hl=en>. Many Chrome-alikes provide it built-in (Brave calls it Speedreader), and many extensions can add it for you (Readability was the OG one).
I've actually stumbled upon a novel way of using Claude Code that I don't think anybody else is doing that's insanely better. I'll release it soon.
I played around with agents yesterday, now I'm hooked.
I got Claude Code (with Cline and VS Code) to do a task for a personal project. It did it about 5x faster than I'd have been able to do manually, including running bash commands, e.g. to install dependencies for new npm packages.
These things can do real work. If you have things in plain text formats like markdown, CSV spreadsheets, etc., a lot of what normal human employees do today could be somewhat automated.
You currently still need a human to supervise the agent and what it's doing, but that won't be needed in the not-so-distant future.
rbren|8 months ago
[0] https://github.com/all-hands-ai/openhands
robotbikes|8 months ago
handfuloflight|8 months ago
ProofHouse|8 months ago
tinyhouse|8 months ago
bravesoul2|8 months ago
sandos|8 months ago
Uehreka|8 months ago
jumski|8 months ago
I use this prompt @included in the main CLAUDE.md: https://github.com/pgflow-dev/pgflow/blob/main/.claude/advan...
sigoden/aichat: https://github.com/sigoden/aichat
_1tem|8 months ago
jasonthorsness|8 months ago
ed_mercer|8 months ago
drcode|8 months ago
ldjkfkdsjnv|8 months ago
mountainriver|8 months ago
Do you not want to edit your code after it’s generated?
blahgeek|8 months ago
dundarious|8 months ago
gilbetron|8 months ago
bionhoward|8 months ago
alwa|8 months ago
abhisheksp1993|8 months ago
This made me chuckle
SamPatt|8 months ago
Aeolun|8 months ago
jjice|8 months ago
cpard|8 months ago
AstroBen|8 months ago
AstroBen|8 months ago
thunkle|8 months ago
jsjohnst|8 months ago
Syzygies|8 months ago
beigebrucewayne|8 months ago
dirtbag__dad|8 months ago
pjm331|8 months ago
citizenpaul|8 months ago
Yeah now companies that paid lip service to those things can still not have them but pretend they do cause the AI did it....
dweinus|8 months ago
It's at least decent though, right?
> "What emerged over these seven days was more than just code..."
Yeesh, ok, but is it accurate?
> Over time this will likely degrade the performance and truthfulness
Sure, but it's cheap right?
> $250 a month.
Well at least it's not horrible for the environment and built on top of massive copyright violations, right?
Right?
tom_m|8 months ago
citizenpaul|8 months ago
Lol, I guess their AI is too good for a redactor. Better have humans do it.
rikschennink|8 months ago
beigebrucewayne|8 months ago
fullstackchris|8 months ago
mjrbrennan|8 months ago
never_inline|8 months ago
WD-42|8 months ago
I'm so over this timeline.
fennecfoxy|8 months ago
fullstackchris|8 months ago
ozim|8 months ago
block_dagger|8 months ago
TeMPOraL|8 months ago
TZubiri|8 months ago
rsynnott|8 months ago
I suppose preferences differ, but really, does anyone _like_ this sort of writing style?
beigebrucewayne|8 months ago
1. I shouldn't have used a newly created repo that had no real work over the course of the last week.
2. I should have put more time into the prompt to make it sound less like nails on a chalkboard.
hoppp|8 months ago
jvanderbot|8 months ago
tra3|8 months ago
eru|8 months ago
distortionfield|8 months ago
BoredPositron|8 months ago
sorcerer-mar|8 months ago
And since they're human, the juniors themselves do not have the patience of an LLM.
I really would not want to be a junior dev right now... Very unfair and undesirable situation they've landed in.
qsort|8 months ago
godelski|8 months ago
jayofdoom|8 months ago
dwohnitmok|8 months ago
On the other hand, every time people are just spinning off sub-agents I am reminded of this: https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality...
It's simultaneously the obvious next step and portends a potentially very dangerous future.
TeMPOraL|8 months ago
lubujackson|8 months ago
> ${SUGESTION}
And recognized it wouldn't do anything because of a typo? Alas, my kind is not long for this world...
CGamesPlay|8 months ago
konexis007|8 months ago
jilles|8 months ago
johnwheeler|8 months ago
throwawayoldie|8 months ago
aussieguy1234|8 months ago