I worry about the "brain atrophy" part, as I've felt this too. And not just atrophy; even more so, I think it's evolving into "complacency".
Like there have been multiple times now where I wanted the code to look a certain way, but it kept pulling back to the way it wanted to do things. Like if I had stated certain design goals recently it would adhere to them, but after a few iterations it would forget again and go back to its original approach, or mix the two, or whatever. Eventually it was easier just to quit fighting it and let it do things the way it wanted.
What I've seen is that after the initial dopamine rush of being able to do things that would have taken much longer manually, a few iterations of this kind of interaction have slowly led to disillusionment with the whole project, as AI keeps pushing it in a direction I didn't want.
I think this is especially true if you're trying to experiment with new approaches to things. LLMs are, by definition, biased by what was in their training data. You can shock them out of it momentarily, which is awesome for a few rounds, but over time the gravitational pull of what's already in their latent space becomes inescapable. (I picture it as working like a giant Sierpinski triangle).
I want to say the end result is very akin to doom scrolling. Doom tabbing? It's like, yeah I could be more creative with just a tad more effort, but the AI is already running and the bar to seeing what the AI will do next is so low, so....
> LLM coding will split up engineers based on those who primarily liked coding and those who primarily liked building.
I’ve always said I’m a builder even though I’ve also enjoyed programming (but for an outcome, never for the sake of the code)
This perfectly sums up what I’ve been observing between people like me (builders) who are ecstatic about this new world and programmers who talk about the craft of programming, sometimes butting heads.
One viewpoint isn’t necessarily more valid, just a difference of wiring.
I retired from paid sw dev work in 2020 when COVID arrived.
I’ve worked on my small projects since with all development by hand. I’d followed the rise of AI, but not used it.
Late last year I started a project that included reverse engineering some firmware that runs on an Intel 8096 based embedded processor. I’d never worked on that processor before. There are tools available, but they cost many $. So, I started to think about a simple disassembler.
2 weeks ago we decided to try Claude to see what it could do. We now have a disassembler, assembler and a partially working emulator. No doubt there are bugs and missing features and the code is a bit messy, but boy has it sped up the work.
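For anyone curious, the skeleton of a simple table-driven disassembler is roughly the shape below (a sketch only; the opcode entries are made-up placeholders, not real Intel 8096 encodings).

```python
# Skeleton of a table-driven disassembler -- the general shape of such a tool.
# NOTE: these opcode entries are illustrative placeholders, NOT real 8096 encodings.
OPCODES = {
    0x00: ("NOP", 0),  # (mnemonic, number of operand bytes)
    0x10: ("LD",  2),
    0x20: ("ADD", 2),
}

def disassemble(code: bytes, origin: int = 0) -> list[str]:
    lines, pc = [], 0
    while pc < len(code):
        op = code[pc]
        # Unknown opcodes fall back to a raw data byte directive.
        mnemonic, nargs = OPCODES.get(op, (f"DB 0x{op:02X}", 0))
        operands = code[pc + 1 : pc + 1 + nargs]
        args = ", ".join(f"0x{b:02X}" for b in operands)
        lines.append(f"{origin + pc:04X}: {mnemonic} {args}".rstrip())
        pc += 1 + nargs  # advance past opcode plus its operand bytes
    return lines
```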
One thing did occur to me. Vendors of small utilities could be in trouble. For example, I needed to cut some pages out of a PDF. I could have found a tool online (I'm sure there are several) or written one myself. However, Claude quickly performed the task.
> You realize that stamina is a core bottleneck to work
There has been a lot of research showing that grit is far more correlated with success than intelligence is. This is an interesting way to show something similar.
AIs have endless grit (or at least as endless as your budget). They may outperform us simply because they don't ever get tired and give up.
Full quote for context:
Tenacity. It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later. You realize that stamina is a core bottleneck to work and that with LLMs in hand it has been dramatically increased.
> What happens to the "10X engineer" - the ratio of productivity between the mean and the max engineer? It's quite possible that this grows a lot.
I was thinking about this the other day as it relates to the DevOps movement.
The DevOps movement started as a way to accelerate and improve the results of dev<->ops team dynamics. By changing practices and methods, you get acceleration and improvement. That creates "high-performing teams", which is the team form of a 10x engineer. Whether or not you believe in '10x engineers', a high-performing team is real. You really can make your team deploy faster, with fewer bugs. You have to change how you all work to accomplish it, though.
To get good at using AI for coding, you have to do the same thing: continuous improvement, changing workflows, different designs, development of trust through automation and validation. Just like DevOps, this requires learning brand new concepts, and changing how a whole team works. This didn't get adopted widely with DevOps because nobody wanted to learn new things or change how they work. So it's possible people won't adapt to the "better" way of using AI for coding, even if it would produce a 10x result.
If we want this new way of working to stick, it's going to require education, and a change of engineering culture.
I'm pretty happy with Copilot in VS Code. I type the change I want Claude to make in the Copilot panel, then use VS Code's in-context diffs to accept or reject the proposed changes, while still being able to make other small changes on my own.
So I think this tracks with Karpathy's defense of IDEs still being necessary?
Has anyone found it practical to forgo IDEs almost entirely?
> Tenacity. It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day.
This is true to an extent for sure and they will go much longer than most engineers without getting "tired", but I've def seen both sonnet and opus give up multiple times. They've updated code to skip tests they couldn't get to pass, given up on bugs they couldn't track down, etc. I literally had it ask "could we work on something else and come back to this"
The glorified autocomplete. Why would the LLM "work on something else then get back on this"? Is its subconscious going to solve the problem during that time?
But because people say it, it says it too. Making sense is optional.
LLM coding splits up engineers based on those who primarily like building and those who primarily like code reviews and quality assessment. I definitely don’t love the latter (especially when reviewing decisions not made by a human with whom I can build long-term personal rapport).
After a certain experience threshold of making things from scratch, "coding" (never particularly liked that term) has always been 99% building, or architecture. I struggle to see how often a well-architected solution today, with modern high-level abstractions, requires so much code that you'd save significant time and effort by not having to simply type, possibly with basic deterministic autocomplete, exactly what you mean (especially considering you would also have to spend time and effort reviewing whatever was typed for you if you used a non-deterministic autocomplete).
"those who primarily like code reviews and quality assessment" -- I don't love those. In fact I find it tedious and love it when I can work on my own without them.
Except after 25 years of working I know how imperative they are, how easily a project can disintegrate into confused silos, and am frustrated as heck with these tools being pushed without attention to this problem.
See, I don't take it that extreme: LLMs make fantastic, never-before-seen quality autocompletes. I hacked together a Neovim plugin that prompts an LLM to "finish this function" on command, and it's a big time saver for the menial plumbing type operations. Think things like "this API I use expects JSON that encodes some subset of SQL, I want all the dogs with Ls in their name that were born on a Tuesday". Given an example of such an API (or if the documentation ended up in its training), LLMs will consistently one-shot stuff like that.
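The core of it is a single prompt round-trip. A minimal sketch of the idea in Python (rather than the actual Neovim/Lua glue; the model name and prompt wording are illustrative):

```python
# Sketch of the "finish this function" call -- the idea, not the actual plugin.
# Assumes an OpenAI-compatible chat API; names here are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def finish_function(buffer_up_to_cursor: str, language: str = "python") -> str:
    """Send everything up to the cursor and ask for the rest of the function."""
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # any capable model works here
        messages=[
            {"role": "system", "content":
                f"You are a {language} autocomplete. Return only the code "
                "that completes the last, unfinished function. "
                "No commentary, no markdown fences."},
            {"role": "user", "content": buffer_up_to_cursor},
        ],
        temperature=0,  # menial plumbing should be boring and repeatable
    )
    return resp.choices[0].message.content
```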
Asking it to do entire projects? Dumb. You end up with spaghetti, unless you hand-hold it to the point that you might as well be using my autocomplete method.
HN should ban any discussion on "things I learned playing with AI" that doesn't include direct artifacts of the thing built.
We’re about a year deep into “AI is changing everything” and I don’t see 10x software quality or output.
Now don’t get me wrong I’m a big fan of AI tooling and think it does meaningfully increase value. But I’m damn tired of all the talk with literally nothing to show for it or back it up.
> It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later.
This is true... Equally I've seen it dive into a rabbit hole, make some changes that probably aren't the right direction... and then keep digging.
This is way more likely with Sonnet, Opus seems to be better at avoiding it. Sonnet would happily modify every file in the codebase trying to get a type error to go away. If I prompt "wait, are you off track?" it can usually course correct. Again, Opus seems way better at that part too.
Admittedly this has improved a lot lately overall.
> It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later.
Somewhere, there are GPUs/NPUs running hot. You send all the necessary data, including information that you would never otherwise share. And you most likely do not pay the actual costs. It might become cheaper or it might not, because reasoning is a sticking plaster on the accuracy problem. You and your business become dependent on this major gatekeeper. It may seem like a good trade-off today. However, the personal, professional, political and societal issues will become increasingly difficult to overlook.
This quote stuck out to me as well, for a slightly different reason.
The “tenacity” referenced here has been, in my opinion, the key ingredient in the secret sauce of a successful career in tech, at least in these past 20 years. Every industry job has its intricacies, but for every engineer who earned their pay with novel work on a new protocol, framework, or paradigm, there were 10 or more providing value by putting the myriad pieces together, muddling through the ever-waxing complexity, and crucially never saying die.
We all saw others weeded out along the way for lacking the tenacity. Think the boot camp dropouts or undergrads who changed majors when first grappling with recursion (or emacs). The sole trait of stubbornness to “keep going” outweighs analytical ability, leetcode prowess, soft skills like corporate political tact, and everything else.
I can’t tell what this means for the job market. Tenacity may not be enough on its own. But it’s the most valuable quality in an employee in my mind, and Claude has it.
I still find in these instances there's at least a 50% chance it has taken a shortcut somewhere: created a new, bigger bug in something that just happened not to have a unit test covering it, or broke an "implicit" requirement that was so obvious to any reasonable human that nobody thought to document it. These can be subtle because you're not looking for them, because no human would ever think to do such a thing.
Then even if you do catch it, AI: "ah, now I see exactly the problem. just insert a few more coins and I'll fix it for real this time, I promise!"
Agree with Karpathy's take. Finally a down-to-earth analysis from a respected source in the AI space. I guess I'll be using slopocalypse a lot more now :)
> I am bracing for 2026 as the year of the slopacolypse across all of github, substack, arxiv, X/instagram, and generally all digital media
It has arrived. Github will be most affected thanks to git-terrorists at Apna College refusing to take down that stupid tutorial. IYKYK.
I would agree that OAI's GPT-5 family of models is a phase change over GPT-4.
In the ChatGPT product this is not immediately obvious and many people would strongly argue their preference for 4. However, once you introduce several complex tools and make tool calling mandatory, the difference becomes stark.
I've got an agent loop that will fail nearly every time on GPT-4. It works sometimes, but definitely not enough to go to production. GPT-5 with reasoning set to minimal works 100% of the time. $200 worth of tokens and it still hasn't failed to select the proper sequence of tools. It sometimes gets the arguments to the tools incorrect, but it's always holding the right ones now.
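The loop itself is nothing exotic. A minimal sketch with the OpenAI Python SDK (the tool schema and dispatch here are illustrative stand-ins, not my production set; `reasoning_effort="minimal"` is the knob I mean by "reasoning set to minimal"):

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative tool schema -- a stand-in, not the actual production tool set.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch an order record by id.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def run_tool(name: str, args: dict) -> str:
    # Dispatch to real implementations here; stubbed for the sketch.
    return json.dumps({"tool": name, "args": args, "status": "stub"})

def agent_loop(task: str, model: str = "gpt-5") -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=TOOLS,
            reasoning_effort="minimal",
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:   # no more tools requested: final answer
            return msg.content
        messages.append(msg)     # keep the assistant turn in the history
        for call in msg.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_tool(call.function.name,
                                    json.loads(call.function.arguments)),
            })
```

The failure mode on GPT-4 was in that tool-selection step; GPT-5 holds the right tools even when the arguments wobble.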
I was very skeptical based upon prior experience but flipping between the models makes it clear there has been recent stepwise progress.
I'll probably be $500 deep in tokens before the end of the month. I could barely go $20 before I called bullshit on this stuff last time.
> Atrophy. I've already noticed that I am slowly starting to atrophy my ability to write code manually...
> Largely due to all the little mostly syntactic details involved in programming, you can review code just fine even if you struggle to write it.
Until you struggle to review it as well. A simple exercise to prove it: ask an LLM to write a function in a familiar programming language, but in an area you haven't invested in learning and coding yourself. Try reviewing some code involving embedding/SIMD/FPGA without learning it first.
> It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later.
The bits left unsaid:
1. Burning tokens, which we charge you for
2. My CPU does this when I tell it to do bogosort on a million 32-bit integers; it doesn't mean it's a good thing
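(For reference, bogosort is "shuffle until sorted" - relentless effort with no judgment. A sketch:)

```python
import random

def bogosort(xs: list[int]) -> list[int]:
    """Shuffle until sorted: endless stamina, zero progress guarantee."""
    while any(a > b for a, b in zip(xs, xs[1:])):  # not sorted yet?
        random.shuffle(xs)                          # keep trying, forever if needed
    return xs

# Expected work grows like n * n!; the CPU never gets demoralized either.
```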
I wish the people who wrote this would let us know what kind of codebases they are working on. They seem mostly useless in a sufficiently large codebase, especially when it's messy and interactions aren't always obvious. I don't know how much better Claude is than ChatGPT, but I can't get ChatGPT to do much useful with an existing large codebase.
I've been working in the mobile space since 2009, though primarily as a designer and then product manager. I work in kind of a hybrid engineering/PM job now, and have never been a particularly strong programmer. I definitely wouldn't have thought I could make something with that polish, let alone in 3 months.
That code base is ~98% Claude code.
Claude and Codex are CLI tools you use to give the LLM context about the project on your local machine or dev environment. The fact that you're using the name "ChatGPT" instead of Codex leads me to believe you're talking about using the web-based ChatGPT interface to work on a large codebase, which is completely beside the point of the entire discussion. That's not the tool anyone is talking about here.
It's important to understand that he's talking about a specific set of models that were released around November/December, and that we've hit a kind of inflection point in model capabilities. Specifically Anthropic's Opus 4.5 model.
I never paid any attention to different models, because they all felt roughly equal to me. But Opus 4.5 is really and truly different. It's not a qualitative difference, it's more like it just finally hit that quantitative edge that allows me to lean much more heavily on it for routine work.
I highly suggest trying it out, alongside a well-built coding agent like the one offered by Claude Code, Cursor, or OpenCode. I'm using it on a fairly complex monorepo and my impressions are much the same as Karpathy's.
I'm not sure how big your repos are but I've been effective working with repos that have thousands of files and tens of thousands of lines of code.
If you're just prototyping it will hit a wall when things get unwieldy, but that's normally a sign that you need to refactor a bit.
Super strict compiler settings, static analysis, comprehensive tests, and documentation help a lot. As does basic technical design. After a big feature is shipped I do a refactor cycle with the LLM where we do a comprehensive code review and patch things up. This does require human oversight because the LLMs are still lacking judgement on what makes for good code design.
The places where I've seen them be useless is working across repositories or interfacing with things like infrastructure.
It's also very model-dependent. Opus is a good daily driver but Codex is much better at writing tests for some reason. I'll often also switch to it for hard problems that Claude can't solve. Gemini is nice for 'I need a prototype in the next 10 minutes', especially for making quick and dirty bespoke front-ends where you don't care about the design, just the functionality.
Almost always, notes like these are going to be about greenfield projects.
Trying to incorporate it in existing codebases (esp when the end user is a support interaction or more away) is still folly, except for closely reviewed and/or non-business-logic modifications.
That said, it is quite impressive to set up a simple architecture, or just list the filenames, and tell some agents to go crazy to implement what you want the application to do. But once it crosses a certain complexity, I find you need to prompt closer and closer to the weeds to see real results. I imagine a non-technical prompter cannot proceed past a certain prototype fidelity threshold, let alone make meaningful contributions to a mature codebase via LLM without a human engineer to guide and review.
For me, in just the golang server instance and the core functional package, `cloc` reports over 40k lines of code, not counting other supporting packages. I spent the last week having Claude rip out the external auth system and replace it with a home-grown one (and having GPT-codex review its changes). If anything, Claude makes it easier on me as a solo founder with a large codebase. Rather than having to re-familiarize myself with code I wrote a year ago, I describe it at a high level, point Claude to a couple of key files, and then tell it to figure out what it needs to do. It can use grep, language server, and other tools to poke around and see what's going on. I then have it write an "epic" in markdown containing all the key files, so that future sessions already know the key files to read.
I really enjoyed the process. As TFA says, you have to keep a close eye on it. But the whole process was a lot less effort, and I ended up doing more than I would otherwise have done.
I don't know how big a sufficiently large codebase is, but we have a 1M LOC Java application that is ~10 years old and runs POS systems, and Claude Code has no issues with it. We have done full analyses with output detailing each module, and also used it to pinpoint specific issues when described. Vibe coding is not used here, just analysis.
At my dayjob my team uses it on our main dashboard, which is a pretty large CRUD application. The frontend (Vue) is a horrible mess, as it was originally built by people who knew just enough to be dangerous. Over time people have introduced new standards without cleaning up the old code - for example, we have three or four different state management technologies.
For this the LLM struggles a bit, but so does a human. The main issues are that it messes up some state that it didn't realise was used elsewhere, and our test coverage is not great. We've seen humans make exactly the same kinds of mistakes. We use MCP for Figma, so most of the time it can get a UI 95% done, with just a few tweaks needed by the operator.
On the backend (Typescript + Node, good test coverage) it can pretty much one-shot - from a plan - whatever feature you give it.
We use opus-4.5 mostly, and sometimes gpt-5.2-codex, through Cursor. You aren't going to get ChatGPT (the web interface) to do anything useful, switch to Cursor, Codex or Claude Code. And right now it is worth paying for the subscription, you don't get the same quality from cheaper or free models (although they are starting to catch up, I've had promising results from GLM-4.7).
I had never used Swift before that and was able to use AI to whip up a fairly full-featured and complex application with a decent amount of code. I had to make some cross-cutting changes along the way as well that impacted quite a few files, and things mostly worked fine with me guiding the AI. Mind you, this was a year ago, so I can only imagine how much better I would fare now with even better AI models. That whole month was spent not only on coding but on learning Swift enough to fix problems when AI started running in circles, and then learning about the Xcode profiler to optimize the application for speed and improve perf.
> They seem mostly useless in a sufficiently large codebase especially when they are messy and interactions aren't always obvious.
What type of documents do you have explaining the codebase and its messy interactions, and have you provided that to the LLM?
Also, have you tried giving someone brand new to the team the exact same task and information you gave to the LLM, and how effective were they compared to the LLM?
> I don't know how much better Claude is than ChatGPT, but I can't get ChatGPT to do much useful with an existing large codebase.
As others have pointed out, from your comment, it doesn't sound like you've used a tool dedicated for AI coding.
(But even if you had, it would still fail if you expect LLMs to do stuff without sufficient context).
The code base I work on at $dayjob$ is legacy; it has a few files with 20k lines each and a few more with around 10k lines each. It's hard to find things and connect dots in the code base. I don't think LLMs are able to navigate and understand code bases of that size yet. But I have seen lots of seemingly large projects shown here lately that involve thousands of files and millions of lines of code.
If you have a ChatGPT account, there's nothing stopping you from installing codex cli and using your chatgpt account with it. I haven't coded with ChatGPT for weeks. Maybe a month ago I got utility out of coding with codex and then having ChatGPT look at my open IDE page to give comments, but since 5.2 came out, it's been 100% codex.
1. Write good documentation, architecture, how things work, code styling, etc.
2. Put your important dependencies' source code in the same directory. E.g. add a `_vendor` directory to the project and put in it each dependency's codebase at the same tag you're using: postgres, redis, vue, whatever.
3. Write good plans and requirements. Acceptance criteria, context, user stories, etc. Save them in markdown files. Review those multiple times with LLMs trying to find weaknesses. Then move to implementation files: make it write a detailed plan of what it's gonna change and why, and what it will produce.
4. Write very good prompts. LLMs follow instructions well if they are clear: "you should proactively do X" is a weak instruction if you mean "you must do X".
5. LLMs are far from perfect, and full of limits. Karpathy sums up their cons very well in his long list. If you don't know their limits, you'll mismanage expectations, not use them when they are a huge boost, and waste time on things they don't cope well with. On top of that: all LLMs are different in their "personality", how they adhere to instructions, how creative they are, etc.
Also I never see anyone talking about code reviews, which is one of the primary ways that software engineering departments manage liability. We fired someone recently because they couldn’t explain any of the slop they were trying to get merged. Why tf would I accept the liability of managing code that someone else can’t even explain?
I guess this is fine when you don’t have customers or stakeholders that give a shit lol.
I've been trying Claude on my large code base today. When I give it the requirements I'd give an engineer and say "do it", it just writes garbage that doesn't make sense and doesn't seem to even meet the requirements (if it does, I can't follow how - though I'll admit to giving up before I understood what it did, and I didn't try it on a real system). When I forced it to step back and do tiny steps - in TDD, write one test of the full feature - it did much better - but then I spent the next 5 hours adjusting the code it wrote to meet our coding standards. At least I understand the code, but I'm not sure it is any faster (though it is a lot easier to spot what's wrong than to come up with greenfield code).
Which is to say you have to learn to use the tools. I've only just started, and cannot claim to be an expert. I'll keep using them - in part because everyone is demanding I do - but to use them you clearly need to know how to do it yourself.
I successfully use Claude Code in a large complex codebase. It's Clojure, perhaps that helps (Clojure is very concise, expressive and hence token-dense).
What do you even mean by "ChatGPT"? Copy pasting code into chatgpt.com?
AI assisted coding has never been like that, which would be atrocious. The typical workflow was using Cursor with some model of your choice (almost always an Anthropic model like sonnet before opus 4.5 released). Nowadays (in addition to IDEs) it's often a CLI tool like Claude Code with Opus or Codex CLI with GPT Codex 5.2 high/xhigh.
> - What does LLM coding feel like in the future? Is it like playing StarCraft? Playing Factorio? Playing music?
Starcraft and Factorio are exactly what it is not. Starcraft has a loooot of micro involved at any level beyond mid level play, despite all the "pro macros and beats gold league with mass queens" meme videos. I guess it could be like Factorio if you're playing it by plugging together blueprint books from other people but I don't think that's how most people play.
At that level of abstraction, it's more like grand strategy if you're to compare it to any video game? You're controlling high level pushes and then the units "do stuff" and then you react to the results.
> the ratio of productivity between the mean and the max engineer? It's quite possible that this grows *a lot*
I have a professor who has researched auto generated code for decades and about six months ago he told me he didn't think AI would make humans obsolete but that it was like other incremental tools over the years and it would just make good coders even better than other coders. He also said it would probably come with its share of disappointments and never be fully autonomous. Some of what he said was a critique of AI and some of it was just pointing out that it's very difficult to have perfect code/specs.
And then someone comes and "improves" their agent with additional "do not repeat yourself" prompts scattered all over the place, to no avail.
"Asinine" describes my experience perfectly.
https://xcancel.com/bcherny/status/2015979257038831967
After you've tried it, come back.
If you're using plain vanilla ChatGPT, you're woefully, woefully out of touch. Heck, even plain Claude Code is now outdated.
Does anybody have any info on what he is actually working on besides all the vibe-coding tweets?
There seems to be zero output from the guy for the past 2 years (except tweets).
Billionaire coder: a person who has "written" a billion lines.
Ordinary coders: people with only a couple of thousand to their git blame.