mstank|2 months ago
[Research] ask the agent to explain current functionality as a way to load the right files into context.
[Plan] ask the agent to brainstorm the best practices way to implement a new feature or refactor. Brainstorm seems to be a keyword that triggers a better questioning loop for the agent. Ask it to write a detailed implementation plan to an md file.
[clear] completely clear the context of the agent -- this gives better results than just compacting the conversation.
[execute plan] ask the agent to review the specific plan again; sometimes it will ask additional questions, which repeats the planning phase. This loads only the plan into context, and then I have it implement the plan.
[review & test] clear the context again and ask it to review the plan to make sure everything was implemented. This is where I add any unit or integration tests if needed. Also run test suites, type checks, lint, etc.
With this loop I’ve often had it run for 20-30 minutes straight and end up with usable results. It’s become a game of context management and creating a solid testing feedback loop instead of trying to purely one-shot issues.
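The loop above can be sketched as a small driver script. `run_agent` is a hypothetical stand-in for whatever agent CLI you use, and each call represents a fresh context (the "[clear]" steps); the prompts are illustrative, not exact:

```python
# Sketch of the research -> plan -> clear -> execute -> review loop.
# run_agent() is a placeholder for your agent CLI; each call here is
# assumed to start with a *fresh* context.

def run_agent(prompt: str) -> str:
    """Placeholder: in practice this would shell out to your agent."""
    return f"agent response to: {prompt}"

def feature_loop(feature: str, plan_path: str = "plan.md") -> str:
    # [Research] load the right files into context via an explanation request
    run_agent(f"Explain the current functionality related to: {feature}")
    # [Plan] 'brainstorm' tends to trigger a better questioning loop
    run_agent(f"Brainstorm the best-practices way to implement {feature}. "
              f"Write a detailed implementation plan to {plan_path}.")
    # [clear] + [execute plan]: only the plan file is loaded, not the research chat
    run_agent(f"Review {plan_path}, ask any open questions, then implement it.")
    # [clear] + [review & test]: verify the plan was fully implemented
    return run_agent(f"Review {plan_path} and verify every item was implemented. "
                     f"Run the test suite, type checks, and lint.")

result = feature_loop("rate limiting")
```

The key design point is that state lives in the plan file on disk, not in the conversation, so every phase can start from an empty context.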
jarjoura|2 months ago
The biggest gotcha I found is that these LLMs love to write code as if it were C/Python, just transliterated into your favorite language of choice. Instead of considering that something should be encapsulated in an object to maintain state, it will instead write 5 functions, passing the state as parameters between each one. It will also consistently ignore most of the code around it, even when reading it would tell it what could be reused. So you end up with copy-pasta code, and unstructured copy-pasta at best.
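A toy illustration of the pattern being described (both versions are hypothetical, just to contrast the shapes):

```python
# What the model tends to emit: free functions threading state
# through parameters as a plain dict...
def open_conn(cfg):
    return {"cfg": cfg, "retries": 0}

def send(conn, msg):
    conn["retries"] += 1
    return f"sent {msg} (attempt {conn['retries']})"

# ...versus encapsulating that state in an object, so callers
# never touch the retry counter directly:
class Connection:
    def __init__(self, cfg):
        self.cfg = cfg
        self.retries = 0

    def send(self, msg):
        self.retries += 1
        return f"sent {msg} (attempt {self.retries})"

conn = Connection({"host": "db"})
conn.send("ping")
```

Both work, but the first style leaks the state's shape into every call site, which is exactly the unstructured copy-pasta complaint.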
The other gotcha is that Claude usually ignores CLAUDE.md. So for me, I first prompt it to read it, and then I prompt it to explore. With those two rules in place, it usually does a good job following my request to fix, or add a new feature, or whatever, all within a single context. These recent agents do a much better job of throwing away useless context.
I do think the older models and agents get better results when writing things to a plan document, but I've noticed recent opus and sonnet usually end up just writing the same code to the plan document anyway. That usually ends up confusing itself because it can't connect it to the code around the changes as easily.
coldtea|2 months ago
Sounds very functional, testable, and clean. Sign me up.
nextaccountic|2 months ago
Does the UI show clearly what portion was done by a subagent?
prmph|2 months ago
I've had models do the complete opposite of what I've put in the plan and guidelines. I've had them go re-read the exact sentences, and still see them come to the opposite conclusion, and my instructions are nothing complex at all.
I used to think one could build a workflow and process around LLMs that extract good value from them consistently, but I'm now not so sure.
I notice that sometimes the model will be in a good state, and do a long chain of edits of good quality. The problem is, it's still a crap-shoot how to get them into a good state.
hu3|2 months ago
LLMs become increasingly error-prone as their memory fills up. Just like humans.
In VSCode Copilot you can keep track of how many tokens the LLM is dealing with in realtime with "Chat Debug".
When it reaches 90k tokens I expect degraded intelligence and brace for a possible forced summarization.
Sometimes I just stop LLMs and continue the work in a new session.
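A rough sketch of tracking that budget yourself, using the ~90k figure from the comment above. The 4-characters-per-token ratio is a common rule of thumb, not a real tokenizer, and the threshold is whatever you observe degradation at:

```python
# Estimate session size and flag when it nears the point where
# degraded output (or a forced summarization) becomes likely.
# 4 chars/token is a rough heuristic, not an exact count.

DEGRADE_THRESHOLD = 90_000

def estimate_tokens(messages: list[str]) -> int:
    return sum(len(m) for m in messages) // 4

def should_restart_session(messages: list[str]) -> bool:
    return estimate_tokens(messages) >= DEGRADE_THRESHOLD

history = ["x" * 400_000]  # roughly 100k estimated tokens
restart = should_restart_session(history)
```

In practice the signal is the same either way: past the threshold, stop and carry only the distilled plan into a new session.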
mstank|2 months ago
Biggest step-change has been being able to one-shot file refactors (using the planning framework I mentioned above). 6 months ago refactoring was a very delicate dance and now it feels like it’s pretty much streamlined.
godzillafarts|2 months ago
We've taken those prompts, tweaked them to be more relevant to us and our stack, and have pulled them in as custom commands that can be executed in Claude Code, i.e. `/research_codebase`, `/create_plan`, and `/implement_plan`.
It's working exceptionally well for me, though it helps that I'm very meticulous about reviewing the output and correcting it during the research and planning phases. Aside from a few use cases with mixed results, it hasn't really taken off with the rest of our team, unfortunately.
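For anyone unfamiliar, Claude Code custom slash commands like these are just markdown prompt files under `.claude/commands/`. A hedged sketch of what one might look like (the wording and file layout here are illustrative, not the commenter's actual prompts):

```markdown
<!-- .claude/commands/create_plan.md (sketch; adapt the prompt to your stack) -->
Brainstorm the best-practices way to implement: $ARGUMENTS

Ask clarifying questions before committing to an approach, then write a
detailed implementation plan to a markdown file in plans/.
```

Invoking `/create_plan add rate limiting` then substitutes the trailing text for `$ARGUMENTS`.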
hu3|2 months ago
For planning large tasks like "setup playwright tests in this project with some demo tests" I spend some time chatting with Gemini 3 or Opus 4.5 to figure out the most idiomatic easy-wins and possible pitfalls. Like: separate database for playwright tests. Separate users in playwright tests. Skipping login flow for most tests. And so on.
I suspect that devs who use a formal-plan-first approach tend to tackle larger tasks and even vibe code large features at a time.
mstank|2 months ago
For really big features or plans I'll ask the agent to create Linear issue tickets to track progress for each phase over multiple sessions. The only MCP I usually have loaded is Linear, but I'm looking for a good way to transition it to a skill.
zingar|2 months ago
It’ll report, “Numbers changed in step 6a therefore it worked” [forgetting the pivotal role of step 2 which failed and as a result the agent should have taken step 6b, not 6a].
Or “there is conclusive evidence that X is present and therefore we were successful” [X is discussed in the plan as the reason why action is NEEDED, not as success criteria].
I _think_ that what is going wrong is context overload, and my remedy is to have the agent update every step of the plan with results immediately after acting, before moving on to the next step.
When things seem off I can then clear context and have the agent review results step by step to debug its own work: “review step 2 of the results. Are the stated results consistent with the final conclusions? Quote lines from the results verbatim as evidence.”
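The record-results-immediately idea can be sketched as a tiny helper; the plan structure and field names here are made up for illustration:

```python
# Sketch: attach the actual outcome to each plan step right after acting
# on it, so a fresh context can later audit the plan step by step instead
# of inferring success from downstream steps.

def record_result(plan: dict, step: str, result: str) -> dict:
    """Mark a step done and store what actually happened, verbatim."""
    plan[step] = {"status": "done", "result": result}
    return plan

plan = {"step 2": None, "step 6a": None}
record_result(plan, "step 2", "pivot check FAILED: counts unchanged")
# A reviewing agent can now quote plan["step 2"]["result"] as evidence,
# rather than claiming success because numbers changed in step 6a.
```

The point is that the evidence is written down at the moment it exists, not reconstructed from an overloaded context at the end.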
dfsegoat|2 months ago
At a basic level, they work akin to git hooks, but they fire up a whole new context whenever certain events trigger (e.g. another agent finishes implementing changes), and that hook instance is independent of the implementation context (which is great for the review case, since it acts as a semi-independent reviewer).
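The mechanism described can be sketched generically; `spawn_fresh_context` and the event names below are hypothetical stand-ins, not any real hook API:

```python
# Sketch of the hook idea: when an "implementation finished" event fires,
# spawn a reviewer in a *fresh* context, independent of the implementer's.
# spawn_fresh_context() is a placeholder for your agent runner.

def spawn_fresh_context(prompt: str) -> str:
    return f"new-context review of: {prompt}"

HOOKS = {"post_implementation": [
    lambda event: spawn_fresh_context(f"Review the changes in {event['diff']}")
]}

def fire(event_name: str, event: dict) -> list[str]:
    # Like a git hook: every registered handler runs, each one isolated
    # from the context that produced the triggering event.
    return [handler(event) for handler in HOOKS.get(event_name, [])]

results = fire("post_implementation", {"diff": "changes.patch"})
```

Because the reviewer never sees the implementer's conversation, it can't inherit the implementer's assumptions, which is what makes it semi-independent.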