mstank|2 months ago
[Research] ask the agent to explain current functionality as a way to load the right files into context.
[Plan] ask the agent to brainstorm the best practices way to implement a new feature or refactor. Brainstorm seems to be a keyword that triggers a better questioning loop for the agent. Ask it to write a detailed implementation plan to an md file.
[clear] completely clear the context of the agent -- this gives better results than just compacting the conversation.
[execute plan] ask the agent to review the specific plan again; sometimes it will ask additional questions, which repeats the planning phase. This loads only the plan into context, and then I have it implement the plan.
[review & test] clear the context again and ask it to review the plan to make sure everything was implemented. This is where I add any unit or integration tests if needed. Also run test suites, type checks, lint, etc.
With this loop I’ve often had it run for 20-30 minutes straight and end up with usable results. It’s become a game of context management and creating a solid testing feedback loop instead of trying to purely one-shot issues.
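The loop above can be sketched as a small driver script. `run_agent` is a hypothetical stand-in for whatever agent CLI you use, and each call represents a fresh context (the "[clear]" steps); the prompts are illustrative, not exact:

```python
# Sketch of the research -> plan -> clear -> execute -> review loop.
# run_agent() is a placeholder for your agent CLI; each call here is
# assumed to start with a *fresh* context.

def run_agent(prompt: str) -> str:
    """Placeholder: in practice this would shell out to your agent."""
    return f"agent response to: {prompt}"

def feature_loop(feature: str, plan_path: str = "plan.md") -> str:
    # [Research] load the right files into context via an explanation request
    run_agent(f"Explain the current functionality related to: {feature}")
    # [Plan] 'brainstorm' tends to trigger a better questioning loop
    run_agent(f"Brainstorm the best-practices way to implement {feature}. "
              f"Write a detailed implementation plan to {plan_path}.")
    # [clear] + [execute plan]: only the plan file is loaded, not the research chat
    run_agent(f"Review {plan_path}, ask any open questions, then implement it.")
    # [clear] + [review & test]: verify the plan was fully implemented
    return run_agent(f"Review {plan_path} and verify every item was implemented. "
                     f"Run the test suite, type checks, and lint.")

result = feature_loop("rate limiting")
```

The key design point is that state lives in the plan file on disk, not in the conversation, so every phase can start from an empty context.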
jarjoura|2 months ago
The biggest gotcha I found is that these LLMs love to write code as if it were C/Python, just transliterated into your favorite language of choice. Instead of considering that something should be encapsulated in an object to maintain state, it will instead write 5 functions, passing the state as parameters between each one. It will also consistently ignore most of the code around it, even when reading it would tell it what could be reused. So you end up with copy-pasta code, and unstructured copy-pasta at best.
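A toy illustration of the pattern being described (both versions are hypothetical, just to contrast the shapes):

```python
# What the model tends to emit: free functions threading state
# through parameters as a plain dict...
def open_conn(cfg):
    return {"cfg": cfg, "retries": 0}

def send(conn, msg):
    conn["retries"] += 1
    return f"sent {msg} (attempt {conn['retries']})"

# ...versus encapsulating that state in an object, so callers
# never touch the retry counter directly:
class Connection:
    def __init__(self, cfg):
        self.cfg = cfg
        self.retries = 0

    def send(self, msg):
        self.retries += 1
        return f"sent {msg} (attempt {self.retries})"

conn = Connection({"host": "db"})
conn.send("ping")
```

Both work, but the first style leaks the state's shape into every call site, which is exactly the unstructured copy-pasta complaint.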
The other gotcha is that Claude usually ignores CLAUDE.md. So for me, I first prompt it to read it, and then I prompt it to explore. With those two rules in place, it usually does a good job following my request to fix, or add a new feature, or whatever, all within a single context. These recent agents do a much better job of throwing away useless context.
I do think the older models and agents get better results when writing things to a plan document, but I've noticed recent opus and sonnet usually end up just writing the same code to the plan document anyway. That usually ends up confusing itself because it can't connect it to the code around the changes as easily.
coldtea|2 months ago
Sounds very functional, testable, and clean. Sign me up.
nextaccountic|2 months ago
Does the UI show clearly what portion was done by a subagent?
prmph|2 months ago
I've had models do the complete opposite of what I've put in the plan and guidelines. I've had them go re-read the exact sentences, and still see them come to the opposite conclusion, and my instructions are nothing complex at all.
I used to think one could build a workflow and process around LLMs that extract good value from them consistently, but I'm now not so sure.
I notice that sometimes the model will be in a good state, and do a long chain of edits of good quality. The problem is, it's still a crap-shoot how to get them into a good state.
hu3|2 months ago
LLMs become increasingly error-prone as their memory fills up. Just like humans.
In VSCode Copilot you can keep track of how many tokens the LLM is dealing with in realtime with "Chat Debug".
When it reaches 90k tokens I expect degraded intelligence and brace for a possible forced summarization.
Sometimes I just stop LLMs and continue the work in a new session.
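A rough sketch of tracking that budget yourself, using the ~90k figure from the comment above. The 4-characters-per-token ratio is a common rule of thumb, not a real tokenizer, and the threshold is whatever you observe degradation at:

```python
# Estimate session size and flag when it nears the point where
# degraded output (or a forced summarization) becomes likely.
# 4 chars/token is a rough heuristic, not an exact count.

DEGRADE_THRESHOLD = 90_000

def estimate_tokens(messages: list[str]) -> int:
    return sum(len(m) for m in messages) // 4

def should_restart_session(messages: list[str]) -> bool:
    return estimate_tokens(messages) >= DEGRADE_THRESHOLD

history = ["x" * 400_000]  # roughly 100k estimated tokens
restart = should_restart_session(history)
```

In practice the signal is the same either way: past the threshold, stop and carry only the distilled plan into a new session.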
mstank|2 months ago
Biggest step-change has been being able to one-shot file refactors (using the planning framework I mentioned above). 6 months ago refactoring was a very delicate dance and now it feels like it’s pretty much streamlined.
godzillafarts|2 months ago
We've taken those prompts, tweaked them to be more relevant to us and our stack, and have pulled them in as custom commands that can be executed in Claude Code, i.e. `/research_codebase`, `/create_plan`, and `/implement_plan`.
It's working exceptionally well for me, though it helps that I'm very meticulous about reviewing the output and correcting it during the research and planning phases. Aside from a few use cases with mixed results, it hasn't really taken off with the rest of our team, unfortunately.
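For anyone unfamiliar, Claude Code custom slash commands like these are just markdown prompt files under `.claude/commands/`. A hedged sketch of what one might look like (the wording and file layout here are illustrative, not the commenter's actual prompts):

```markdown
<!-- .claude/commands/create_plan.md (sketch; adapt the prompt to your stack) -->
Brainstorm the best-practices way to implement: $ARGUMENTS

Ask clarifying questions before committing to an approach, then write a
detailed implementation plan to a markdown file in plans/.
```

Invoking `/create_plan add rate limiting` then substitutes the trailing text for `$ARGUMENTS`.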
hu3|2 months ago
For planning large tasks like "setup playwright tests in this project with some demo tests" I spend some time chatting with Gemini 3 or Opus 4.5 to figure out the most idiomatic easy-wins and possible pitfalls. Like: separate database for playwright tests. Separate users in playwright tests. Skipping login flow for most tests. And so on.
I suspect that devs who use a formal-plan-first approach tend to tackle larger tasks and even vibe code large features at a time.
mstank|2 months ago
For really big features or plans I'll ask the agent to create Linear issue tickets to track progress for each phase over multiple sessions. The only MCP I usually have loaded is Linear, but I'm looking for a good way to transition it to a skill.
zingar|2 months ago
It’ll report, “Numbers changed in step 6a therefore it worked” [forgetting the pivotal role of step 2 which failed and as a result the agent should have taken step 6b, not 6a].
Or “there is conclusive evidence that X is present and therefore we were successful” [X is discussed in the plan as the reason why action is NEEDED, not as success criteria].
I _think_ that what is going wrong is context overload, and my remedy is to have the agent update every step of the plan with results immediately after acting, before moving on to the next step.
When things seem off I can then clear context and have the agent review results step by step to debug its own work: “review step 2 of the results. Are the stated results consistent with the final conclusions? Quote lines from the results verbatim as evidence.”
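The record-results-immediately idea can be sketched as a tiny helper; the plan structure and field names here are made up for illustration:

```python
# Sketch: attach the actual outcome to each plan step right after acting
# on it, so a fresh context can later audit the plan step by step instead
# of inferring success from downstream steps.

def record_result(plan: dict, step: str, result: str) -> dict:
    """Mark a step done and store what actually happened, verbatim."""
    plan[step] = {"status": "done", "result": result}
    return plan

plan = {"step 2": None, "step 6a": None}
record_result(plan, "step 2", "pivot check FAILED: counts unchanged")
# A reviewing agent can now quote plan["step 2"]["result"] as evidence,
# rather than claiming success because numbers changed in step 6a.
```

The point is that the evidence is written down at the moment it exists, not reconstructed from an overloaded context at the end.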
dfsegoat|2 months ago
At a basic level, they work akin to git hooks, but they fire up a whole new context whenever certain events trigger (e.g. another agent finishes implementing changes), and that hook instance is independent of the implementation context (which is great for the review case, since it acts as a semi-independent reviewer).
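The mechanism described can be sketched generically; `spawn_fresh_context` and the event names below are hypothetical stand-ins, not any real hook API:

```python
# Sketch of the hook idea: when an "implementation finished" event fires,
# spawn a reviewer in a *fresh* context, independent of the implementer's.
# spawn_fresh_context() is a placeholder for your agent runner.

def spawn_fresh_context(prompt: str) -> str:
    return f"new-context review of: {prompt}"

HOOKS = {"post_implementation": [
    lambda event: spawn_fresh_context(f"Review the changes in {event['diff']}")
]}

def fire(event_name: str, event: dict) -> list[str]:
    # Like a git hook: every registered handler runs, each one isolated
    # from the context that produced the triggering event.
    return [handler(event) for handler in HOOKS.get(event_name, [])]

results = fire("post_implementation", {"diff": "changes.patch"})
```

Because the reviewer never sees the implementer's conversation, it can't inherit the implementer's assumptions, which is what makes it semi-independent.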