
sparin9 | 8 days ago

I think the real value here isn’t “planning vs not planning,” it’s forcing the model to surface its assumptions before they harden into code.

LLMs don’t usually fail at syntax. They fail at invisible assumptions about architecture, constraints, invariants, etc. A written plan becomes a debugging surface for those assumptions.

maxnevermind|7 days ago

Yep, I recently came to the realization that it's useful to think of LLMs as assumption engines. They have trillions of those and fill gaps when they see the need. As I understand it, the assumptions are based on industry standards; if those deviate from what you're trying to build, you might start having problems. For example, when you try to implement a solution that isn't "googlable", the LLM will assume some standard way to do it and keep pushing it. Then you have to provide more context, and if you spend too much time providing that context, you might not save much time in the end.

remify|7 days ago

Sub-agents also help a lot in that regard. Have one agent do the planning, an implementation agent write the code, and another one do the review. Clear responsibilities help a lot.

There's also the blue team / red team setup, which works too.

The idea is always the same: help the LLM reason properly with fewer, clearer instructions.
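The plan / implement / review split described above can be sketched roughly like this. This is an illustrative sketch, not anyone's actual setup: `run_pipeline` and `fake_llm` are made-up names, and a real version would call an actual model API in place of the stub.

```python
from typing import Callable

def run_pipeline(task: str, llm: Callable[[str, str], str]) -> dict:
    """Each role gets a narrow system prompt and only the context it needs."""
    plan = llm("You only write implementation plans.", task)
    code = llm("You only write code that follows the given plan.",
               f"Plan:\n{plan}\n\nTask:\n{task}")
    review = llm("You only review code against its plan.",
                 f"Plan:\n{plan}\n\nCode:\n{code}")
    return {"plan": plan, "code": code, "review": review}

# Stub model so the sketch runs without a provider; swap in a real client.
def fake_llm(system: str, prompt: str) -> str:
    return f"stub({len(prompt)} chars)"

result = run_pipeline("add an S3 backend to the storage layer", fake_llm)
```

The point of the split is that the reviewer never sees the implementer's scratch context, only the plan and the code, which keeps each role's responsibility clear.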

jalopy|7 days ago

This sounds very promising. Any link to more details?

hinkley|7 days ago

A huge part of getting autonomy as a human is demonstrating that you can be trusted to police your own decisions up to a point that other people can reason about. Some people get more autonomy than others because they can be trusted with more things.

All of these models are kinda toys as long as you have to manually send a minder in to deal with their bullshit. If we can do it via agents, then the vendors can bake it in, and they haven't. Which is just another judgement call about how much autonomy you give to someone who clearly isn't policing their own decisions and thus is untrustworthy.

If we're at the start of the Trough of Disillusionment now, which maybe we are and maybe we aren't, that'll be part of the rebound that typically follows the trough. But the Trough is also typically the end of the mountains of VC cash, so the cost per use goes up, which can trigger aftershocks.

vincentvandeth|7 days ago

This approach sounds clean in theory, but in production you're building a black box. When your planning agent hands off to an implementation agent and that hands off to a review agent — where did the bug originate? Which agent's context was polluted? Good luck tracing that. I went the opposite direction: single agent per task, strict quality gates between steps, full execution logs. No sub-agents. Every decision is traceable to one context window. The governance layer (PR gates, staged rollouts, acceptance criteria) does the work that people expect sub-agents to do — but with actual observability.

After 6 months in production and 1100+ learned patterns: fewer moving parts, better debugging, more reliable output. Built a full production crawler this way — 26 extractors, 405 tests — without sub-agents. Orchestrator acts as gatekeeper that redispatches uncompleted work.
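The orchestrator-as-gatekeeper loop described above might look something like this. Everything here is a hypothetical sketch under my own assumptions: `run_agent`, the gate predicates, and the retry shape are illustrative stand-ins, not the commenter's actual system.

```python
from typing import Callable

def orchestrate(task: str,
                run_agent: Callable[[str], str],
                gates: list[Callable[[str], bool]],
                max_attempts: int = 3) -> str:
    """One agent per task; failed quality gates cause a redispatch."""
    log = []  # full execution log: every attempt stays traceable
    for attempt in range(max_attempts):
        output = run_agent(task)
        failed = [g.__name__ for g in gates if not g(output)]
        log.append({"attempt": attempt, "failed_gates": failed})
        if not failed:
            return output
        # Feed the failure back into the same context instead of a sub-agent.
        task = f"{task}\nPrevious attempt failed gates: {failed}"
    raise RuntimeError(f"gates never passed: {log}")

# Toy gate and agent: the second attempt satisfies the gate.
def mentions_tests(out: str) -> bool:
    return "tested" in out

attempts = iter(["draft", "draft, tested"])
result = orchestrate("build extractor", lambda t: next(attempts), [mentions_tests])
```

Because there is exactly one context window per task, a failing gate points directly at the attempt (and log entry) that produced the bad output.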

antonvs|7 days ago

Since the phases are sequential, what’s the benefit of a sub agent vs just sequential prompts to the same agent? Just orchestration?

drivebyhooting|7 days ago

This runs counter to the advice in the fine article: one long continuous session building context.

synergy20|7 days ago

I think claude-code is doing this in the background now

vincentvandeth|7 days ago

This is underrated. The plan isn't documentation — it's a test harness for assumptions. I document invariants upfront (memory budgets, latency ceilings, concurrency limits) and validate every agent decision against them. Caught an architecture-level mistake this way: the obvious approach to browser management violated three constraints simultaneously. No amount of syntax-level review would have found that.
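The "plan as test harness" idea above can be made concrete as an invariant check that every proposed decision runs through. The budget numbers and field names below are made up for illustration; the real invariants would come from your own plan document.

```python
# Invariants written down up front, per the plan (illustrative values).
INVARIANTS = {
    "max_memory_mb": 512,    # memory budget
    "max_latency_ms": 200,   # latency ceiling
    "max_concurrency": 8,    # concurrency limit
}

def violated(decision: dict) -> list[str]:
    """Return the names of invariants a proposed decision would break."""
    checks = {
        "max_memory_mb": decision.get("memory_mb", 0) <= INVARIANTS["max_memory_mb"],
        "max_latency_ms": decision.get("latency_ms", 0) <= INVARIANTS["max_latency_ms"],
        "max_concurrency": decision.get("concurrency", 0) <= INVARIANTS["max_concurrency"],
    }
    return [name for name, ok in checks.items() if not ok]

# An "obvious" approach can look fine locally yet blow several budgets at once:
proposal = {"memory_mb": 4096, "latency_ms": 150, "concurrency": 32}
```

A syntax-level review would never flag `proposal`; checking it against the written invariants does.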

vagab0nd|6 days ago

I recently learned a trick to improve an LLM's thinking (maybe it's well known?):

Requesting { "output": "x" } consistently fails, despite detailed instructions.

Changing to requesting { "output": "x", "reasoning": "y" } produces the desired outcome.
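A minimal sketch of the trick above, with an illustrative validation helper (`parse_response` is my own name, not any library's API):

```python
import json

BARE_SCHEMA = {"output": "x"}                       # consistently fails
WITH_REASONING = {"output": "x", "reasoning": "y"}  # produces the desired outcome

def parse_response(raw: str) -> str:
    """Reject model replies that skipped the reasoning field."""
    data = json.loads(raw)
    if "reasoning" not in data:
        raise ValueError("reply is missing its reasoning field")
    return data["output"]
```

One plausible mechanism: the model generates tokens left to right, so a reasoning field gives it space to work before committing to the answer, much like chain-of-thought prompting in free-form text.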

asdxrfx|7 days ago

It's also great to describe the full use-case flow in the instructions, so you can be confident the LLM won't do something stupid on its own

maccard|7 days ago

> LLMs don’t usually fail at syntax?

Really? My experience has been that it's incredibly easy to get them stuck in a loop on a hallucinated API and burn through credits before I've even noticed what they've done. I have a small Rust project that stores stuff on disk that I wanted to add an S3 backend to - Claude Code burned through my $20 in about 30 minutes, looping on a very simple syntax issue without any awareness of what it was doing.

kertoip_1|7 days ago

Might depend on the language used. In my experience Claude Sonnet indeed never makes syntax mistakes in JS/TS/C#, but those are popular languages with lots of training data.

hun3|7 days ago

Except that merely surfacing them changes their behavior, like how you add that one printf() call and suddenly your heisenbug is nonexistent.

MagicMoonlight|7 days ago

Did you just write this with ChatGPT?

zenoprax|7 days ago

I've never seen an LLM use "etc" but the rest gives a strong "it's not just X, it's Y" vibe.

I really hope the fine-tuning of our slop detectors can help with misinformation and bullshit detection.