sparin9 | 8 days ago
LLMs don’t usually fail at syntax. They fail at invisible assumptions about architecture, constraints, invariants, etc. A written plan becomes a debugging surface for those assumptions.
remify | 7 days ago
There's also a blue team / red team setup that works.
The idea is always the same: help the LLM reason properly with fewer, clearer instructions.
hinkley | 7 days ago
All of these models are kinda toys as long as you have to manually send a minder in to deal with their bullshit. If we can do it via agents, then the vendors can bake it in, and they haven't. Which is just another judgement call about how much autonomy you give to someone who clearly isn't policing their own decisions and thus is untrustworthy.
If we're at the start of the Trough of Disillusionment now, which maybe we are and maybe we aren't, that'll be part of the rebound that typically follows the trough. But the Trough is also typically the end of the mountains of VC cash, so the cost per use goes up, which can trigger aftershocks.
vincentvandeth | 7 days ago
After 6 months in production and 1100+ learned patterns: fewer moving parts, better debugging, more reliable output. Built a full production crawler this way — 26 extractors, 405 tests — without sub-agents. Orchestrator acts as gatekeeper that redispatches uncompleted work.
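A minimal sketch of that gatekeeper loop, assuming a simple work queue (the function and field names here are illustrative, not the commenter's actual code): the orchestrator dispatches tasks, verifies each result, and requeues anything that comes back incomplete.

```python
from collections import deque

def run_worker(task):
    # Stand-in for a single LLM worker call; returns (result, done_flag).
    # Here we simulate a task that only succeeds on its second attempt.
    task["attempts"] += 1
    return f"result-{task['name']}", task["attempts"] >= 2

def orchestrate(tasks, max_attempts=3):
    """Dispatch tasks, check completion, and redispatch unfinished work."""
    queue = deque({"name": t, "attempts": 0} for t in tasks)
    completed = {}
    while queue:
        task = queue.popleft()
        result, done = run_worker(task)
        if done:
            completed[task["name"]] = result
        elif task["attempts"] < max_attempts:
            queue.append(task)  # gatekeeper: requeue uncompleted work
    return completed

print(orchestrate(["extract_a", "extract_b"]))
```

The point of the pattern is that completion is judged by the orchestrator, not self-reported by the worker, so dropped or half-finished work gets retried instead of silently lost.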
vagab0nd | 6 days ago
Requesting { "output": "x" } consistently fails, despite detailed instructions.
Changing to requesting { "output": "x", "reasoning": "y" } produces the desired outcome.
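A minimal sketch of this trick, with illustrative prompt and field wording (the schema shape is from the comment; everything else is an assumption): ask for a "reasoning" field alongside the output, then discard it when parsing.

```python
import json

def build_prompt(task: str) -> str:
    # Asking the model to fill "reasoning" before "output" nudges it to
    # explain itself before committing to an answer; the reasoning is
    # thrown away after parsing.
    schema = '{ "reasoning": "<why>", "output": "<answer>" }'
    return f"{task}\nRespond with JSON matching exactly: {schema}"

def parse_response(raw: str) -> str:
    # Keep only the field we actually need.
    data = json.loads(raw)
    return data["output"]

# Example with a canned model response:
raw = '{"reasoning": "x matches the requested value", "output": "x"}'
print(parse_response(raw))  # prints: x
```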
maccard | 7 days ago
Really? My experience has been that it’s incredibly easy to get them stuck in a loop on a hallucinated API and burn through credits before I’ve even noticed what they've done. I have a small Rust project that stores stuff on disk that I wanted to add an S3 backend to - Claude Code burned through my $20 in about 30 minutes, looping on a very simple syntax issue without any awareness of what it was doing.
zenoprax | 7 days ago
I really hope the fine-tuning of our slop detectors can help with misinformation and bullshit detection.