Runtime validation is still fucked in AI coding agents

1 point | sebringj | 19 days ago

AI agents (Cursor, Claude computer-use, Copilot agent mode, etc.) have gotten stupidly good at spitting out code. Prompt → boom, clean code. The marketing says "it just works."

It fucking doesn't.

You run it in a real app and immediately hit the same bullshit wall every time:

- Hallucinated logic only reveals itself under real data or edge cases
- UI updates magically forget to sync across devices (mobile → web = sad trombone)
- API calls quietly return 401s or other crap that gets swallowed in some lazy try-catch
- Vision-based agents crawl like molasses (2–10s per action) and torch tokens like it's free
- Background pings and unrelated fetches make it impossible to tell what actually caused what
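The swallowed-401 item is easy to reproduce. Here's a minimal Python sketch; `FakeResponse`, `fetch_profile`, and both loaders are hypothetical names made up for illustration, standing in for whatever HTTP client and API the app actually uses:

```python
import json
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an HTTP response; not a real client library."""
    status: int
    body: str


def fetch_profile(token: str) -> FakeResponse:
    # Hypothetical API call: rejects anything but the one valid token.
    if token != "good-token":
        return FakeResponse(401, '{"error": "unauthorized"}')
    return FakeResponse(200, '{"name": "sam"}')


def load_name_lazy(token: str) -> str:
    # The lazy pattern: any failure at all becomes a default value.
    try:
        resp = fetch_profile(token)
        return json.loads(resp.body)["name"]  # KeyError on the 401 body
    except Exception:
        return ""  # the 401 is gone; the UI just shows a blank name


def load_name_visible(token: str, events: list) -> str:
    # Same call, but the failure is recorded as structured data first.
    resp = fetch_profile(token)
    if resp.status != 200:
        events.append(
            {"kind": "network", "call": "fetch_profile", "status": resp.status}
        )
        return ""
    return json.loads(resp.body)["name"]
```

Both loaders return an empty string for an expired token, so the UI looks identical either way. Only the second one leaves a record an agent (or a human) can check afterwards.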

I tried pretty much everything out there and none of it quite scratched the itch I had: fast, structured, cross-platform runtime visibility without vision bloat or having to wire up a ton of hooks.

Quick rundown of the usual suspects:

- Pure vision/computer-use (Claude 3.5/4, ADEPT-style): zero setup, works on anything, but latency from hell and token burn is brutal for anything longer than a demo
- Playwright / browser MCP servers: fast and structured for web, but web-only, selectors shatter like glass, no native mobile
- Appium + vision hybrids: cross-platform on paper, but still vision-dependent and setup is a pain
- Sandboxed agents (OpenHands, SWE-agent): decent for repo tasks and shell stuff, not so much for live app UI/network state
- Explicit hooks/bridges: precise when you bother adding them, but requires code changes, which sucks

Couldn't find anything that gave me low-latency structured JSON state (UI elements, network, errors, logs) across platforms, local-first, without the usual trade-offs. So yeah, I got fed up and built a small local MCP server to solve it for myself.
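To make "structured JSON state" concrete, here's a rough Python sketch of what one snapshot could look like. This is NOT Autonomo's actual schema; `Snapshot`, `NetworkCall`, `action_id`, and every field name are assumptions invented for illustration. The idea is that each network call carries the id of the agent action that triggered it, so background pings can be filtered out:

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class NetworkCall:
    url: str
    status: int
    # Hypothetical correlation field: None means background traffic
    # that no agent action triggered.
    action_id: Optional[str] = None


@dataclass
class Snapshot:
    platform: str
    ui: list = field(default_factory=list)       # visible elements as plain dicts
    network: list = field(default_factory=list)  # NetworkCall entries
    errors: list = field(default_factory=list)
    logs: list = field(default_factory=list)

    def caused_by(self, action_id: str) -> list:
        # Keep only the calls tied to one specific agent action.
        return [c for c in self.network if c.action_id == action_id]

    def to_json(self) -> str:
        # asdict() recurses into the nested dataclasses inside the lists.
        return json.dumps(asdict(self), sort_keys=True)


snap = Snapshot(
    platform="ios",
    ui=[{"id": "save-button", "text": "Save", "enabled": True}],
    network=[
        NetworkCall("/api/save", 401, action_id="tap-save"),
        NetworkCall("/api/ping", 200),  # unrelated background ping
    ],
)

# Which requests failed because of the "tap-save" action specifically?
failed = [c.url for c in snap.caused_by("tap-save") if c.status >= 400]
```

An agent reading `snap.to_json()` gets the UI, the 401, and the causal link in one small payload, with no screenshot involved.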

Full disclosure: it's called Autonomo MCP (https://github.com/sebringj/autonomo). Very early, just launched.

I don't usually do this "I built a thing" thing — my open-source contributions are mostly small fixes and PRs — but I honestly couldn't see a better way in the current landscape.

I'm hoping Anthropic (or someone) will eventually ship a clean native solution for this. They already shipped BM25-based tool search, which shrinks context like crazy; I'd love to see them (or the industry) make runtime validation "just work" out of the box too.

Sometimes when you code in a vacuum you think your shit smells good. lmk if I'm off base here, I grew up with a mean grandpa so I'm cool with it.

2 comments

GahLak|19 days ago

You've nailed the real friction point that demos gloss over: agents are great at generation but terrible at verification in production systems. The vision latency tax is brutal once you hit real workflows.

sebringj|19 days ago

ya, for real. My boss was like, let's do e2e testing with AI, look for solutions out there... then like 2 days later he's like, wtf is this bill, and I was like, you wanted that, right? I was using vision calls in Azure Foundry and it came to over 100 bucks or something in just the 2 days of me setting it up and trying it out with all the test cases it had.