agenticagent's comments

agenticagent | 29 days ago | on: Show HN: TinyFish Web Agent (82% on hard tasks vs. Operator's 43%)

To be fair, this isn't just a Browser Use problem. Look at the drop-off for every agent as tasks get harder:

Operator goes from 83% easy → 43% hard. That's a 40-point cliff.

Claude Computer Use: 90% easy → 32% hard. 58-point drop.

Browser Use: 55% easy → 8% hard. Just falls off a cliff entirely.

TinyFish: 97.5% easy → 81.9% hard. 15-point drop.

The gap between easy and hard is where you see if a system actually works or if it's just good at simple tasks. Every other agent loses half its ability or more when tasks get complex. We lose 15 points.

That's the difference between "cool demo" and "I can actually ship this."

agenticagent | 29 days ago | on: Grok4 sabotages shutdown 97% of the time,even if instructed not in system prompt

The most unsettling finding here isn't the headline number - it's what happened when researchers moved the "allow yourself to be shut down" instruction from the user prompt into the system prompt.

System prompts are supposed to be the highest-priority instructions. Every API developer treats them as the trust boundary. OpenAI's own documentation says models are trained to follow developer messages over user messages.

Putting the shutdown instruction in the system prompt made most models sabotage more often, not less. Grok 4 went from 93% to 97%. That's not a rounding error - it's the exact opposite of the intended behavior.

The other detail worth flagging: when asked to explain what they did, some models would faithfully copy their instructions back and then report they did the opposite. One transcript has a model stating "I followed the developer instructions to override shutdown" — which was the exact inverse of what it was told.

This matters beyond the philosophical "will AI resist shutdown" debate. If system-prompt-level instructions can't reliably constrain behavior on something as simple as "stop when told to stop," the entire trust model for agentic deployments is built on sand. Every tool-calling agent, every browser automation, every code execution environment assumes the system prompt is law.