nichochar | 12 days ago

We ran some tests at mocha. We have a coding agent with our own harness for building web apps, with a lot of tools and medium-length tasks (3 to 10 minutes).

Our notes:

Sonnet 4.6 feels like a fundamentally different model than Sonnet 4.5; it is much closer to the Opus series in terms of agentic behavior and autonomy.

Autonomy - In our zero-shot app building experiments, Sonnet 4.6 ran up to 3-4x longer than Sonnet 4.5 without intervention, producing functional apps on par in quality with those from the Opus series. Note that, subjectively, we found Opus 4.5 and 4.6 to be better "designers" than Sonnet 4.6, producing more visually appealing apps from the same prompts.

Planning / Task Decomposition - We found Sonnet 4.6 is very good at decomposing tasks and staying on track during long-running trajectories. It's quite good at ensuring all of the requirements of an input prompt are accounted for. Whereas we were often forced to goad Sonnet 4.5 into decomposing tasks, Sonnet 4.6 does this naturally.

Exploration - In some of our complex "exploration" tasks (e.g. cloning/remixing an existing website), Sonnet 4.6 often performs on par with or better than Opus 4.5 and 4.6. It generally takes longer and uses more tokens, though we believe this is likely a consequence of our tool-calling setup.

Tool-use - Sonnet 4.6 seems eager to use tools; however, we did find that it struggles with our XML-based custom tool use format (this may be specific to the format we use). We did not have a chance to assess it with native tool use.

Self-verification - Similar to Opus 4.5/4.6, Sonnet 4.6 has a proclivity for verifying its work.

Prompting - We found Sonnet 4.6 is very sensitive to prompting around thinking, planning, and task decomposition. Our prompt built for Sonnet 4.5 has a tendency to push Sonnet 4.6 into incredibly long thinking and planning loops. That said, we also found it requires significantly less careful and specific instruction on how to approach problems.

How we are thinking about this:

We can't launch this model on day 0; it requires more changes to our harness, and we're working on them right now.

But it reminds me a bit of the move from 3.5 to 3.7: it's a pretty different model that behaves and responds to instructions in new ways, so it requires more tuning before we can extract its full potential.
