pcwelder|18 days ago
In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.
For those who are curious, the benchmark is just the model's ability to follow a custom tool calling format. I ask it to do coding tasks using chat.md [1] + MCPs. And so far it's just not able to follow it at all.
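To make "follows a custom tool calling format" concrete, here is a rough sketch of how such a check could be scored. The `<tool_call name="...">` block syntax below is a made-up stand-in, not chat.md's actual format, and `parse_tool_calls`, `follows_format` and `score` are hypothetical helper names.

```python
import json
import re

# Hypothetical format (an assumption, NOT chat.md's real syntax):
#   <tool_call name="read_file">
#   {"path": "a.txt"}
#   </tool_call>
TOOL_CALL_RE = re.compile(
    r'<tool_call name="(?P<name>[A-Za-z_]\w*)">\s*(?P<body>.*?)\s*</tool_call>',
    re.DOTALL,
)


def parse_tool_calls(reply):
    """Return (name, args) for every well-formed tool call in a model reply."""
    calls = []
    for match in TOOL_CALL_RE.finditer(reply):
        try:
            args = json.loads(match.group("body"))
        except json.JSONDecodeError:
            continue  # body is not valid JSON, so the call does not conform
        if isinstance(args, dict):
            calls.append((match.group("name"), args))
    return calls


def follows_format(reply):
    """True if the reply has at least one call and every opened call is well formed."""
    opened = reply.count("<tool_call")
    return opened > 0 and opened == len(parse_tool_calls(reply))


def score(replies):
    """Fraction of replies that follow the custom format."""
    if not replies:
        return 0.0
    return sum(follows_format(r) for r in replies) / len(replies)
```

A reply that falls back to some other tool format (for example a bare JSON object) counts as a miss, which is the failure mode being measured here.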
manofmanysmiles|18 days ago
I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it.
I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.
pcwelder|18 days ago
I couldn't decide on folding and reducing noise so I'm stuck on that front. I believe there is some elegant solution that I'm missing, hope to see your take.
data-ottawa|18 days ago
Have you had good results with the other frontier models?
thegeomaster|18 days ago
pcwelder|18 days ago
GPT models follow the tool format correctly but stop early instead of continuing the task.
Grok-4+ models are decent but have issues in longer chats.
Kimi 2.5 has a problem with reverting to its RL tool format.
nolist_policy|18 days ago
pcwelder|18 days ago
sergiotapia|18 days ago
nullbyte|18 days ago