top | item 46977903

(no title)

pcwelder | 18 days ago

It's live on openrouter now.

In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.

To those who are curious, the benchmark is just the ability of model to follow a custom tool calling format. I ask it to using coding tasks using chat.md [1] + mcps. And so far it's just not able to follow it at all.

[1] https://github.com/rusiaaman/chat.md

discuss

manofmanysmiles|18 days ago

I love the idea of chat.md.

I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it.

I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.

pcwelder|18 days ago

Cool! Please share your work if possible!

I couldn't decide on folding and reducing noise so I'm stuck on that front. I believe there is some elegant solution that I'm missing, hope to see your take.

data-ottawa|18 days ago

Custom tool calling formats are iffy in my experience. The models are all reinforcement learned to follow specific ones, so it’s always a battle and feels to me like using the tool wrong.

Have you had good results with the other frontier models?

thegeomaster|18 days ago

Not the parent commenter, but in my testing, all recent Claudes (4.5 onward) and the Gemini 3 series have been pretty much flawless in custom tool call formats.

pcwelder|18 days ago

All anthropic models. Gemini 2.5 pro and above. Gemini 3 flash is very good too.

GPT models can follow tool format correctly but don't keep on going.

Grok-4+ are decent but with issues in longer chats.

Kimi 2.5 has issues with it reverting to its RL tool format.

nolist_policy|18 days ago

Could also be the provider that is bad. Happens way too often on OpenRouter.

pcwelder|18 days ago

I had added z-ai in allow list explicitly and verified that it's the one being used.

sergiotapia|18 days ago

Be careful with openrouter. They routinely host quantized versions of models via their listed providers and the models just suck because of that. Use the original providers only.

nullbyte|18 days ago

I specifically do not use the CN/SG based original provider simply because I don't want my personal data traveling across the pacific. I try to only stay on US providers. Openrouter shows you what the quantization of each provider is, so you can choose a domestic one that's FP8 if you want