camel_Snake | 4 months ago
Funny you should say that, because while it is a large model, GLM-4.5 is at the top of Berkeley's Function Calling Leaderboard [0] and has one of the lowest costs. Can't comment on speed compared to those smaller models, but the Air version of 4.5 is similarly highly ranked.
Topfi | 4 months ago
Problem is, while Gorilla was an amazing resource back in 2023 and continues to be a great dataset to lean on, most ways we use LLMs in multi-step tasks have evolved greatly since, not just in structured JSON calling (which the GorillaOpenFunctionsV2/v4 eval does cover, multi-call included), but more in the scaffolding around models (Claude Code vs Codex vs OpenCode, etc.). That is likely why good performance on Gorilla doesn't necessarily map onto multi-step workloads with day-to-day tooling, which is what I tend to go for, and the reason why, despite FOSS options already existing, most labs either built their own coding-assistant tooling (and most open source that too) or felt the need to fork others' (Qwen with Gemini's repo).
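To make that distinction concrete, here is a rough Python sketch of the single-turn structured-JSON call that function-calling leaderboards score, versus the loop that agentic scaffolding runs. It uses the OpenAI-style tools schema purely for illustration; the get_weather tool and the model name are placeholders of my own, not anything from BFCL itself:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # One tool, declared as a JSON schema -- this is the "structured JSON"
    # style of function calling that single-turn benchmarks score directly.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-5",  # placeholder model name
        messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
        tools=tools,
    )

    # A single-turn eval checks this one emitted call for correctness and stops.
    call = resp.choices[0].message.tool_calls[0]
    print(call.function.name, call.function.arguments)

    # Agentic scaffolding (Claude Code, Codex, OpenCode) instead loops from here:
    # run the tool, append its output as a "tool" message, call the model again,
    # and repeat until the model stops requesting tools. Getting that loop right
    # is exactly what a one-shot leaderboard score doesn't capture.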
That part is purely speculative, but for GLM-4.6 I evaluated it using the same tasks as other models, via Claude Code with their endpoint, as that is what they advertise as the official way to use the model; same reason I use e.g. Codex for GPT-5. I'm more focused on results in the best case than on e.g. using OpenCode for all models to give a more level playing field.