(no title)
TrainedMonkey|1 day ago
I have two comments of my own to add to that. The first is that there is a problem-alignment issue at play. Specifically, the benchmarks are mostly self-contained problems with well-defined solutions and specific prompt language, while human tasks are open-ended, with messy prompts and a lot of steering. The second is that it would be interesting to test older models on brand-new benchmarks to see how they compare.
Aurornis|1 day ago
That's a much better way to say it than I did.
These models are known for being open weights, but they're still products that Alibaba Cloud is trying to sell. They have Product Managers and PR and marketing people under pressure to get people using them.
This VentureBeat article is basically a PR piece for the models and Alibaba Cloud hosting. The pricing table is right in the article.
It's cool that they release the models for us to use, but don't think they're operating entirely altruistically. They're playing a business game just like everyone else.
unknown|1 day ago
[deleted]
amelius|1 day ago
That way, we can have a benchmark that is always up to date.
lurkshark|12 hours ago
https://swe-rebench.com/
https://livebench.ai/