lalassu | 3 months ago
I don't want to make big generalizations. But one thing I've noticed with Chinese models, especially Kimi, is that they do very well on benchmarks but fail on vibe testing. They feel a little overfitted to the benchmarks and less tuned to real use cases.
I hope it's not the same here.
msp26|3 months ago
If it had vision and was better on long context I'd use it so much more.
vorticalbox|3 months ago
I guess that’s kinda how it is for any system that’s trained to do well on benchmarks: it does well on those but is rubbish at everything else.
make3|3 months ago
CuriouslyC|3 months ago
make3|3 months ago
segmondy|3 months ago
nylonstrung|3 months ago
Whereas the benchmark gains seen with new OpenAI, Grok and Claude models don't feel accompanied by any vibe improvement
not_that_d|3 months ago
catigula|3 months ago
BizarroLand|3 months ago
unknown|3 months ago
[deleted]
catigula|3 months ago