Hey folks, how do you personally evaluate new HN models? Vibes? Or do you have some tests you like to run? Or do you just use them in your IDE/text interface for a bit and see how it feels? I know we could probably trust some of the more public benchmarks, but I'm curious about personal evaluation techniques. Thanks!
incomingpain|7 months ago
I also have one seat-of-my-pants test: 'give me a story', themed around whatever my kid likes lately.
Overall, from my testing, the good players like Claude get it correct on the first go. Amazing. But I don't mind giving it feedback; what matters is how many times I need to re-correct it. qwen-coder needed an excessive number of corrections.