item 40169872


wesleyyue | 1 year ago

HumanEval is generally a very poor benchmark imo, and I hate that it's become the default "code" benchmark in every model release. I find it more useful to look at MMLU as a ballpark of overall model ability and then vibe check the model on code myself.
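For context on what's being criticized: HumanEval tasks are function signatures plus docstrings that the model must complete, scored by running hidden unit tests against the completion. A minimal sketch of that scoring loop (illustrative only, not the official harness, which sandboxes and time-limits execution):

```python
# Sketch of HumanEval-style scoring: a completion passes a task if the
# assembled program (prompt + completion + tests) executes without error.

def score_completion(prompt: str, completion: str, test_code: str) -> bool:
    """Return True if prompt+completion passes the task's unit tests."""
    program = prompt + completion + "\n" + test_code
    env: dict = {}
    try:
        exec(program, env)  # real harnesses sandbox and time-limit this
        return True
    except Exception:
        return False

# Toy task in the HumanEval format: signature + docstring as the prompt.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
completion = "    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(score_completion(prompt, completion, tests))            # passing solution
print(score_completion(prompt, "    return a - b\n", tests))  # failing solution
```

Since each task is scored by a handful of small asserts like this, a model can look strong on HumanEval while still failing the messier, multi-file problems a copilot actually sees, which is roughly the complaint above.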

source: I'm hacking on a high-performance coding copilot (https://double.bot/) and play with a lot of different models for coding. I'm also adding Qwen 110b now so I can vibe check it. :)


andai|1 year ago

Didn't Microsoft use HumanEval as the basis for developing Phi? If so, I'd say it works well enough! (At least for Phi 3; I haven't tested the others much.)

Though their training set is proprietary, it can be leaked by chatting with Phi 1_5 about pretty much anything: it just randomly starts outputting the proprietary training data.

coder543|1 year ago

I agree HumanEval isn't great, but I've found it's better than nothing. Maybe we'll get better benchmarks someday.

What would make "Double" higher performance than any other hosted system?