item 40169872


wesleyyue | 1 year ago

HumanEval is generally a very poor benchmark imo, and I hate that it's become the default "code" benchmark in every model release. I find it more useful to look at MMLU as a ballpark of overall model ability and then vibe check the model on code myself.
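For context on what's being criticized: HumanEval tasks are function signatures plus docstrings that the model must complete, scored by running hidden unit tests against the completion. A minimal sketch of that scoring loop (illustrative only, not the official harness, which sandboxes and time-limits execution):

```python
# Sketch of HumanEval-style scoring: a completion passes a task if the
# assembled program (prompt + completion + tests) executes without error.

def score_completion(prompt: str, completion: str, test_code: str) -> bool:
    """Return True if prompt+completion passes the task's unit tests."""
    program = prompt + completion + "\n" + test_code
    env: dict = {}
    try:
        exec(program, env)  # real harnesses sandbox and time-limit this
        return True
    except Exception:
        return False

# Toy task in the HumanEval format: signature + docstring as the prompt.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
completion = "    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(score_completion(prompt, completion, tests))            # passing solution
print(score_completion(prompt, "    return a - b\n", tests))  # failing solution
```

Since each task is scored by a handful of small asserts like this, a model can look strong on HumanEval while still failing the messier, multi-file problems a copilot actually sees, which is roughly the complaint above.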

source: I'm hacking on a high-performance coding copilot (https://double.bot/) and play with a lot of different models for coding. I'm also adding Qwen 110b now so I can vibe check it. :)


andai|1 year ago

Didn't Microsoft use HumanEval as the basis for developing Phi? If so, I'd say it works well enough! (At least for Phi 3; I haven't tested the others much.)

Though their training set is proprietary, it can be leaked by chatting with Phi 1_5 about pretty much anything: it just randomly starts outputting the proprietary training data.

coder543|1 year ago

I agree HumanEval isn't great, but I've found it's better than nothing. Maybe we'll get better benchmarks someday.

What would make "Double" higher performance than any other hosted system?