nathan_phoenix|8 months ago
You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...
Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
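A rough sketch of that repeated-sampling idea in Python; generate_svg() and score() are hypothetical stand-ins for the actual generation and judging steps:

    from statistics import mean

    N_SAMPLES = 10

    def generate_svg(model: str, prompt: str) -> str:
        raise NotImplementedError("call the model's API here")

    def score(svg: str) -> float:
        raise NotImplementedError("a judge model or human rating goes here")

    def benchmark(models: list[str], prompt: str) -> dict[str, float]:
        # Average over N samples so one lucky or unlucky draw
        # doesn't decide a model's ranking.
        return {
            m: mean(score(generate_svg(m, prompt)) for _ in range(N_SAMPLES))
            for m in models
        }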
simonw|8 months ago
I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.
(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)
I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.
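One way the three-judge voting could work, as a sketch; judge_vote() is a hypothetical wrapper around a vision-LLM comparison call, and the judge names are placeholders:

    from collections import Counter

    JUDGES = ["judge-a", "judge-b", "judge-c"]  # placeholder model names

    def judge_vote(judge: str, image_a: bytes, image_b: bytes) -> str:
        raise NotImplementedError("ask the vision model which image wins")

    def run_round(image_a: bytes, image_b: bytes) -> tuple[str, bool]:
        # Majority vote decides the round; the unanimity flag makes it
        # easy to track the rounds where the judges disagree.
        votes = Counter(judge_vote(j, image_a, image_b) for j in JUDGES)
        winner, count = votes.most_common(1)[0]
        return winner, count == len(JUDGES)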
demosthanos|8 months ago
Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...
Breza|8 months ago
In your case, it would be neat to have a bunch of different models (and maybe MTurk) pick the winners of each head-to-head matchup and then compare how stable the Elo scores are between evaluators.
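For reference, a minimal Elo sketch along those lines; the 1000 starting rating and K-factor of 32 are conventional choices, and the (winner, loser) pairs would come from whichever evaluator is judging. Run it once per evaluator and compare the resulting tables:

    K = 32  # conventional K-factor

    def elo_update(r_winner: float, r_loser: float) -> tuple[float, float]:
        # Standard Elo: the winner gains more when the upset is bigger.
        expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
        delta = K * (1 - expected)
        return r_winner + delta, r_loser - delta

    def ratings(matchups: list[tuple[str, str]]) -> dict[str, float]:
        # matchups: (winner, loser) pairs as judged by one evaluator.
        table: dict[str, float] = {}
        for winner, loser in matchups:
            rw = table.setdefault(winner, 1000.0)
            rl = table.setdefault(loser, 1000.0)
            table[winner], table[loser] = elo_update(rw, rl)
        return table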
ontouchstart|8 months ago
Any concerns that open source “AI celebrity talks” like yours could be used in contexts that would allow LLMs to optimize their market share in ways we can’t imagine yet?
Your talk might influence the funding of AI startups.
#butterflyEffect
viraptor|8 months ago
I actually don't think I've seen a single correct SVG drawing for that prompt.
cyanydeez|8 months ago
Call it wikipediaslop.org
puttycat|8 months ago
In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.
In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect its output to be correct, since correct outputs are precisely what lowers the model's loss. These outputs clearly indicate flawed knowledge.
ben_w|8 months ago
Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/
bufferoverflow|8 months ago
What kind of humans are you surrounded by?
Ask any human to write 3 sentences about a specific topic. Then ask them the exact same question the next day. They will not write the same 3 sentences.
mooreds|8 months ago
I get that it was way easier to do, cost pennies, and took no time. But I would have loved it if he'd tried alternate methods of judging and seen what the results were.
Other ways:
* wisdom of the crowds (have people vote on it)
* wisdom of the experts (send the pelican images to a few dozen artists or ornithologists)
* wisdom of the LLMs (use more than one LLM)
Would have been neat to see what the human consensus was and whether it differed from the LLM consensus.
Anyway, great talk!
timewizard|8 months ago
https://www.google.com/search?q=pelican&udm=2
The "closest pelican" is not even close.
qeternity|8 months ago
And there is no reason that these models need to be non-deterministic.
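For instance, sampling can largely be pinned down via API parameters. A sketch using the OpenAI Python client, where temperature=0 makes decoding greedy and seed makes the remaining randomness reproducible on a best-effort basis (backend changes can still shift outputs):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user",
                   "content": "Generate an SVG of a pelican riding a bicycle"}],
        temperature=0,  # greedy decoding: no sampling randomness
        seed=42,        # best-effort reproducibility for the rest
    )
    print(response.choices[0].message.content)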
skybrian|8 months ago
So there’s still the question of how controllable the LLM really is. If you change a prompt slightly, how unpredictable is the change? That can’t be tested with one prompt.
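A sketch of that perturbation test; generate() is a hypothetical model call, and difflib's similarity ratio is a crude proxy for how different two outputs are:

    from difflib import SequenceMatcher
    from itertools import combinations

    VARIANTS = [
        "Generate an SVG of a pelican riding a bicycle",
        "Generate an SVG of a pelican on a bicycle",
        "Please generate an SVG of a pelican riding a bicycle",
    ]

    def generate(prompt: str) -> str:
        raise NotImplementedError("call the model here")

    def instability(prompts: list[str]) -> float:
        # Higher means small prompt changes move the output more.
        outputs = [generate(p) for p in prompts]
        sims = [SequenceMatcher(None, a, b).ratio()
                for a, b in combinations(outputs, 2)]
        return 1 - sum(sims) / len(sims)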
rvz|8 months ago
My thoughts too. It's more accurate to label LLMs as non-deterministic instead of "probabilistic".