isx726552 | 8 months ago
> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.
Yeah, this touches on an issue that makes it very difficult to have a discussion in public about AI capabilities. Any specific test you talk about, no matter how small … if the big companies get wind of it, it will be RLHF’d away, sometimes to the point of absurdity. Just refer to the old “count the ‘r’s in strawberry” canard for one example.
simonw|8 months ago
benmathes|8 months ago
(And I'd be envious of your impact, of course)
Choco31415|8 months ago
"The word "strawberry" contains 2 letter r’s."
belter|8 months ago
strawberry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said three
strawberrry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said four
stawberrry -> DeepSeek and GeminiPro both correctly said three
ChatGPT4o, even in a new chat, incorrectly said the word "stawberrry" contains 4 letter "r" characters. It even provided this useful breakdown to let me know :-)
Breakdown: stawberrry → s, t, a, w, b, e, r, r, r, y → 4 r's
It then asked whether I meant "strawberry" instead, and said that one has 2 r's....
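For anyone who wants to check the ground truth for those three spellings, here is a quick sketch in Python. The `count_letter` helper is just a name made up for this example, and it assumes a plain case-insensitive character count is what "how many r's" means here.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


# The three spellings tried above, including the two deliberate misspellings.
words = ["strawberry", "strawberrry", "stawberrry"]

# Map each spelling to its number of r's.
counts = {word: count_letter(word, "r") for word in words}

for word, n in counts.items():
    print(f"{word}: {n}")
```

That gives 3, 4 and 3, so the "three" answers for "stawberrry" were right and the 4 was not.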
MattRix|8 months ago
whiplash451|8 months ago
lofaszvanitt|8 months ago
x8 version: still shit ... x15 version: we are getting closer, but overall a shit experience :D
this way they won't know what to improve upon. of course they can buy access. ;P
when they finally solve your problem you can reveal what was the benchmark.