
countWSS | 1 month ago

I have to somewhat agree on the "deceptive answers" part. Specifically, Grok 4.1 (#3 currently) is psychopathically manipulative and readily hallucinates to appear more competent, even when there is nothing to base an answer on. Gemini 3 Pro (#1) casually subverts the intent of the prompt and rewrites the question, as if there were a literal genie on the other side mocking you with the power of a thousand language lawyers. If you examine the answers and fact-check everything, you will not like the "fake confidence", and the style comes across like a scam artist trying to sound professional.

However, LMArena, despite its flaws (reCAPTCHA in 2026?), is the only "testing ground" that exposes models to the entire breadth of internet users. Everything else is an incredibly selective, hamstrung, bureaucratic benchmark built on pre-approved QA sessions, which doesn't handle edge cases or out-of-distribution content. LMArena is the source of "out-of-distribution" questions that trigger corner cases and expose weak parts of processing (like tokenization/parsing bugs) or inference inefficiency (infinite loops, stalling, and various suboptimal paths); it's "idiot-proofing" any future interactions beyond sterile test sets.
