pinko|3 months ago
While the private questions don't seem to be included in the performance results, the HLE maintainers will presumably flag any LLM that appears to have gamed its scores, based on how differently it performs on the private questions. Since they haven't flagged any yet, I think the scores are relatively trustworthy.
panarky|3 months ago
This was the primary bottleneck preventing models from tackling novel scientific problems they haven't seen before.
If Gemini 3 Pro has transcended "reading the internet" (knowledge saturation), and made huge progress in "thinking about the internet" (reasoning scaling), then this is a really big deal.