Not sure if people picked up on it, but this is being powered by the unreleased o3 model, which might explain why it leaps considerably ahead in benchmarks, and it aligns with the claims that o3 is too expensive to release publicly. It seems to be quite an impressive model, and the leader among the offerings from Google, DeepSeek and Perplexity.
lordofgibbons|1 year ago
It's the only tool/system (I won't call it an LLM) in their released benchmarks that has access to tools and the web. So, I'd wager the performance gains are strictly due to that.
If an LLM (o3) is too expensive to be released to the public, why would you use it in a tool that has to make hundreds of inference calls to it to answer a single question? You'd use a much cheaper model. Most likely o3-mini or o1-mini combined with 4o-mini for some tasks.
famouswaffles|1 year ago
The same reason a lot of people switched to GPT-4 when it came out even though it was much more expensive than 3: it doesn't matter how cheap a model is if it isn't good enough or is much worse.
xbmcuser|1 year ago
willy_k|1 year ago
Sparkyte|1 year ago
bbor|1 year ago
Effectiveness in this task environment is well beyond the specific model involved, no? Plus they'd be fools (IMHO) to only use one size of model for each step in a research task -- sure, o3 might be an advantage when synthesizing a final answer or choosing between conflicting sources, but there are many, many steps required to get to that point.
xendipity|1 year ago
I wonder how much of an impact it has that we're still so early in the productization phase of all this. It takes a ton of work, training and coordination to get multiple models synced up into one offering, and I think the companies are still optimizing for getting new ideas out there rather than truly optimizing them.
mistercheph|1 year ago
petesergeant|1 year ago
unknown|1 year ago
[deleted]
bitshiftfaced|1 year ago
What makes you believe that?
_bin_|1 year ago
ai-christianson|1 year ago
nycdatasci|1 year ago
OpenAI is very much in an existential crisis, and their poor execution is not helping their cause. Operator or “deep research” should be able to assume the role of a Pro user, run a quick test, and reliably report on whether this is working before the press release, right?
maroonblazer|1 year ago
https://news.ycombinator.com/item?id=42913575