hypoxia's comments
hypoxia | 1 year ago | on: Ask HN: Predictions for 2025?
But there are now several major technical unlocks: fine-tuning for cursor locations (in Claude), better reasoning with o3, and RL fine-tuning, so models can learn from task success.
That gives me significant hope.
hypoxia | 1 year ago | on: Ask HN: Predictions for 2025?
1. They usually don't complete the right set of steps for a task, even within our human-defined frameworks (ReAct, ReWOO, supervisor-worker, multi-agent teams, etc.)
2. They get lost easily: they forget what they were doing, or repeat the same tasks over and over in a loop (bad planning)
3. They exit early, thinking they have completed the task when they have not (bad evaluation)
The jump in reasoning ability from 4o to o3 will enable a drastic improvement in planning and execution within our human-defined frameworks.
But, more importantly, I believe RL fine-tuning will enable the model to learn better general approaches to planning and executing the steps needed to complete work. This is Sutton's bitter lesson at work.
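The three failure modes above can be guarded explicitly in the agent loop itself. Here's a minimal sketch: a step budget plus repeated-action detection for failure mode 2, and a separate evaluation pass (rather than trusting the actor's own claim of success) for failure mode 3. `call_llm` and `execute` are placeholders, not any specific framework's API.

```python
def run_agent(task, call_llm, execute, max_steps=20):
    history = []          # (action, observation) pairs
    seen_actions = set()  # loop detection: repeating the same action verbatim
    for _ in range(max_steps):
        action = call_llm(f"Task: {task}\nHistory: {history}\nNext action?")
        if action in seen_actions:
            # Bad planning guard: the model is looping, so force a replan.
            action = call_llm(f"Task: {task}\nHistory: {history}\n"
                              "You already tried that. Pick a different action.")
        seen_actions.add(action)
        observation = execute(action)
        history.append((action, observation))
        # Bad evaluation guard: a dedicated completion check, separate
        # from the step that proposed the action.
        verdict = call_llm(f"Task: {task}\nHistory: {history}\nDone? yes/no")
        if verdict.strip().lower().startswith("yes"):
            return history
    return history  # step budget exhausted; caller decides what to do
```

The point is that completion checking and loop detection live in the harness, so a weaker planner still fails safely instead of spinning or exiting early.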
For me, desktop automation is the killer app of RL fine-tuning, rather than better reasoning in chatbot apps and APIs.
When OpenAI releases their desktop agent capabilities built on this, hopefully in Jan, I think we're going to see another ChatGPT moment.
Even if not, the ability to easily train the system to complete your tasks successfully with full desktop usage is going to be a major unlock for enterprises.
More on RL fine-tuning here: https://openai.com/form/rft-research-program/
hypoxia | 1 year ago | on: OpenAI O3 breakthrough high score on ARC-AGI-PUB
85% is just the (semi-arbitrary) threshold for winning the prize.
o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3.
...
Here's the full breakdown by dataset, since none of the articles make it clear --
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
hypoxia | 1 year ago | on: Ask HN: What's the hardest part of building AI Agents?
Every agent uses a meta-workflow (e.g. ReAct is reason->act->observe, with some added steps to check for completion, etc.).
The teams that have been successful with agents do so by building better but more complex workflows.
Most notably, AlphaCodium's "From Prompt Engineering to Flow Engineering" https://github.com/Codium-ai/AlphaCodium
Our current tools don't do a great job of making it simple to build and iterate on these workflows.
For example, here's a HN post from yesterday where a user created their own workflow management platform because of their frustration with the leading tooling providers: https://news.ycombinator.com/item?id=42299098
I think once we get this tooling right and start to build more expertise in the process of flow engineering, we'll start to see faster improvement in agent quality.
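At its core, a flow-engineered workflow is just a graph of named steps with explicit transitions, so the control flow is inspectable and easy to iterate on. A minimal sketch (an illustration, not the API of AlphaCodium or any tooling provider mentioned above):

```python
def run_workflow(steps, start, state):
    """steps: name -> fn(state) -> (new_state, next_step_name or None)."""
    current = start
    while current is not None:
        state, current = steps[current](state)
    return state

# Example flow: plan -> act -> check, looping back to act until done.
def plan(s):
    return ({**s, "plan": ["a", "b"]}, "act")

def act(s):
    item, rest = s["plan"][0], s["plan"][1:]
    return ({**s, "plan": rest, "done": s.get("done", []) + [item]}, "check")

def check(s):
    return (s, "act" if s["plan"] else None)

result = run_workflow({"plan": plan, "act": act, "check": check}, "plan", {})
```

Because each step is a plain function and transitions are data, you can swap in a different checker or add a retry step without touching the rest of the flow, which is exactly the kind of iteration the current tools make awkward.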
hypoxia | 1 year ago | on: Ask HN: Are privacy concerns with GenAI services overblown?
In terms of API usage, OpenAI has never used prompts for training, but this is very poorly understood among enterprise CEOs and CIOs. Executives heard about the Samsung incident early on (confidential information submitted by employees via the ChatGPT interface, which trained on user data by default at the time), and their trust was shaken in a fundamental way.
The email analogy is very apt: companies send all of their secrets to other people's computers for processing (cloud compute, email, etc.) without any issue. BUT there's a big caveat: abuse moderation. Prompts, including API calls, are normally stored by OpenAI/MS/etc. for a certain period and may be viewed by a human to check for abuse (e.g. using the system to generate phishing messages). This causes significant issues for certain types of data. Worth noting that the moderation-by-default approach is in the process of being dialed down, and there are now top-tier enterprise plans that are no longer moderated by third parties by default.
TL;DR: The concern stems from an early loss of trust (Samsung); there is a valid issue for certain types of data (abuse moderation); but there are ways around it if you have enough money (enterprise plans).
hypoxia | 3 years ago | on: Canada to ban foreigners from buying homes
In the last year, we've ended up #2 in 6 bidding wars (as disclosed by the listing agents) in one particular area of the GTA. In each case we reached our absolute max and wouldn't have paid any more.
Several times we lost by $100-200k, and once by $250k. These overpayments set new price benchmarks for the area which became sticky. To continue to be competitive, we had to make hard sacrifices to increase our budget throughout the year.
The fact that houses continued to move at the prices determined by these over-payments indicates there are some buyers at the higher prices. However, the market is very thin and the pace of price growth in this speculative market would've been slowed with open bidding.
Beyond the blind bidding issue, I don't know why people focus on foreign and large corporate buyers. Yes, they're scary because they represent potentially large sources of demand. But are they actually buying a large percentage of the homes? No. It's the smaller investors who speculatively bought 40% of homes last fall.
But we should definitely protect mom-and-pop investors, such as our housing minister *wink*
Real solution? Reduce the incentive for speculation among these groups by treating all gains on non-primary residences as income.
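To make the incentive gap concrete: in Canada, only 50% of a capital gain is currently included in taxable income, whereas the proposal would tax 100% of gains on non-primary residences as income. A back-of-envelope comparison, using an illustrative 45% marginal rate (an assumption, not a quoted figure):

```python
def tax_on_gain(gain, inclusion_rate, marginal_rate=0.45):
    # Tax owed = gain x fraction included in income x marginal rate.
    return gain * inclusion_rate * marginal_rate

gain = 300_000  # hypothetical profit on flipping a non-primary residence
current = tax_on_gain(gain, 0.50)   # capital-gains treatment: $67,500
proposed = tax_on_gain(gain, 1.00)  # taxed fully as income: $135,000
```

Doubling the tax on the same flip meaningfully changes the expected return on speculation without touching primary residences.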