hypoxia's comments
hypoxia | 1 year ago | on: Ask HN: Predictions for 2025?
But there are now several major technical unlocks: fine-tuning for cursor locations (in Claude), better reasoning with o3, and RL fine-tuning, so models can learn from task success.
That gives me significant hope.
hypoxia | 1 year ago | on: Ask HN: Predictions for 2025?
1. They usually don't complete the right set of steps for a task, even within our human-defined frameworks (ReAct, ReWOO, supervisor-worker, multi-agent teams, etc.)
2. They get lost easily: they forget what they were doing, or repeat the same tasks over and over in a loop (bad planning)
3. They exit early, thinking they have completed the task when they have not (bad evaluation)
The jump in reasoning ability from 4o to o3 will enable a drastic improvement in planning and execution within our human-defined frameworks.
But, more importantly, I believe RL fine-tuning will enable the model to learn better general approaches to planning and executing the steps needed to complete work. This is Sutton's bitter lesson at work.
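The three failure modes above can be guarded explicitly in the agent loop itself. Here's a minimal sketch: a step budget plus repeated-action detection for failure mode 2, and a separate evaluation pass (rather than trusting the actor's own claim of success) for failure mode 3. `call_llm` and `execute` are placeholders, not any specific framework's API.

```python
def run_agent(task, call_llm, execute, max_steps=20):
    history = []          # (action, observation) pairs
    seen_actions = set()  # loop detection: repeating the same action verbatim
    for _ in range(max_steps):
        action = call_llm(f"Task: {task}\nHistory: {history}\nNext action?")
        if action in seen_actions:
            # Bad planning guard: the model is looping, so force a replan.
            action = call_llm(f"Task: {task}\nHistory: {history}\n"
                              "You already tried that. Pick a different action.")
        seen_actions.add(action)
        observation = execute(action)
        history.append((action, observation))
        # Bad evaluation guard: a dedicated completion check, separate
        # from the step that proposed the action.
        verdict = call_llm(f"Task: {task}\nHistory: {history}\nDone? yes/no")
        if verdict.strip().lower().startswith("yes"):
            return history
    return history  # step budget exhausted; caller decides what to do
```

The point is that completion checking and loop detection live in the harness, so a weaker planner still fails safely instead of spinning or exiting early.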
For me, desktop automation is the killer app of RL fine-tuning, rather than better reasoning in chatbot apps and APIs.
When OpenAI releases their desktop agent capabilities built on this, hopefully in Jan, I think we're going to see another ChatGPT moment.
Even if not, the ability to easily train the system to complete your tasks successfully with full desktop usage is going to be a major unlock for enterprises.
More on RL fine-tuning here: https://openai.com/form/rft-research-program/
hypoxia | 1 year ago | on: OpenAI O3 breakthrough high score on ARC-AGI-PUB
85% is just the (semi-arbitrary) threshold for winning the prize.
o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3.
...
Here's the full breakdown by dataset, since none of the articles make it clear --
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
hypoxia | 1 year ago | on: Ask HN: What's the hardest part of building AI Agents?
Every agent uses a meta-workflow (e.g. ReAct is reason->act->observe, with some added steps to check for completion, etc.).
The teams that have been successful with agents do so by building better but more complex workflows.
Most notably, AlphaCodium's "From Prompt Engineering to Flow Engineering" https://github.com/Codium-ai/AlphaCodium
Our current tools don't do a great job of making it simple to build and iterate on these workflows.
For example, here's a HN post from yesterday where a user created their own workflow management platform because of their frustration with the leading tooling providers: https://news.ycombinator.com/item?id=42299098
I think once we get this tooling right and start to build more expertise in the process of flow engineering, we'll start to see faster improvement in agent quality.
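At its core, a flow-engineered workflow is just a graph of named steps with explicit transitions, so the control flow is inspectable and easy to iterate on. A minimal sketch (an illustration, not the API of AlphaCodium or any tooling provider mentioned above):

```python
def run_workflow(steps, start, state):
    """steps: name -> fn(state) -> (new_state, next_step_name or None)."""
    current = start
    while current is not None:
        state, current = steps[current](state)
    return state

# Example flow: plan -> act -> check, looping back to act until done.
def plan(s):
    return ({**s, "plan": ["a", "b"]}, "act")

def act(s):
    item, rest = s["plan"][0], s["plan"][1:]
    return ({**s, "plan": rest, "done": s.get("done", []) + [item]}, "check")

def check(s):
    return (s, "act" if s["plan"] else None)

result = run_workflow({"plan": plan, "act": act, "check": check}, "plan", {})
```

Because each step is a plain function and transitions are data, you can swap in a different checker or add a retry step without touching the rest of the flow, which is exactly the kind of iteration the current tools make awkward.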
hypoxia | 1 year ago | on: Ask HN: Are privacy concerns with GenAI services overblown?
In terms of API usage, OpenAI has never used prompts for training, but this is very poorly understood among enterprise CEOs and CIOs. Executives heard about the Samsung incident early on (confidential information submitted by employees via the ChatGPT interface, which trained on user data by default at the time), and their trust was shaken in a fundamental way.
The email analogy is very apt: companies send all of their secrets to other people's computers for processing (cloud compute, email, etc.) without any issue. BUT there's a big caveat: abuse moderation. Prompts, including API calls, are normally stored by OpenAI/MS/etc. for a certain period and may be viewed by a human to check for abuse (e.g. using the system to generate phishing messages). This causes significant issues for certain types of data. Worth noting that the moderation-by-default approach is in the process of being dialed down, and there are now top-tier enterprise plans that are no longer moderated by third parties by default.
TL;DR: The concern stems from an early loss of trust (Samsung); there is a valid issue for certain types of data (abuse moderation); but there are ways around it if you have enough money (enterprise plans).
hypoxia | 3 years ago | on: Canada to ban foreigners from buying homes
In the last year, we've ended up #2 in 6 bidding wars (as disclosed by the listing agents) in one particular area of the GTA. In each case we reached our absolute max and wouldn't have paid any more.
Several times we lost by $100-200k, and once by $250k. These overpayments set new price benchmarks for the area which became sticky. To continue to be competitive, we had to make hard sacrifices to increase our budget throughout the year.
The fact that houses continued to move at the prices determined by these over-payments indicates there are some buyers at the higher prices. However, the market is very thin and the pace of price growth in this speculative market would've been slowed with open bidding.
Beyond the blind bidding issue, I don't know why people focus on foreign and large corporate buyers. Yes, they're scary because they represent potentially large sources of demand. But are they actually buying a large percentage of the homes? No. It's the smaller investors who speculatively bought 40% of homes last fall.
But we should definitely protect mom-and-pop investors, such as our housing minister *wink*
Real solution? Reduce the incentive for speculation among these groups by treating all gains on non-primary residences as income.
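To make the incentive gap concrete: in Canada, only 50% of a capital gain is currently included in taxable income, whereas the proposal would tax 100% of gains on non-primary residences as income. A back-of-envelope comparison, using an illustrative 45% marginal rate (an assumption, not a quoted figure):

```python
def tax_on_gain(gain, inclusion_rate, marginal_rate=0.45):
    # Tax owed = gain x fraction included in income x marginal rate.
    return gain * inclusion_rate * marginal_rate

gain = 300_000  # hypothetical profit on flipping a non-primary residence
current = tax_on_gain(gain, 0.50)   # capital-gains treatment: $67,500
proposed = tax_on_gain(gain, 1.00)  # taxed fully as income: $135,000
```

Doubling the tax on the same flip meaningfully changes the expected return on speculation without touching primary residences.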