> Our latest frontier models have shown particular strengths in their ability to do long-running tasks, working autonomously for hours, days or weeks without intervention.
I have yet to see this (produce anything actually useful).
I've been finding that the Opus 4.5/4.6 and GPT-5.2/5.3 models really have represented a step-change in how good they are at running long tasks.
I can one-shot prompt all sorts of useful coding challenges now that previously I would have expected to need multiple follow-ups to fix mistakes the agents made.
How do you deal with the cost associated with a long-running Opus session? I asked it to validate some JSON configs against the spec yesterday and it burned $10 worth of tokens for what would have been a 1 millisecond linter task.
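For context on why that feels like overkill: the deterministic version of the check is only a few lines of ordinary code. A minimal sketch, assuming the spec is available as a JSON Schema file (the file and directory names here are made up):

    # Deterministic check of JSON config files against a JSON Schema "spec".
    # File names (spec.schema.json, configs/*.json) are hypothetical.
    import json
    import pathlib
    import sys

    from jsonschema import ValidationError, validate  # pip install jsonschema

    schema = json.loads(pathlib.Path("spec.schema.json").read_text())

    bad = 0
    for path in sorted(pathlib.Path("configs").glob("*.json")):
        try:
            validate(instance=json.loads(path.read_text()), schema=schema)
        except (json.JSONDecodeError, ValidationError) as err:
            bad += 1
            print(f"{path}: {err}")

    sys.exit(1 if bad else 0)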
I routinely leave Codex running for a few hours overnight to debug stuff.
If you have a deterministic unit test that can reproduce the bug through your app's front door, but you have no idea how the bug is actually happening, then having a coding agent grind through the slog of sticking debug prints everywhere and testing hypotheses is an ideal use case.
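To make that concrete, here is a sketch of what such a front-door reproduction could look like before it gets handed to an agent. The module, entry point, input, and expected total are all hypothetical; the point is only that the failure reproduces on demand:

    # A deterministic test that drives the app through its public entry point.
    from myapp.api import render_invoice  # hypothetical public entry point


    def test_total_applies_discount():
        # Known-bad input captured from a bug report.
        invoice = render_invoice(items=[{"price": 100, "qty": 2}], discount=0.10)
        # Fails today: somewhere inside the app the discount gets dropped.
        assert invoice["total"] == 180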
I have a hard time understanding how that would work. For me, I typically interface with coding agents through Cursor. The flow is like this: ask it something -> it works for a minute or two -> I verify and fix by asking it again, and so on, until we're at a happy place with the code. How do you keep it from going down a bad path and never pulling itself out of it?
The important role for me, as a SWE, in the process is to verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?
Or is it more like your use case: you can say "here's a failing test; do whatever you can to fix it and don't stop until you do." I could see that limited case working.
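That limited case is roughly a loop you can script yourself. A rough sketch, assuming the tests run under pytest and the agent exposes some command-line entry point (the agent command below is a placeholder, not a real tool):

    # Rough sketch of the "keep going until the test passes" loop.
    import subprocess

    AGENT_CMD = ["your-agent-cli", "--prompt"]  # hypothetical agent entry point
    MAX_ROUNDS = 20

    for rounds in range(1, MAX_ROUNDS + 1):
        tests = subprocess.run(
            ["pytest", "tests/test_invoice.py", "-x", "--tb=short"],
            capture_output=True, text=True,
        )
        if tests.returncode == 0:
            print(f"Green after {rounds} round(s).")
            break
        # Feed the failure output back to the agent and let it edit the repo.
        subprocess.run(AGENT_CMD + [f"This test is failing; fix it:\n{tests.stdout}"])
    else:
        print("Still failing after the round limit; giving up.")

The agent only needs the failure output and write access to the repo; the loop, not the agent, decides when the work is done.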
It's easy to say that these increasingly popular tools can only produce useless junk. Usually that means you haven't tried, or you haven't "closed the loop" so that the agent can evaluate its own progress toward acceptance criteria, or you're judging from the feeds of other users who use them incompetently.
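"Closing the loop" here mostly means making the acceptance criteria machine-checkable, so the agent can tell on its own whether it is making progress. A small sketch, with example commands standing in for whatever a real project would require:

    # Acceptance criteria as commands; each must exit 0 for the work to count.
    import subprocess

    ACCEPTANCE = [
        ["pytest", "-q"],        # behaviour: all tests green
        ["ruff", "check", "."],  # lint: no new warnings
        ["mypy", "src"],         # types: still type-checks
    ]

    def criteria_met() -> bool:
        """True only if every acceptance command exits 0."""
        return all(subprocess.run(cmd).returncode == 0 for cmd in ACCEPTANCE)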
I'm definitely bullish on LLMs for coding. It sounds to me as though getting one to run on its own for hours and produce something usable requires more careful thought and setup than just throwing a prompt at it and wishing for the best, but I haven't seen many examples in the wild yet.
Anthropic is actually sort of concerned with not burning through cash and charging people a reasonable price. OpenAI doesn't care. I can use Codex CLI all day and not approach any quotas with just my $20-a-month ChatGPT subscription.
I treat coding agents like junior developers and never take my hand off the wheel except for boilerplate refactoring.
The other day I got Codex to one-shot an upgrade to Vite 8 at my day job (a real website with revenue). It worked on this for over 3 hours without intervention (I went to sleep). This is now in production.
Agreed. I optimistically let it resolve merge conflicts in an old, complex branch. It looked fine at first but was utter slop upon further review: duplication, wildly unnecessary complexity, and all.
simonw|17 days ago
I got all of this from a single prompt, for example: https://github.com/simonw/research/tree/main/cysqlite-wasm-w... - including this demo page: https://simonw.github.io/research/cysqlite-wasm-wheel/demo.h... - using this single prompt: https://github.com/simonw/research/pull/79
aeyes|17 days ago
There are maybe 5 relevant lines in the script and nothing complex at all that would require running for days.
addaon|17 days ago
This is impressive; you've completely mitigated the risk of learning or understanding.
TheMuenster|17 days ago
"Our model is so slow and our tokens/second is so low that these tasks can take hours!" is not the advertising they think it is.