Bjorkbat|5 months ago

> Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks.

Really curious about this, since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release, and it doesn't show up at all in their system card. It's only through an article on The Verge that we get more context. Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)

I have very low expectations about what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions about the quality of the output.

cowboy_henk|5 months ago

Interestingly the internet is full of "slack clone" dev tutorials. I used to work for a company that provides chat backend/frontend components as a service. It was one of their go-to examples, and the same is true for their competitors.

While it's impressive that you can now just have an LLM build this, I wouldn't be surprised if the result of these 30 hours is essentially just a rehash of one of those example Slack clones, especially since all of these models have internet access nowadays. I honestly think 30 hours isn't even that fast for something like this, where you could realistically follow a tutorial and have it done.

In fact, I just did a quick Google search and found this 15-hour course about building a Slack clone: https://www.codewithantonio.com/projects/slack-clone

sigmoid10|5 months ago

This is obviously much more than just taking an LLM and letting it run for 30 hours. You have to build a whole environment together with external tool integration and context management, then tune the prompts, and perhaps even set up a multi-agent system. I believe that if someone puts a ton of work into this you can have an LLM run for that long and still produce sellable outputs, but let's not pretend this is something that average devs can do by buying some API tokens and kicking off a frontier model.
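For what it's worth, the shape of such a harness is simple even if the tuning isn't. Here's a minimal sketch of the loop; the names, message format, and tool interface are my own assumptions, not anything Anthropic has published:

```python
# Hypothetical agent loop: the model repeatedly proposes a tool call,
# the harness executes it, and the result is fed back as context.
def run_agent(task, llm, tools, max_steps=10_000):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = llm(history)                      # model proposes next action
        history.append({"role": "assistant", "content": reply["text"]})
        if reply.get("done"):                     # model signals completion
            return history
        result = tools[reply["tool"]](reply["args"])  # e.g. "run_tests"
        history.append({"role": "user", "content": f"tool output: {result}"})
    return history
```

The hard part isn't this loop; it's the tool implementations, the prompt tuning, and keeping `history` within the context window across thousands of steps.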

Philpax|5 months ago

Well, yes, that's Claude Code. And OpenAI Codex. And Google Gemini CLI.

Your average dev can just use those.

ChadMoran|5 months ago

Claude Code with a good prompt can run for hours.

NaomiLehman|5 months ago

That sounds to me like a full room of guys trying to figure out the most outrageous thing they can say about the update, without being accused of lying. Half of them on ketamine, the other on 5-MeO-DMT. Bat country. 2 months of 007 work.

Imagine reviewing 30 hours of 2025-LLM code.

shanecp|5 months ago

What they don't mention is all the tooling, MCPs and other stuff they've added to make this work. It's not 30 hours out of the box. It's probably heavily guard-railed, with a lot of validated plans, checklists and verification points they can check. It's similar to 'lab conditions', you won't get that output in real-world situations.

Bjorkbat|5 months ago

Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE results are barely an improvement yet somehow the model is a more significant improvement when it comes to long tasks. You'd expect a huge gain in one to translate to the other.

Unless the main area of improvement was tools and scaffolding rather than the model itself.

gapeslape|5 months ago

“30 hours of unattended work” is totally vague and doesn't mean anything on its own. It - at the very least - highly depends on the number of tokens you were able to process.

Just to illustrate, say you are running on a slow machine that outputs 1 token per hour. At that speed you would produce approximately one sentence.
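The arithmetic is easy to sketch; the 50 tokens/sec figure below is just an illustrative guess at a typical API decode speed, not a measured number:

```python
# Tokens produced in an unattended run of a given length.
def tokens_in_run(hours, tokens_per_second):
    return int(hours * 3600 * tokens_per_second)

slow = tokens_in_run(30, 1 / 3600)  # the "1 token per hour" machine: 30 tokens
fast = tokens_in_run(30, 50)        # guessed decode speed: 5,400,000 tokens
```

So depending on throughput, "30 hours" can mean anything from one sentence to millions of tokens of work.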

zelphirkalt|5 months ago

"Slack clone" is also super vague:

(First of all: Why would anyone in their right mind want a Slack clone? Slack is a cancer. The only people who want it are non-technical people, who inflict it upon their employees.)

Is it just group chat or 1-on-1 chat? Or does it have threads, emojis, voice calls, pinning of messages, all the CSS styling (which is probably already 11k lines or more for the real Slack), web hooks/apps?

Also, of course it's just a BS announcement, without honesty, if they don't publish a reproducible setup that leads to the same outcome they had. It's the equivalent of "But it worked on my machine!", or of "scientific" papers that prove anti-gravity with superconductors and perpetual-motion infinite energy, which only worked in a small shed where some supposed physics professor lives.

mh-|5 months ago

Has their comment been edited? A few words later it says it resulted in 11,000 LoC.

> [..] left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code [..]

zmmmmm|5 months ago

> Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code

It's going to be an issue, I think. Now that lots of these agents support computer use, we're at the point where you can install an app, tell the agent you want something that works exactly the same, and just let it run until it produces it.

The software world may find, sooner rather than later, that it has more in common with book authors than it thought, once full clones of popular apps start popping out of coding tools. It will be interesting to see whether this results in a war of attrition, with countermeasures and strict ToU that prohibit use by AI agents, etc.

stravant|5 months ago

That just means that owning the walled gardens and network effects will become yet more important.

walthamstow|5 months ago

It has been trivial to build a clone of most popular services for years, even before LLMs. One of my first projects was Miguel Grinberg's Flask tutorial, in which a total noob can build a Twitter clone in an afternoon.

What keeps people in are network effects and some dark patterns like vendor lock-in and data unportability.

technocrat8080|5 months ago

Curious about this too – does it use the standard context management tools that ship with Claude Code? With a 200K context window (or 1M for the beta version), I'm really interested in the techniques used to run it for 30 hours.
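One common way to outlast the window (and purely a guess as to what they actually do) is periodic compaction: once the transcript nears the limit, fold older turns into a summary and keep only the recent messages. A hypothetical sketch, where `summarize` stands in for another model call:

```python
# Compact a transcript that is approaching the context limit by
# replacing all but the most recent messages with a summary.
def compact(messages, limit_tokens, count_tokens, summarize, keep_recent=20):
    total = sum(count_tokens(m) for m in messages)
    if total <= limit_tokens:
        return messages                   # still fits, nothing to do
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [f"[summary] {summarize(old)}"] + recent
```

Run this between agent steps and the transcript stays bounded indefinitely, at the cost of lossy memory of the early hours.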

ChadMoran|5 months ago

Sub-agents. I've had Claude Code run a prompt for hours on end.

osn9363739|5 months ago

Have they released the code for this? Does it work? Or are there X number of caveats and excuses? I'm kind of sick of them (and others) getting a free pass for saying stuff like this.

haute_cuisine|5 months ago

They don't seem to link any source code or demo. They could have run Claude for 10 hours to write thousands of The Verge articles as well.