postalcoder|17 days ago

First thoughts using gpt-5.3-codex-spark in Codex CLI:

Blazing fast but it definitely has a small model feel.

It's tearing up bluey bench (my personal agent speed benchmark): a file-system task where I have the agent generate transcripts for untitled episodes of a season of Bluey, perform a web search to find the official episode descriptions, and then match the transcripts against the descriptions to generate file names and metadata for each episode.
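
The final matching step boils down to something like this (a simplified sketch; the file layout and episode data are illustrative assumptions, not the actual bench setup):

  # Simplified sketch: pair each transcript with the most similar
  # official description, then name the file after that episode.
  from difflib import SequenceMatcher
  from pathlib import Path

  # Episode title -> official description (placeholders; the real list
  # comes from the web search step).
  descriptions = {
      "Hammerbarn": "(official episode description here)",
      "Dance Mode": "(official episode description here)",
  }

  def best_match(transcript: str) -> str:
      # Pick the episode whose description is most similar to the transcript.
      return max(descriptions, key=lambda title:
                 SequenceMatcher(None, transcript, descriptions[title]).ratio())

  for path in Path("transcripts").glob("*.txt"):
      title = best_match(path.read_text())
      path.rename(path.with_stem(title))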

Downsides:

- It has to be explicitly prompted to follow instructions in my media library's AGENTS.md that the larger models adhere to without additional prompting.

- It's less careful with how it handles context, which means its actions are less context-efficient. Combine that with the smaller context window and I'm seeing frequent compactions.

  Bluey Bench* (minus transcription time):

  Codex CLI
  gpt-5.3-codex-spark low        20s
  gpt-5.3-codex-spark medium     41s
  gpt-5.3-codex-spark xhigh   1m 09s (1 compaction)

  gpt-5.3-codex low           1m 04s
  gpt-5.3-codex medium        1m 50s

  gpt-5.2 low                 3m 04s
  gpt-5.2 medium              5m 20s

  Claude Code
  opus-4.6 (no thinking)      1m 04s

  Antigravity
  gemini-3-flash              1m 40s
  gemini-3-pro low            3m 39s

  *Season 2, 52 episodes

HumanOstrich|16 days ago

Yea, it's been butchering relatively easy-to-moderate tasks for me, even with reasoning set to high. I'm hoping it's just tuning that still needs to be done, since they've had to port it to a novel architecture.

If instead the model is performing worse due to how much they had to shrink it just so it will fit on Cerebras hardware, then we might be in for a long wait for the next gen of ginormous chips.

postalcoder|16 days ago

Agree w/ you on the model's tendency to butcher things. Performance-wise, this almost feels like the GPT-OSS model.

I need to incorporate "risk of major failure" into bluey bench. Spark is a dangerous model. It doesn't strongly internalize the consequences of the commands that it runs, even on xhigh. As a result, I'm observing a high tendency to run destructive commands.

For instance, I asked it to assign random numbers to the filenames of the videos in my folder to run the benchmark. It accidentally deleted the files on most of the runs. The funniest part is that it comes back within a few seconds and says something like "Whoops, I have to keep it real, I just deleted the files in your folder."
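
For reference, the task as prompted was trivial – something like this non-destructive rename (a sketch; the folder and extension are assumptions):

  # Prefix each video with a random number. A careful agent only
  # renames; nothing here deletes.
  import random
  from pathlib import Path

  for path in Path("videos").glob("*.mkv"):
      path.rename(path.with_name(f"{random.randint(0, 9999):04d}_{path.name}"))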

jychang|16 days ago

> If instead the model is performing worse due to how much they had to shrink it just so it will fit on Cerebras hardware

They really should have just named it "gpt-5.3-codex-mini" (served by Cerebras). It would have made it clear what this model really is.

alexdobrenko|17 days ago

can we please make the bluey bench the gold standard for all models always

Squarex|17 days ago

I wonder why they named it so similarly to the normal codex model when it's much worse (though still cool, of course).

HumanOstrich|16 days ago

Not sure what you mean. It IS the same model, just a smaller version of it. And gpt-5.3-codex is a smaller version of gpt-5.3 trained more on code and agentic tasks.

Their naming has been pretty consistent since gpt-5. For example, gpt-5.1-codex-max > gpt-5.1-codex > gpt-5.1-codex-mini.

mnicky|17 days ago

Can you compare it to Opus 4.6 with thinking disabled? It seems to have very impressive benchmark scores. Could also be pretty fast.

postalcoder|17 days ago

Added a thinking-disabled Opus 4.6 timing. It took 1m 4s – coincidentally the same as 5.3-codex-low.

yojo|16 days ago

I’ve been slow to invest in building flows around parallelizing agent work under the assumption that eventually inference will get fast enough that I will basically always be the bottleneck.

Excited to see glimpses of that future. Context switching sucks and I’d much rather work focused on one task while wielding my coding power tools.

ttul|17 days ago

I gave it a run on my Astro website project. It definitely makes more mistakes than Codex-5.3, but the speed is something to behold. The text flashes by way faster than I can understand what's going on. And most of its edits worked. I then used Codex-5.3-xhigh to clean things up...

varenc|17 days ago

How do the agents perform the transcription? I'm guessing just calling out to other tools like Whisper? Do all models/agents take the same approach or do they differ?

also as a parent, I love the bluey bench concept!

postalcoder|16 days ago

I'm using Whisper via the Groq API to transcribe the files in parallel. But (caveat) I cut the transcription step out of the timings and had the models operate on a shared transcript folder, so the times you see are pure search-and-categorization times.
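
The transcription step is along these lines (a sketch using Groq's OpenAI-compatible Python SDK; the paths, worker count, and output layout are assumptions):

  # Sketch of parallel Whisper transcription via the Groq API.
  from concurrent.futures import ThreadPoolExecutor
  from pathlib import Path
  from groq import Groq  # pip install groq; reads GROQ_API_KEY from the env

  client = Groq()

  def transcribe(path: Path) -> None:
      with open(path, "rb") as f:
          result = client.audio.transcriptions.create(
              file=(path.name, f.read()),
              model="whisper-large-v3",
          )
      # Write the transcript next to the audio file.
      path.with_suffix(".txt").write_text(result.text)

  with ThreadPoolExecutor(max_workers=8) as pool:
      list(pool.map(transcribe, Path("videos").glob("*.mp3")))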

Re: your question about the approach – they all took on the problem in different ways, which I found fascinating.

Codex Spark was so fast because it noticed that Bluey announces the episode name within the episode itself ("This episode of Bluey is called ____."). So, instead of doing a full transcript<->web-description match, it cut the title names out of the transcripts and matched only those against the episode descriptions.
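
That shortcut is essentially one regex per transcript (sketch):

  # Spark's shortcut, approximately: each episode announces its own
  # title, so extract that line instead of matching full transcripts.
  import re

  def announced_title(transcript: str) -> str | None:
      m = re.search(r"[Tt]his episode of Bluey is called ([^.!?\n]+)", transcript)
      return m.group(1).strip() if m else None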

The larger models were more careful and seemed to actually double-check their work by reading the full transcripts and matching them against the descriptions.

gpt-5.2 applied a level of care that wasn't wrong, but was unnecessary.

Sonnet 4.5 (non-thinking) took the most frustrating approach. It tried to automate the pairing process with scripting, matching each extracted title to the official title via regex. So, instead of just eyeballing the lists of extracted and official titles and pairing them manually, it relied purely on the script's logging as its eyes. When the script failed to match all 52 episodes perfectly, it went into a six-iteration loop of writing increasingly convoluted regex until it found 52 matches (which ended up pairing episodes incorrectly). It was frustrating behavior; I stopped the loop after four minutes.

In my mind, the "right way" was straightforward, but that wasn't borne out by how differently the LLMs behaved.
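
The straightforward approach here would be a single pass of fuzzy matching rather than a regex loop – e.g., along these lines, with the list contents as placeholders:

  # One-shot fuzzy pairing of extracted titles against official titles,
  # instead of iterating on regex (list contents are placeholders).
  from difflib import get_close_matches

  official = ["Hammerbarn", "Dance Mode"]   # from the web search
  extracted = ["hammerbarn", "dance mode"]  # from the transcripts

  pairs = {e: get_close_matches(e, official, n=1, cutoff=0.6) for e in extracted}
  print(pairs)  # {'hammerbarn': ['Hammerbarn'], 'dance mode': ['Dance Mode']}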

jiggawatts|16 days ago

Most frontier models are multi-modal and can handle audio or video files as input natively.

I'm experimenting right now with an English-to-Thai subtitle translator that feeds in the existing English subtitles as well as a mono (centre-weighted) audio track extracted using ffmpeg. This is needed because Thai has gendered particles -- word choice depends on the sex of the speaker, which is not recorded in the English text. The AIs can infer this to a degree, but they do better when given audio so that they can do speaker diarization.
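
The extraction step is a single ffmpeg call, roughly like this (a sketch; the pan weights assume a 5.1 source and are illustrative, not a canonical command):

  # Centre-weighted mono downmix for speech (assumes a 5.1 source;
  # weights are illustrative).
  import subprocess

  subprocess.run([
      "ffmpeg", "-i", "movie.mkv", "-vn",
      "-af", "pan=mono|c0=0.6*FC+0.2*FL+0.2*FR",  # emphasize the centre channel
      "-ar", "16000", "audio.wav",
  ], check=True)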