postalcoder | 17 days ago
Blazing fast but it definitely has a small model feel.
It's tearing up Bluey Bench, my personal agent speed benchmark. It's a file system benchmark where I have the agent generate transcripts for the untitled episodes of a season of Bluey, run a web search to find the episode descriptions, and then match the transcripts against the descriptions to generate file names and metadata for each episode.
Downsides:
- It has to be explicitly prompted to follow instructions in my media library's AGENTS.md that the larger models follow without additional prompting.
- It's less careful with how it handles context, so its actions are less context efficient. Combine that with the smaller context window and I'm seeing frequent compactions.
Bluey Bench* (minus transcription time):

Codex CLI
  gpt-5.3-codex-spark low       20s
  gpt-5.3-codex-spark medium    41s
  gpt-5.3-codex-spark xhigh     1m 09s (1 compaction)
  gpt-5.3-codex low             1m 04s
  gpt-5.3-codex medium          1m 50s
  gpt-5.2 low                   3m 04s
  gpt-5.2 medium                5m 20s

Claude Code
  opus-4.6 (no thinking)        1m 04s

Antigravity
  gemini-3-flash                1m 40s
  gemini-3-pro low              3m 39s

*Season 2, 52 episodes
HumanOstrich|16 days ago
If instead the model is performing worse due to how much they had to shrink it just so it will fit on Cerebras hardware, then we might be in for a long wait for the next gen of ginormous chips.
postalcoder|16 days ago
I need to incorporate "risk of major failure" into Bluey Bench. Spark is a dangerous model. It doesn't strongly internalize the consequences of the commands that it runs, even on xhigh. As a result, I'm observing a high tendency to run destructive commands.
For instance, I asked it to assign random numbers to the filenames of the videos in my folder so I could run the benchmark. It accidentally deleted the files on most of the runs. The funniest part is that it comes back to you within a few seconds and says something like "Whoops, I have to keep it real, I just deleted the files in your folder."
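For reference, the renaming task itself does not need any destructive command. Here is a minimal Python sketch of a non-destructive version, assuming a flat folder of video files; the function names, the five-digit number range, and the dry-run default are my own choices for illustration, not what any of the agents actually ran.

```python
import random
from pathlib import Path
from typing import Dict, Optional


def plan_random_names(folder: Path, seed: Optional[int] = None) -> Dict[Path, Path]:
    """Map every file in `folder` to a unique random-numbered name (same extension)."""
    rng = random.Random(seed)
    files = sorted(p for p in folder.iterdir() if p.is_file())
    # sample() never repeats a value, so two files cannot get the same number.
    numbers = rng.sample(range(10_000, 100_000), k=len(files))
    return {src: src.with_name(f"{num}{src.suffix}") for src, num in zip(files, numbers)}


def apply_plan(plan: Dict[Path, Path], dry_run: bool = True) -> None:
    """Rename files according to `plan`; never delete and never overwrite."""
    # Check every target before touching anything, so a clash aborts the whole run.
    clashes = [dst for dst in plan.values() if dst.exists()]
    if clashes:
        raise FileExistsError(f"refusing to overwrite: {clashes}")
    for src, dst in plan.items():
        if dry_run:
            print(f"would rename {src.name} -> {dst.name}")
        else:
            src.rename(dst)
```

Running `apply_plan(plan)` with the default `dry_run=True` only prints the intended renames, which is the kind of check I'd want an agent to do before it touches a media folder.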
jychang|16 days ago
They really should have just named it "gpt-5.3-codex-mini" (served by Cerebras). It would have made it clear what this model really is.
HumanOstrich|16 days ago
Their naming has been pretty consistent since gpt-5. For example, gpt-5.1-codex-max > gpt-5.1-codex > gpt-5.1-codex-mini.
yojo|16 days ago
Excited to see glimpses of that future. Context switching sucks and I’d much rather work focused on one task while wielding my coding power tools.
varenc|17 days ago
Also, as a parent, I love the Bluey Bench concept!
postalcoder|16 days ago
re. your question about the approach – they all took on the problem in different ways that I found fascinating.
Codex Spark was so fast because it noticed that Bluey announces the episode name in the episode ("This episode of Bluey is called ____."). So, instead of doing a pure match of full transcript against web description, it cut the announced titles out of the transcripts and matched only those against the episode descriptions.
The larger models were more careful and seemed to actually try to double-check their work by reading the full transcripts and matching them against descriptions.
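A rough sketch of that shortcut in Python, assuming the transcript contains the announcement more or less verbatim and that the title ends at the first sentence-ending punctuation mark. The regex and the `extract_title` helper are illustrative only; I did not capture the exact commands Spark ran.

```python
import re
from typing import Optional

# Matches the spoken announcement and captures everything up to the first
# sentence-ending punctuation mark. Case-insensitive because transcription
# tools are inconsistent about capitalization.
TITLE_RE = re.compile(
    r"this episode of bluey is called\s+(.+?)\s*[.!?]",
    re.IGNORECASE,
)


def extract_title(transcript: str) -> Optional[str]:
    """Return the announced episode title, or None if no announcement is found."""
    match = TITLE_RE.search(transcript)
    if match is None:
        return None
    # Strip stray quotes or spaces that a transcription tool may add.
    return match.group(1).strip(" \"'")
```

A title that itself contains a period or question mark would be cut short by this pattern, which is one reason the shortcut is faster but less robust than reading the full transcript.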
gpt-5.2 went through a level of care that wasn't wrong, but was unnecessary.
Sonnet 4.5 (non-thinking) took the most frustrating approach. It tried to automate the pairing process with a script that matched each extracted title to the official title via regex. So, instead of just eyeballing the lists of extracted and official titles and matching them manually, it relied purely on the script's logging as its eyes. When the script failed to match all 52 episodes perfectly, it went into a six-iteration loop of writing increasingly convoluted regex until it found 52 matches (which ended up matching some episodes incorrectly). It was frustrating behavior, and I stopped the loop after four minutes.
In my mind, the "right way" was straightforward, but that wasn't borne out by how differently the LLMs behaved.
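If you do want to script the pairing, one alternative to piling on regex is fuzzy matching with a cutoff, where anything below the cutoff is reported for manual review instead of being forced into a match. The sketch below uses `difflib` from the standard library; the 0.8 cutoff, the greedy one-to-one assignment, and the function names are assumptions of mine, not what Sonnet wrote.

```python
import difflib
from typing import Dict, List, Tuple


def normalize(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def match_titles(
    extracted: Dict[str, str],
    official: List[str],
    cutoff: float = 0.8,
) -> Tuple[Dict[str, str], List[str]]:
    """Pair each file's extracted title with at most one official title.

    Returns (matches, needs_review): `matches` maps filename -> official title,
    `needs_review` lists filenames with no candidate above the cutoff.
    """
    remaining = {normalize(t): t for t in official}
    matches: Dict[str, str] = {}
    needs_review: List[str] = []
    for filename, heard in extracted.items():
        best = difflib.get_close_matches(
            normalize(heard), list(remaining), n=1, cutoff=cutoff
        )
        if best:
            # pop() keeps the pairing one-to-one: a title can be used only once.
            matches[filename] = remaining.pop(best[0])
        else:
            needs_review.append(filename)
    return matches, needs_review
```

The point of the design is the `needs_review` list: a handful of leftovers is easy to eyeball, whereas a script that keeps loosening its pattern until it reports 52 out of 52 hides the mismatches.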
jiggawatts|16 days ago
I'm experimenting right now with an English to Thai subtitle translator that feeds in the existing English subtitles as well as a mono (centre-weighted) audio track extracted using ffmpeg. This is needed because Thai has gendered particles -- word choice depends on the sex of the speaker, which is not recorded in English text. The AIs can infer this to a degree, but they do better when given audio so that they can do speaker diarization.
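For anyone wanting to try the audio step, here is a hedged sketch of how the ffmpeg command could be assembled from Python. It assumes a source with a front-centre channel (for example 5.1) and uses ffmpeg's `pan` filter; the channel weights, the 16 kHz sample rate, and the helper name are guesses for illustration rather than the exact settings described above.

```python
from pathlib import Path
from typing import List


def mono_audio_command(
    video: Path,
    out_wav: Path,
    has_centre: bool = True,
    sample_rate: int = 16000,
) -> List[str]:
    """Build an ffmpeg argument list that extracts a mono track from `video`."""
    cmd = ["ffmpeg", "-n", "-i", str(video), "-vn"]  # -n: never overwrite, -vn: drop video
    if has_centre:
        # Weight the centre (dialogue) channel above left/right. The '<' asks
        # the pan filter to renormalize the gains so the mix does not clip.
        cmd += ["-af", "pan=mono|c0<FC+0.5*FL+0.5*FR"]
    else:
        # Assumed fallback for sources without a centre channel: plain downmix.
        cmd += ["-ac", "1"]
    cmd += ["-ar", str(sample_rate), str(out_wav)]
    return cmd
```

The list can be passed to `subprocess.run(cmd, check=True)`; passing it as a list rather than a shell string avoids quoting problems with the `|` in the filter expression.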