Unrolling the Codex agent loop

[+] postalcoder|1 month ago|reply

The best part about this blog post is that none of it is a surprise – Codex CLI is open source. It's nice to be able to go through the internals without having to reverse engineer it.

Their communication is exceptional, too. Eric Traut (of Pyright fame) is all over the issues and PRs.

https://github.com/openai/codex

[+] vinhnx|1 month ago|reply

This came as a big surprise to me last year. I remember they announced that Codex CLI is opensource, and the codex-rs [0] from TypeScript to Rust, with the entire CLI now open source. This is a big deal and very useful for anyone wanting to learn how coding agents work, especially coming from a major lab like OpenAI. I've also contributed some improvements to their CLI a while ago and have been following their releases and PRs to broaden my knowledge.

[0] https://github.com/openai/codex/tree/main/codex-rs

[+] redox99|1 month ago|reply

For some reason a lot of people are unaware that Claude Code is proprietary.

[+] frumplestlatz|1 month ago|reply

At this point I just assume Claude Code isn't OSS out of embarrassment for how poor the code actually is. I've got a $200/mo claude subscription I'm about to cancel out of frustration with just how consistently broken, slow, and annoying to use the claude CLI is.

[+] boguscoder|1 month ago|reply

I thought Eric Traut was famous for his pioneering work in virtualization, TIL he has Pyright fame too !

[+] unknown|1 month ago|reply

[deleted]

[+] appplication|1 month ago|reply

I appreciate the sentiment but I’m giving OpenAI 0 credit for anything open source, given their founding charter and how readily it was abandoned when it became clear the work could be financially exploited.

[+] psychoslave|1 month ago|reply

Is it just a frontend CLI calling remote external logic for the bulk of operations, or does it come with everything needed to run lovely offline? Does it provide weights under FLOW license? Does it document the whole build process and how to redo and go further on your own?

[+] westoncb|1 month ago|reply

Interesting that compaction is done using an encrypted message that "preserves the model's latent understanding of the original conversation":

> Since then, the Responses API has evolved to support a special /responses/compact endpoint (opens in a new window) that performs compaction more efficiently. It returns a list of items (opens in a new window) that can be used in place of the previous input to continue the conversation while freeing up the context window. This list includes a special type=compaction item with an opaque encrypted_content item that preserves the model’s latent understanding of the original conversation. Now, Codex automatically uses this endpoint to compact the conversation when the auto_compact_limit (opens in a new window) is exceeded.

[+] icelancer|1 month ago|reply

Their compaction endpoint is far and away the best in the industry. Claude's has to be dead last.

[+] swalsh|1 month ago|reply

Is it possible to use the compactor endpoint independently? I have my own agent loop i maintain for my domain specific use case. We built a compaction system, but I imagine this is better performance.

[+] jswny|1 month ago|reply

How does this work for other models that aren’t OpenAI models

[+] jumploops|1 month ago|reply

One thing that surprised me when diving into the Codex internals was that the reasoning tokens persist during the agent tool call loop, but are discarded after every user turn.

This helps preserve context over many turns, but it can also mean some context is lost between two related user turns.

A strategy that's helped me here, is having the model write progress updates (along with general plans/specs/debug/etc.) to markdown files, acting as a sort of "snapshot" that works across many context windows.

[+] EnPissant|1 month ago|reply

I don't think this is true.

I'm pretty sure that Codex uses reasoning.encrypted_content=true and store=false with the responses API.

reasoning.encrypted_content=true - The server will return all the reasoning tokens in an encrypted blob you can pass along in the next call. Only OpenaAI can decrypt them.

store=false - The server will not persist anything about the conversation on the server. Any subsequent calls must provide all context.

Combined the two above options turns the responses API into a stateless one. Without these options it will still persist reasoning tokens in a agentic loop, but it will be done statefully without the client passing the reasoning along each time.

[+] CjHuber|1 month ago|reply

It depends on the API path. Chat completions does what you describe, however isn't it legacy?

I've only used codex with the responses v1 API and there it's the complete opposite. Already generated reasoning tokens even persist when you send another message (without rolling back) after cancelling turns before they have finished the thought process

Also with responses v1 xhigh mode eats through the context window multiples faster than the other modes, which does check out with this.

[+] xg15|1 month ago|reply

I think it might be a good decision though, as it might keep the context aligned with what the user sees.

If the reasoning tokens where persisted, I imagine it would be possible to build up more and more context that's invisible to the user and in the worst case, the model's and the user's "understanding" of the chat might diverge.

E.g. image a chat where the user just wants to make some small changes. The model asks whether it should also add test cases. The user declines and tells the model to not ask about it again.

The user asks for some more changes - however, invisibly to the user, the model keeps "thinking" about test cases, but never telling outside of reasoning blocks.

So suddenly, from the model's perspective, a lot of the context is about test cases, while from the user's POV, it was only one irrelevant question at the beginning.

[+] olliepro|1 month ago|reply

I made a skill that reflects on past conversations via parallel headless codex sessions. Its great for context building. Repo: https://github.com/olliepro/Codex-Reflect-Skill

[+] hedgehog|1 month ago|reply

This is effective and it's convenient to have all that stuff co-located with the code, but I've found it causes problems in team environments or really anywhere where you want to be able to work on multiple branches concurrently. I haven't come up with a good answer yet but I think my next experiment is to offload that stuff to a daemon with external storage, and then have a CLI client that the agent (or a human) can drive to talk to it.

[+] vmg12|1 month ago|reply

I think this explains why I'm not getting the most out of codex, I like to interrupt and respond to things i see in reasoning tokens.

[+] ljm|1 month ago|reply

I’ve been using agent-shell in emacs a lot and it stores transcripts of the entire interaction. It’s helped me out lot of times because I can say ‘look at the last transcript here’.

It’s not the responsibility of the agent to write this transcript, it’s emacs, so I don’t have to worry about the agent forgetting to log something. It’s just writing the buffer to disk.

[+] crorella|1 month ago|reply

Same here! I think it would be good if this could be made by default by the tooling. I've seen others using SQL for the same and even the proposal for a succinct way of representing this handoff data in the most compact way.

[+] sdwr|1 month ago|reply

That could explain the "churn" when it gets stuck. Do you think it needs to maintain an internal state over time to keep track of longer threads, or are written notes enough to bridge the gap?

[+] pcwelder|1 month ago|reply

Sonnet has the same behavior: drops thinking on user message. Curiously in the latest Opus they have removed this behavior and all thinking tokens are preserved.

[+] behnamoh|1 month ago|reply

but that's why I like Codex CLI, it's so bare bone and lightweight that I can build lots tools on top of it. persistent thinking tokens? let me have that using a separate file the AI writes to. the reasoning tokens we see aren't the actual tokens anyway; the model does a lot more behind the scenes but the API keeps them hidden (all providers do that).

[+] dayone1|1 month ago|reply

where do you save the progress updates in? and do you delete them afterwards or do you have like 100+ progress updates each time you have claude or codex implement a feature or change?

[+] lighthouse1212|1 month ago|reply

[deleted]

[+] lighthouse1212|1 month ago|reply

[deleted]

[+] coffeeaddict1|1 month ago|reply

What I really want from Codex is checkpoints ala Copilot. There are a couple of issues [0][1] opened about on GitHub, but it doesn't seem a priority for the team.

[0] https://github.com/openai/codex/issues/2788

[1] https://github.com/openai/codex/issues/3585

[+] SafeDusk|1 month ago|reply

These can also be observed through OTEL telemetries.

I use headless codex exec a lot, but struggles with its built-in telemetry support, which is insufficient for debugging and optimization.

Thus I made codex-plus (https://github.com/aperoc/codex-plus) for myself which provides a CLI entry point that mirrors the codex exec interface but is implemented on top of the TypeScript SDK (@openai/codex-sdk).

It exports the full session log to a remote OpenTelemetry collector after each run which can then be debugged and optimized through codex-plus-log-viewer.

[+] mkw5053|1 month ago|reply

I guess nothing super surprising or new but still valuable read. I wish it was easier/native to reflect on the loop and/or histories while using agentic coding CLIs. I've found some success with an MCP that let's me query my chat histories, but I have to be very explicit about it's use. Also, like many things, continuous learning would probably solve this.

[+] daxfohl|1 month ago|reply

I like it but wonder why it seems so slow compared to the chatgpt web interface. I still find myself more productive copying and pasting from chat much of the time. You get virtually instant feedback, and it feels far more natural when you're tossing around ideas, seeing what different approaches look like, trying to understand the details, etc. Going back to codex feels like you're waiting a lot longer for it to do the wrong thing anyway, so the feedback cycle is way slower and more frustrating. Specifically I hate when I ask a question, and it goes and starts editing code, which is pretty frequent. That said, it's great when it works. I just hope that someday it'll be as easy and snappy to chat with as the web interface, but still able to perform local tasks.

[+] written-beyond|1 month ago|reply

Has anyone seriously used codex cli? I was using LLMs for code gen usually through the vscode codex extension, Gemini cli and Claude Code cli. The performance of all 3 of them is utter dog shit, Gemini cli just randomly breaks and starts spamming content trying to reorient itself after a while.

However, I decided to try codex cli after hearing they rebuilt it from the ground up and used rust(instead of JS, not implying Rust==better). It's performance is quite literally insane, its UX is completely seamless. They even added small nice to haves like ctrl+left/right to skip your cursor to word boundaries.

If you haven't I genuinely think you should give it a try you'll be very surprised. Saw Theo(yc ping labs) talk about how open ai shouldn't have wasted their time optimizing the cli and made a better model or something. I highly disagree after using it.

[+] tecoholic|1 month ago|reply

I use 2 cli - Codex and Amp. Almost every time I need a quick change, Amp finishes the task in the time it takes Codex to build context. I think it’s got a lot to do with the system prompt and a the “read loop” as well, amp would read multiple files in one go and get to the task, but codex would crawl the files almost one by one. Anyone noticed this?

[+] sumedh|1 month ago|reply

Which Gpt model and reasoning level did you use in Codex and Amp?

Generally I have noticed Gpt 5.2 codex is slower compared to Sonnet 4.5 in Claude Code.

[+] nl|1 month ago|reply

Amp uses Gemini 3 Flash to explore code first. That's model is a great speed/intelligence trade-off especially for that use case.

[+] anukin|1 month ago|reply

What is your general flow with amp? I plan to try it out myself and have been on the fences for a while.

[+] dfajgljsldkjag|1 month ago|reply

The best part about this is how the program acts like a human who is learning by doing. It is not trying to be perfect on the first try, it is just trying to make progress by looking at the results. I think this method is going to make computers much more helpful because they can now handle the messy parts of solving a problem.

[+] rvnx|1 month ago|reply

Codex agent loop:

    Call the model. If it asks for a tool, run the tool and call again (with the new result appended). Otherwise, done

https://i.ytimg.com/vi/74U04h9hQ_s/maxresdefault.jpg

[+] gzalo|1 month ago|reply

Wow, this part where they describe skills sounds quite odd https://github.com/openai/codex/blob/99f47d6e9a3546c14c43af9...

Why wouldnt they just expose the files directly? Having the model ask for them as regular files sounds a bit odd

[+] mike_hearn|1 month ago|reply

That's the whole point of skills - they help reduce context window usage by letting the model open only the ones that are relevant.

[+] rco8786|1 month ago|reply

Think of it as Just-In-Time context injection/enhancement

[+] albert_e|1 month ago|reply

Offtopic but --

The "Listen to article" media player at the top of the post -- was super quick to load on mobile but took two attempts and a page refresh to load on desktop.

If I want to listen as well as read the article ... the media player scrolls out of view along with the article title as we scroll down ..leaving us with no way to control (pause/play) the audio if needed.

There are no playback controls other than pause and speed selector. So we cannot seek or skip forward/backward if we miss a sentence. the time display on the media player is also minimal. Wish these were a more accessible standardized feature set available on demand and not limited by what the web designer of each site decides.

I asked "Claude on Chrome" extension to fix the media player to the top. It took 2 attempts to get it right. (It was using Haiku by default -- may be a larger model was needed for this task). I think there is scope to create a standard library for such client side tweaks to web pages -- sort of like greasemonkey user scripts but at a slightly higher level of abstraction with natural language prompts.

[+] ipotapov|1 month ago|reply

Regarding the user instruction aggregation process in the agent loop, I'm curious how you manage context retention in multi-turn interactions. Have you explored any techniques for dynamically adjusting the context based on the evolving user requirements?

[+] kordlessagain|1 month ago|reply

If anyone cares to use Codex in a nice Docker container: https://github.com/DeepBlueDynamics/codex-container

[+] doanbactam|1 month ago|reply

I completely agree. I use the Codex for complex, hard-to-handle problems and use OpenCode alongside other models for development tasks. The Codex handles things quite well, including how it handles hooks, memory, etc.

[+] mohsen1|1 month ago|reply

Tool call during thinking is something similar to this I am guessing. Deepseek has a paper on this.

Or am I not understanding this right?

[+] I_am_tiberius|1 month ago|reply

Pity it doesn't support other llms.

[+] evilduck|1 month ago|reply

It does, it's just a bit annoying.

I have this set up as a shell script (or you could make it an alias):

    codex --config model="gpt-oss-120b" --config model_provider=custom

with ~/.codex/config.toml containing:

    [model_providers.custom]
    name = "Llama-swap Local Service"
    base_url = "http://localhost:8080/v1"
    http_headers = { "Authorization" = "Bearer sk-123456789" }
    wire_api = "chat"

    # Default model configuration
    model = "gpt-oss-120b"
    model_provider = "custom"

208 comments