Shipping with Codex
I'm Thibaud.
And I'm here at OpenAI and I build
Codex.
With Codex, we're building an AI
software engineer.
I personally like to think about it as a
little bit like a human teammate.
You can pair program with it on your
computer.
You can delegate to it. Or, as you'll
see, you can give it a job without
explicit prompting.
There's been, recently,
a massive vibe shift.
This started in August, when we
had pretty decent usage, and since then,
thanks to all of you,
we've grown tenfold.
Today, I want to start by sharing some
of the recent updates that have created
this vibe shift.
Then, we'll bring some engineers from
OpenAI to show you some examples of how
we use Codex day-to-day.
Some of them are building here on the
Codex team. Some of them are just really
excited users of Codex at OpenAI.
Let's first talk about some of those
updates.
Codex now works everywhere you build.
Whether it's in your IDE, your terminal,
GitHub, web, or mobile.
No matter where you are,
it is the same powerful agent under the
hood.
The first and most important improvement
we made was to completely overhaul the
agent.
We think of the agent as a combination
of two things.
The reasoning model under the hood and
its tool harness to allow it to act and
impact change upon the world to create
value for you.
First, the model.
In August, we shipped GPT-5, our best
agentic model thus far.
That was until we listened to your
feedback.
And we improved upon it by shipping
GPT-5 Codex, a model that was further
optimized for work within Codex,
improving by being smarter, better
following code style, and adapting its
thinking time.
One of my favorite quotes from the
feedback from you all was that it feels
a little bit more like a true senior
engineer because it gives so few
compliments. And it also pushes back
on bad ideas.
Next, we completely rewrote the harness
to make the most of the new models.
We added support for planning, MCP, auto
compaction,
so that you can have these really long
conversations and interactions, and so
much more.
At this point, we started seeing the CLI
usage take off.
But, there's more feedback.
The model felt really good. The agent
was useful, but the CLI felt early.
We appreciate the feedback, and so we
decided to completely revamp the Codex
CLI. We simplified approval modes,
created a more legible UI, and added a
ton of polish.
And by default, it works with
sandboxing, so it is safe by default,
but you always have control.
It's been a work in progress, and we
shipped a big update last Friday.
We'll ship a new release today again.
More feedback from you all. A bunch of
you collaborate with the agent and want
to look at the code at the same time.
This is why we shipped it in the IDE
directly as a native extension.
Here, it works alongside your code. You
keep control over your IDE, and you
get this little collaborator. It
works in VS Code, it works in Cursor,
and other popular forks.
This immediately took off.
Within the first week, we had 100,000
users. Many of you, I'm sure, are in
this room. A lot of our users prefer to
use Codex in their IDE directly. Part of
the magic here is that it is the exact
same agent.
It is the same open-source harness that
is powering the CLI bundled right within
the extension.
At the same time, we're also upgrading
Codex Cloud
so that
you could run many more tasks in
parallel.
For us, this is still the beginning, but
we think it's incredibly cool to be able
to command Codex through your phone.
Cloud tasks now run 90%
faster. They can set up their
dependencies automatically and verify
their work by taking screenshots and
sending them to you.
Giving the agent its own computer really
feels magical when it works.
And then, you can start working with
agents like this in tools like GitHub.
Or now Slack.
Here's an example of one engineer, whom
some of you might know,
who had a question. And then another
engineer, Steve Lee, immediately jumps
on it and delegates it to Codex. Here,
Codex receives the entire context from
the thread and just gets to work. A
couple of minutes later, it posts a
solution together with a summary. It
actually went, explored the whole
problem, and wrote some code.
All of this progress means that we can
write code so much faster,
which also means that we have a lot of
code collectively to review.
Validating and reviewing code is now
becoming a huge bottleneck.
We've been thinking about this for
a while.
Past experiments with code review at
OpenAI showed that it could be
useful, but also oftentimes noisy.
Previous attempts had to be turned off
because users were complaining about the
lack of signal.
So, we purposely trained GPT-5 Codex to
be great at ultra-thorough code review.
It goes through the dependencies, all
the code in depth inside its little
container,
truly explores the contract of your
intent and what actually happens in the
implementation,
and then comes back with high-quality
findings. We now find that many teams
decide to enable it by default and
almost want to make it mandatory
because it delivers such high-signal
findings every time.
You can trigger it while pairing with
Codex, or you can completely automate it
by running on every PR in GitHub.
Okay.
It's been a busy few months for a small,
growing team.
We've been using Codex to build Codex.
There's really no way we could have done
it without it.
Even more fun has been seeing OpenAI as
a whole get accelerated.
Today,
92%,
almost all of OpenAI's technical staff,
use Codex daily.
That's up from 50% around last July.
Engineers that use Codex submit 70% more
PRs per week.
And pretty much all PRs are reviewed by
Codex.
When it finds an issue, people are
actually excited. It saves you time. You
ship with more confidence.
There's nothing worse than finding a bug
after you actually shipped the feature.
When we as a team see the stats, it
feels great. But even better is being at
lunch with someone
who then goes, "Hey, I use Codex all the
time. Here's a cool thing that I do with
it. Do you want to hear about it?"
And so we wanted to give you a taste of
that.
So, let's get to lunch with a few
teammates
and hear about their stories. They'll
show you real workflows of our teams,
how they use it every day.
Please welcome Nacho to the stage to
talk about iterating on UI
for the ChatGPT iOS app.
[Applause]
Thanks, Thibaud.
Hello. My name is Nacho Soto. I'm a
member of the core iOS team at OpenAI.
Going to do two things today. I'm going
to tell you about a workflow that I use
frequently when building the ChatGPT
app. And I'd like to share a demo that
shows you how I do this.
Let's start with the demo.
Thibaud asked me to build a weather app.
So, I have a starter project with just
an empty window.
And I've also asked ChatGPT to make a
mock-up of what I want the UI to look
like.
So, I'm going to ask Codex
to implement that design.
Great. While that's running, let me tell
you what's special about how Codex is
going to implement this.
Working in the ChatGPT core team means I
spend a lot of time on infrastructure,
performance, but also do some amount of
front-end work.
Recently, I worked on this small feature
where we simplified our personalization
screen to make our new ChatGPT
personalities more discoverable.
And I'm sure you've run into something
like this before. With that last 10% of
polish, like getting these headers and
footers aligned, it's actually taking
90% of the time.
But Codex can help you with that 10%.
And you can work on that while you do
something else. Maybe you're watching
some of the other DevDay talks.
You can even have nine other terminal
tabs running Codex if you want to be a
true 10x engineer.
Who here has been sent a pull request
from a junior engineer, and within a few
seconds you know that they didn't
actually test it, cuz there's no way
that it works.
If you used ChatGPT or any other agent 6
months ago, you were working with that
junior engineer.
But Codex is not.
Like Thibaud said, I would argue that Codex
is now a senior engineer.
It doesn't just write the code and
assume that it works. It will verify
that it does.
I'm a big fan of TDD, test-driven
development. And I think Codex really
thrives with that workflow.
It will run your tests, fix the code,
run your tests again, over and over
until they pass.
But why stop at unit tests?
Codex is multimodal, which means it can
also verify its work visually.
A few weeks ago, we gave Codex that
superpower of being able to see images.
So, I taught it to generate snapshots
for the UI code that it writes.
And best of all, it's actually very
simple.
First,
I made this very simple makefile that
runs the unit tests to extract the
SwiftUI previews.
And that calls a small Python script,
which Codex wrote, by the way.
And that extracts those images, puts
them in a folder so Codex can find them.
Then in the agents.md I just told Codex
about that script, and I've asked it to
use it to verify its work.
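The talk does not show the script itself, so what follows is only a minimal sketch of the "extract the images and put them in a folder" step. The function name `collect_snapshots` and both directory names are hypothetical, and the sketch assumes the test run has already written PNG files somewhere on disk.

```python
import shutil
import tempfile
from pathlib import Path


def collect_snapshots(test_output_dir: Path, snapshot_dir: Path) -> list[Path]:
    """Copy every PNG found under test_output_dir into snapshot_dir.

    Both directories are hypothetical; the talk does not say where the
    real script reads from or writes to. Files that share a name in
    different subfolders overwrite each other in this simple version.
    """
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for image in sorted(test_output_dir.rglob("*.png")):
        target = snapshot_dir / image.name
        shutil.copy2(image, target)
        copied.append(target)
    return copied


if __name__ == "__main__":
    # Throwaway directories stand in for real test output.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "test-output" / "WeatherView"
        out.mkdir(parents=True)
        (out / "preview_light.png").write_bytes(b"fake image bytes")
        (out / "preview_dark.png").write_bytes(b"fake image bytes")
        found = collect_snapshots(Path(tmp) / "test-output", Path(tmp) / "snapshots")
        for path in found:
            print(path.name)
```

Putting everything in one flat, predictable folder is the point: the agent instructions can then name a single location to look at after each run.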
We use this workflow to build the
ChatGPT iOS and Mac app. You could do
the same on web, for example, with tools
like Storybook or Playwright.
So, that's my workflow. I give Codex
some tools to generate screenshots so we
can verify the UI code that it writes.
Let's check in and see how Codex is
doing.
Okay, so if I scroll back,
looks like it wrote some code, started
with a plan
uh to review the existing code, uh
implement the UI, and provide preview
data to verify that it's good. So,
great, looks like it wrote all that
code, ran the snapshot tests.
Cool. So, uh
not, I guess, not bad for 3 minutes. Go
ahead and run that app.
Cool.
So, obviously this is a very simple
example, but it actually scales without
many changes to much larger projects
like the ChatGPT app.
And it can run for many hours depending
on the tasks, iterating over and over
until it's pixel perfect.
And speaking of working for many hours,
I'd like to pass it over to Freal, who's
going to show us how to scale these
verification loops to run for longer
periods of time and more complex
problems.
Thank you.
[Applause]
Thanks, Nacho.
I'm Freal, and I work on developer
productivity.
Here at OpenAI, I've set high scores for
the longest sessions or the most tokens
produced.
I'm known as the guy that gets Codex to
do this,
for being able to use Codex to one-shot
big features and complex code changes.
I've seen the GPT-5 Codex model work for
over 7 hours productively. That was my
prompt. Or process more than 150 million
tokens over the course of a marathon
session.
This is one of those projects.
It's a complex refactor, a major feature
for my personal JSON parser project.
And for large projects like this, there
can be long periods of time where it
seems like
all of the tests are failing until the
work is complete, especially when you're
making that core change.
Now, this is a JSON parser built for
streaming tool calls,
a parser for the AI age.
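To make "streaming" concrete: tool-call arguments arrive as text fragments, and the consumer needs to know when a full JSON value is available. The sketch below is not the speaker's parser. It is a deliberately simple buffer-and-retry approach built on Python's standard `json.JSONDecoder.raw_decode`, and the class name `StreamingJSONBuffer` is made up for illustration.

```python
import json


class StreamingJSONBuffer:
    """Accumulate text chunks and return each JSON value once it is complete.

    Limitation: a bare top-level number split across chunks (for example
    "1" then "2") would be read too early. Objects and arrays, which is
    what tool-call payloads normally are, do not have this problem.
    """

    def __init__(self) -> None:
        self._buffer = ""
        self._decoder = json.JSONDecoder()

    def feed(self, chunk: str) -> list:
        """Add a chunk and return any values that are now fully parseable."""
        self._buffer += chunk
        values = []
        while True:
            # raw_decode does not skip leading whitespace, so strip it first.
            text = self._buffer.lstrip()
            if not text:
                self._buffer = ""
                break
            try:
                value, end = self._decoder.raw_decode(text)
            except json.JSONDecodeError:
                # Incomplete (or invalid) so far; keep it and wait for more.
                self._buffer = text
                break
            values.append(value)
            self._buffer = text[end:]
        return values


if __name__ == "__main__":
    buffer = StreamingJSONBuffer()
    chunks = ['{"name": "get_wea', 'ther", "args": {"city": "Par', 'is"}}']
    for chunk in chunks:
        print(buffer.feed(chunk))
```

Re-parsing the whole buffer on every chunk is quadratic in the worst case, which is one reason a purpose-built incremental parser is worth the effort.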
And this PR has over 15,000
lines of code changed, and it was
created over many hours of work from
Codex,
but only a few minutes and a handful of
prompts from me.
Let's walk through how I go from prompt
to pull request.
We'll do this in just a couple prompts.
First, we'll tell Codex that we want a
plan to implement our feature.
Then, we're going to review that plan
and tell it to execute. And finally, we
ship.
Here, I've opened my project in VS Code,
and I'll open up our Codex extension as
well.
Uh I have a fairly complex feature I
want to implement, and I've prepared
that in a document for me to
read.
And I'm going to tell Codex that I want
it to write a plan to implement this
feature, and I've described the end
state.
But I want Codex to do the heavy lifting
for me and research how to integrate
this library into my parser.
So, what I do is I ask Codex to write a
spec.
And I'm going to go ahead and kick that
off.
And actually, I'm going to turn off auto
context here. A little bit of an aside,
I've rehearsed this a few times, and
it's actually found finished specs from
my Git history and cheated and copied
right to the end of the process.
So, I'm going to have it really do the
work.
And uh
you'll notice I don't need to tell it a
lot. I've given it my example, I've told
it to do some research,
follow the example of the code that I've
already got.
So,
while that's working, let me show you
what I mean by a plan or an exec spec.
This is my plans.md file. I've
abbreviated everything here so we don't
have to read all of it. It's 160 lines.
Uh but really what I'm doing here is I'm
writing a design document for design
documents.
Codex is now a senior engineer, after
all, so we should be asking it to do
some of its own paperwork, too.
And like most engineers writing a design
doc, it's going to start by copying from
an example. I've got one here.
And I tell it,
you know, this plan
is going to be a living document. It's
going to describe its big picture. It's
going to have a to-do list and progress
that it keeps up to date.
And I also want to explain
why I keep on saying exec plan.
I'm doing that because I want to
give the model a term to anchor on, so it
knows that when I use the term exec plan,
it should use plans.md to design that, to
iterate on it, and follow up.
It's good to give it a term that's
unique so that it knows to reflect back
on that. And when I say that, it's
something special, not just any design
doc or implementation spec.
So,
in this spec, we've got our progress,
our surprises and discoveries. We even
have a decision log in here for me to
keep track of what it's been working on.
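As a rough illustration of that shape, here is a sketch that renders an empty exec plan with the sections named above and ticks off progress items. The section names come from the talk; the Markdown layout, the function names, and the checkbox convention are assumptions, and the real plans.md template may look quite different.

```python
# Section names taken from the talk; the exact layout is an assumption.
SECTIONS = ["Big picture", "Progress", "Surprises and discoveries", "Decision log"]


def render_exec_plan(title: str, todo_items: list[str]) -> str:
    """Build a skeleton plan with an unchecked to-do list under Progress."""
    lines = [f"# Exec plan: {title}", ""]
    for section in SECTIONS:
        lines.append(f"## {section}")
        if section == "Progress":
            lines.extend(f"- [ ] {item}" for item in todo_items)
        else:
            lines.append("(to be filled in and kept up to date)")
        lines.append("")
    return "\n".join(lines)


def mark_done(plan_text: str, item: str) -> str:
    """Tick the first to-do line that matches the item exactly."""
    target = f"- [ ] {item}"
    lines = plan_text.split("\n")
    for index, line in enumerate(lines):
        if line == target:
            lines[index] = f"- [x] {item}"
            break
    return "\n".join(lines)


if __name__ == "__main__":
    plan = render_exec_plan("Streaming parser refactor", ["Spike: try the library", "Write docs"])
    print(mark_done(plan, "Spike: try the library"))
```

Because the plan is plain text with a fixed set of sections, a reviewer can skim the Progress list instead of reading every diff.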
Now,
normally I don't ask engineers to write
this much. I only do that when maybe I
don't like their project.
[Laughter]
But in this case, this helps Codex steer
towards a completed project. It is its
memory as it works on this large plan.
And after this talk, we'll upload the
plans.md recipe to our OpenAI cookbooks
so any of you can adopt it in your
repositories.
Now,
how does it know
how to use this plans.md? As I mentioned
earlier,
I've used my agents.md.
I drop a couple lines here in my
agents.md, just a few instructions. When
you're working on something complex,
this is what an exec plan is, refer to
plans.md,
make sure that you're following that.
Now, as you can see, it's doing quite a
bit of research on the side here, so
let's go ahead and look at a completed
spec.
So, I've switched over to a completed
session here, and it's written my spec.
Let me open up that plan here.
So,
I can review this. I can give it
feedback. I can look at it. Okay, that looks
like, you know,
quite a lot of words, but it is what I
wanted to do, and it has a plan.
Looks like a couple spikes, some
features that it wants to implement, and
of course documentation. So, that looks
good to me. I'm going to go ahead and
tell
Codex,
let's go ahead
and implement.
And we can't type today. There we go.
And
so while that runs, uh I like to keep an
eye on Codex. It keeps something
scrolling on my screen. My manager knows
that I'm still working.
And I like to watch the tests. So,
what I'll do is I'll kick off these
tests. They run very fast. Uh Codex
helped me write all of these, by the
way, from
simple uh unit tests to exhaustive
property tests. There's even some
fuzzing in this crate.
And uh so, I'll keep an eye on this, and
if it stays red for too long, I might
intervene and say, you know, Codex,
maybe we need to back out. Maybe that
plan is going a little off the rails.
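The "red for too long" rule of thumb can be written down as a tiny counter. This is only a sketch of the idea: the class name and the threshold are invented, and the speaker describes watching by eye rather than using any such tool.

```python
class RedStreakMonitor:
    """Count consecutive failing test runs and flag when to step in."""

    def __init__(self, max_red_runs: int) -> None:
        # max_red_runs is an arbitrary, illustrative threshold.
        self.max_red_runs = max_red_runs
        self.red_runs = 0

    def record(self, passed: bool) -> bool:
        """Record one test run; return True when it is time to intervene."""
        self.red_runs = 0 if passed else self.red_runs + 1
        return self.red_runs >= self.max_red_runs


if __name__ == "__main__":
    monitor = RedStreakMonitor(max_red_runs=3)
    for outcome in [False, False, True, False, False, False]:
        print(outcome, monitor.record(outcome))
```

A passing run resets the streak, so a long refactor that goes green now and then never trips the alarm.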
All right, let's go ahead and look at
what it's completed in this project. So,
I'm skipping ahead to Codex having
finished that task. By the way, that
took over an hour in my previous
session, so we're skipping ahead quite a
ways. And it looks like it's written
some new tests.
Um they're all passing, which is great.
Uh let's go ahead and look at the
changes.
Okay.
Wow, and it looks like it vendored in
and even maybe forked or updated the
upstream library to make some changes to
implement what it needed to do.
Now,
again, I don't have all day to read all
of this to you, so I'm going to go ahead
and open up the plan again.
So, I open up the plan, and I can see in
the progress, it's checked off some big
items. It's completed some spikes.
It's updated documentation in the
readme.
Plans.md specifies that all of these
plans have to be a living document, and
so I can use this as an executive
summary to know what it's accomplished.
That way I don't have to read all the
code myself.
Okay.
It looks like it's done and the tests
are passing.
So,
uh
what I've shown you today is we can go
from
uh
an implementation idea, a feature
prompt, to a PR in only a few steps.
Rigorous planning and thorough testing
enabled the model to work on this
feature for a sustained period of time.
And let's just see how many lines of
code it's written.
Crashed.
Okay.
Okay.
4,200 lines of code and just about an
hour of work.
Incredible.
Now,
I could just merge this as is, but I
would really like another set of eyes on
this code.
Thankfully, we have Daniel up next to
talk about code review.
Hello.
All right, my name is Daniel and I'm an
engineer here on the Codex team. So, uh
today I want to talk about code reviews.
As Thibault mentioned, we launched code
reviews on GitHub a couple months ago
and it has been a huge hit.
Um
both externally, but especially
internally, we love code reviews. We
have them running on all of our PRs and
it's finding so many bugs that we would
have otherwise missed. And some of these
bugs are so complex that you have to
like read and reread the comment a
couple times to even understand what
it's saying. So, I highly highly
recommend you enable code reviews for
all your GitHub PRs.
Um
here's an example of one of my PRs
that's on the Codex repo. It's open
source.
So,
I uh
pushed a feature and then immediately
Codex started reviewing my code and it
found a P1 issue. Great.
Uh so, then I said, "Thanks, Codex.
Please fix it." And that kicked off a
background task um to make that
change. And then once that got merged, I
said, "All right, Codex. Um
now that you have all this, now that you
have this change, review it again. Make
sure we don't have any issues."
And then it found another issue.
And then uh I was just embarrassed.
So,
this got me thinking.
What if you could have a workflow where
you create a feature
and then you review it for bugs and then
if there are any bugs, you fix it and
then you review it again and then you
fix and review and fix and review until
theoretically your code doesn't have any
issues.
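That review-and-fix cycle is a plain loop with a stopping condition. In the sketch below, `review` and `fix` are placeholder callables supplied by the caller, not Codex APIs, and the `max_rounds` cap is an added assumption so the loop always terminates.

```python
from typing import Callable


def review_until_clean(
    review: Callable[[], list[str]],
    fix: Callable[[list[str]], None],
    max_rounds: int = 5,
) -> tuple[bool, int]:
    """Alternate review and fix until a review finds nothing.

    Returns (clean, reviews_run). `review` and `fix` are placeholders
    for whatever performs those steps.
    """
    for round_number in range(1, max_rounds + 1):
        issues = review()
        if not issues:
            return True, round_number
        fix(issues)
    # Every allowed review still found something to fix.
    return False, max_rounds


if __name__ == "__main__":
    # Canned review results: two rounds with findings, then a clean one.
    pending = [["P1: missing null check"], ["P2: stale cache key"], []]
    print(review_until_clean(lambda: pending.pop(0), lambda issues: None))
```

The cap matters in practice: without it, a reviewer that always finds one more nit would keep the loop running forever.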
So,
uh we decided to make this super easy by
bringing code reviews to local as well.
And I'm going to show you how to do that
with slash commands. And this is what I
do every day before I even submit the
PR.
Okay.
Uh so, I'm working on a little feature.
Uh you can see it has like three
different commits. It's a pretty small
one.
Um and I have the CLI running on the
side. So,
all I have to do is write
/review,
hit enter,
and then you'll see there are a couple
different options here.
So, the first option is reviewing
against a base branch, just like a PR.
So, this would take, well,
all of the commits on your branch,
compare them to main,
uh just like a normal PR, and then look
at the whole diff and try to find any
uh issues with it.
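For reference, the "whole diff, just like a PR" can be produced with plain Git. The sketch below does not claim to be how the review command computes its diff; it only shows the standard three-dot form of `git diff`, wrapped in two hypothetical helper functions.

```python
import subprocess


def branch_diff_command(base: str = "main") -> list[str]:
    """Build the git command for a PR-style diff against `base`."""
    # "base...HEAD" (three dots) diffs HEAD against its merge base with
    # `base`, so only the changes made on the current branch are shown.
    return ["git", "diff", f"{base}...HEAD"]


def branch_diff(base: str = "main", repo_dir: str = ".") -> str:
    """Run the command inside repo_dir and return the diff text."""
    result = subprocess.run(
        branch_diff_command(base),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(" ".join(branch_diff_command("main")))
```

The three-dot form is the useful one here because commits that landed on main after the branch was cut do not show up as spurious changes.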
There are other options, too, like
reviewing uncommitted changes or
a specific commit or custom review
instructions, but what I usually do at
the end of the day when I have a bunch
of different commits is just review the
whole thing. So, I select the first
option. Now, I have to select a base
branch. Usually main is the first one,
so I hit enter again.
And now code review begins.
So,
a question I have is
why is it so good?
Why is GPT-5 Codex so good at code
reviews? Because we actually trained it
specifically on finding very technical
bugs and it will go on for a very long
time researching all sorts of different
files and then when it has a hypothesis
for something that could be wrong, it'll
even write tests, um scripts, execute
them to make sure that it gives you like
one or maybe at most two critical issues
that you have to fix before you land
your PR. It doesn't give you like 20 or
30 different things that it one shots
from just looking at your diff. It
doesn't waste your time.
So, um
yeah, there's actually a bug here.
Uh
If anyone gets... Oh, nice.
It got it for us.
Um so, it's a P0. Great. Um
and it's exactly correct. So, we aren't
supposed to be hardcoding the string
here in the code. We should be getting
that dynamically. So,
all I have to do now is tell Codex,
please fix.
Uh and usually I don't even read the
comments, so
uh
it just goes.
But,
yeah, and the nice thing about
reviews in the CLI is that it actually
spawns a separate thread from the
parent. So, let's say you've been
working on this feature and it is like
super biased, you know, you have to do
this feature like this, you have to
implement it like that.
The review thread is separate. It has a
fresh pair of eyes, a fresh context, new
chat, uh so it doesn't have that same
implementation bias and it'll help find
these bugs for you.
Uh so, yeah, that is going to go ahead
and uh you know, give us some
changes. While that runs, I want to
actually show you how you can enable
um reviews on all your PRs. So, go to
chat.openai.com/codex.
And then
you just connect your GitHub.
And then there's a button here called
enable code review.
So, this will take you to the code
review settings and you can have like
repository level settings to say like I
want this repo to get code reviews, I
want that one to not, but I just have
this toggle over here that I just say,
"Review all of my pull requests. Please
make sure I don't ship any regressions
to prod."
So,
let's go back.
Fantastic.
Uh it made the change. Let's see.
That looks correct. Yeah, now it's
getting the prompt directory
dynamically. So, now that this is done,
what I want to do is I want to run
/review again.
So, I hit /review, enter, enter.
Great. So, this will start another
review thread. And then once that goes
on, hopefully it won't find any issues,
but if there are any issues, you can
continue it again.
Um and then once that's done, it gives
you a thumbs up.
You commit.
You push to Git. And then you get one
final thumbs up from uh Codex on your PR
and you're merged.
So,
that is what I do every day
using /review in my daily
workflow before I even create a PR.
Thank you so much and I'll hand it back
to Thibault to wrap it up.
[Music]
All right, folks.
That's it.
I hope today's demos gave you a glimpse
of
how we're shipping faster and with more
confidence with Codex and a little bit
about where we're going.
If you haven't tried Codex yet, just npm
install. This will give you Codex right
in your terminal. Then you just type
codex and you can get going and use a
lot of the things that we demoed for you
today. Everything we showed to you today
is real and you can use it right away.
Gabriel Peel, one of the people here
working on the Codex team, actually just
sent me a message that v0.45
of the CLI is out like right now. It has
a few
incremental updates and also support for
uh OAuth MCP, which I think is very
cool. Uh so, just go and install it. Um
and this will give you the latest
version. And then if you want to hang
with a few of the people building Codex,
uh just come and join us at the booths.
Uh there will be some of us there and
also some of the, you know, top users of
Codex here at OpenAI. We also have
uh a Q&A
on uh Discord that you can join and uh
this will uh start shortly. So, come and
say hi. Don't be shy. And uh thank you
for joining today.
[Music]
[Applause]
Thibaud introduces Codex as an AI software engineer, describing it as a human teammate for pair programming and delegation, which has experienced tenfold growth due to recent updates. Codex now functions across various environments like IDEs, terminals, GitHub, web, and mobile. Key technical improvements include an overhauled agent with the GPT-5 Codex model (smarter, better code style, adaptive thinking) and a rewritten harness supporting planning and long conversations. The CLI was revamped, an IDE native extension was launched (gaining 100k users in a week), and Codex Cloud tasks run 90% faster with visual verification capabilities. Codex also integrates with tools like GitHub and Slack for task delegation and solution generation. A significant feature is its ultra-thorough code review capability, powered by GPT-5 Codex, trained to find critical technical bugs, which can be automated on GitHub PRs or used locally. Internally at OpenAI, 92% of technical staff use Codex daily, leading to a 70% increase in PR submissions and almost all PRs being reviewed by Codex. Demonstrations included Nacho Soto using Codex for UI development with visual snapshot verification, Freal leveraging it for complex, sustained refactors over many hours using detailed "exec plans," and Daniel showcasing local code reviews with automatic bug fixing and re-review cycles. Users are encouraged to install Codex via npm to use its features immediately.