Build Hour: Agents SDK
1352 segments
Hey everyone, welcome to OpenAI Build
Hours. I'm Christine. I'm on the startup
marketing team and I'm here with Steve.
>> Hi everybody.
>> So you might might remember Steve.
Welcome back to build hours.
>> Thank you. It's great to be back
>> from our last session on responses API.
Um but today we're going to be talking
about the agents SDK.
So if this is your first build hour, uh
the purpose of build hours is to empower
you with the best practices, tools, AI
expertise to help you build with our
APIs and models. Down below on your
screen is our homepage. So you can find
all of our past sessions including
Steve's one on responses API on demand
and this will also be on demand after
our live session. So to give you a
little snapshot on what we're going to
be talking about today. So, first we'll
be going through a lot of the new
updates on agents SDK. There's now a
codec style harness um that's really
improved it and Steve will walk you
through all of these new releases. Um we
also have some new releases in the API
that we'll walk through and then we'll
move into a live demo. So, this will be
the majority of the time that we'll be
spending today. Um and we'll be building
an agentic uh task tracker and then as
always we have a lot of time set aside
for Q&A. So on the right side of your
screen, you'll see a Q&A tab. So feel
free to submit your questions. Um we
have our team who is dialing in
virtually to answer them during the
session as well as saving some to answer
live towards the end. And we'll have a
very special guest be joining us
virtually uh to help answer some of
these. So with that, I will turn it back
over to Steve to walk through some of
these slides.
>> Awesome. Thanks, Christine. Uh so great
to be back on Build Hour. My name is
Steve. I'm an engineer on the API team
here at OpenAI. First thing I want to
talk about is some of the updates we've
made to the agents SDK in the last
couple months. Um kind of uh kind of the
backstory here is that models are
getting way better at doing work over
long periods of time. Uh a lot of folks
have seen the sort of meter chart that
shows the ability of models to kind of
work uh on their own for longer and
longer periods of time that's just kind
of going continually up and to the
right. And there are a lot of great
examples of this uh even within our own
products. So many of you know and love
Codeex, our agent coding tool that you
can use to kind of build software and do
tasks from end to end. You've probably
seen it run for maybe minutes or up to
an hour. If you've used the goal
feature, maybe you've seen it run for
much longer than this internally. Folks
have gotten Codex to run for days, up to
a week on tasks. So these models are
really really able to kind of do things
over really long trajectories, which is
really cool. Um we also uh have this
sort of security agent that's built on
top of Codeex. um you know the models
are getting really really good at
finding security vulnerabilities
especially in legacy software and so we
have an agent that's powered by codecs
that actually scans repos and our our
own code and code that we depend on for
vulnerabilities that might affect the
security of our systems and then we also
have an internal data agent that's kind
of powered again by codeex that's
connected to our data links able to
answer any questions so I can say
something like how many people are using
the agents SDK how many requests were
there in the responses API two days ago
and where previously I might have spent
an hour writing that SQL query, now it's
just a simple prompt and a few minutes
later I have an answer which is which is
really awesome. Uh the reality is though
that building production grade agents in
your own systems is is still pretty
hard. Uh it's pretty hard to kind of
strike this balance between maximizing
performance and getting and having sort
of a really uh flexible maybe cross
model provider platform. um because it
more and more the models are kind of
being tailored to the individual
harnesses and so it's kind of hard to
stay in distribution while also
maximizing that flexibility and that
customization. Um there's also a lot of
challenges that come with sort of modern
computer using agents. So we think of
codeex as a coding agent uh that can do
things within a sandbox. Most most of
the time when you're running codeex that
sandbox is just your local laptop. So
those two things are together. We'll
talk about how we switched that up with
the agents SDK. But if you're deploying
agents in production, oftentimes you
want to be able to run these agents in
sort of an isolated environment and then
you have to deal with problems like the
container dies, it expires, it goes
away, what do you do? Does that thing
loadbearing? Did you have state on that
then rehydrate? Uh this stuff is is
really really hard. Uh and then kind of
the third thing is like at the end of
the day, you want to be able to have
this sort of like really customizable
framework you can use to bring your own
data in and uh maybe add your own skills
and and all of this stuff like this. And
so this combination of things is kind of
makes deploying agents in production a
little bit challenging. Uh the cool
thing about the agent SDK, we've added
we've made some updates to it, but sort
of the baseline is this open- source
customizable framework that maybe you've
used since we released the agent SDK
last March. It's been around for a
while, continues to be open source and
and super flexible. Uh we've added in
this concept of sandbox using agents and
you get to decide where these sandboxes
are run. So if you are you know maybe a
modal user or an EDB user or a
cloudflare user and you're kind of
already using those sandbox products or
maybe you're new to sandboxes and new to
sandbox using agents and you want to but
you're already using one of these
platforms we have provided first class
support for using the sandbox products
in a bunch of these uh different
platforms like ETB modal cloudflare in
addition versell uh blackel daya uh and
just things like docker and uh if you
just want to use your local laptop for
testing. Those are kind of all supported
first class. And then on top of this is
this Codex style harness. So bringing in
all of the great stuff that makes CEX
Codex, all the great tools that are in
distribution with our models, but work
great with other models as well. And
then some of the cool things like skill
use, uh, computer use, memory, and we'll
be demoing some of this stuff here in a
little bit. So uh, yeah. So basically
what the agents SDK is is it provides
this sort of model native harness that
the that the agent can use in order to
kind of do productive work over long
periods of time. So if we think about
maybe what building an agent might have
looked like you know a year a year and a
half ago where you're just kind of using
an API directly an LLM API directly you
would have done something like you know
you set up a loop you make a request to
an LLM API you get some stuff back maybe
you get a tool you execute that tool you
update the context and then you kind of
repeat until the model has no is no
longer calling any tools you might be
layering in stuff in addition to that
like web search file search MCP kind of
like building spending most of your time
building this orchestration layer around
this loop where actually you should be
spending most of the time building
product building stuff into your product
that that makes it it much better for
your end users and so agents SDK is kind
of designed to help you do all that
stuff and not really think about the
orchestration. So the agent loop is
built in again very CEX-like in its uh
design. Uh also web search, file search,
MCPS, code interpreter, skills, these
are things are all you get out of the
box or really easy to turn on. And then
we've kind of added in some of this like
sandbox using functionality that I'll
talk about uh pretty shortly. The thing
that I'm most excited about that we've
kind of done here is we've split the
harness from the compute. And uh what
does that mean basically? So if you can
kind of imagine a world where the model
is working sort of directly on a uh
computer that it itself is running on.
So if you can kind of imagine like
codecs running on your laptop is the
sort of example where the harness and
the compute are tied together. Uh
imagine putting that into production
right so you have sort of a container
where the agent is both running. So like
the the loop that is calling your LLM
API in addition to working on the same
file system those things are together.
This kind of creates a few problems when
you run this in production at scale. A
your sandboxes become all of the sudden
loadbearing and if one dies or goes away
then all of a sudden all that state is
gone and you don't have some sort of
external place where maybe you can
refresh from. In addition, you have to
sort of do interesting uh sort of
gymnastics to manage your secrets. You
ideally don't want to have any secrets
on your sandbox otherwise you might be
vulnerable to prompt injection attacks
or exfiltration. And so if you kind of
split those things up, then you can
treat the sandbox as this totally
ephemeral thing that you don't really
have to worry if it lives or dies. It
can expire, go away. And then the uh
harness, which is running maybe in a
temporal job or in AWS or you know kind
of wherever your infrastructure is
already running can sort of handle the
rehydration and snapshotting and all
this stuff. And the agents SDK is going
to make that really easy. And I'll kind
of show some examples of this. Um, so
agent SDK kind of like already fully
packed with a ton of features that you
might have already been using things
like web search, file search, uh, you
know, agent memory, text in addition to
some modalities that you've probably
been using, uh, text and voice. Um,
there are some new things that are we're
pretty excited about that we've added.
So this is kind of first class support
for skills. Uh, containers is kind of a
new thing sort of in the API, also in
the SDK. And then we've also added agent
memory so your tasks can improve over
time um and get better as they as they
go. So some of the kind of core features
here again uh code a really you know
codeex inspired harness a lot of those
same tools sandbox use so your models
can work with files in sort of a
virtualized environment you know do work
over long periods of time write bash
write python write whatever to kind of
accomplish tasks and then as always open
source super customizable
socy harness what kind of concretely
does that mean um the kind of things
we've brought from codeex into the
agency SDK are these things like uh
autoco compaction uh especially you know
this computer use through through uh the
shell. So the model is able to write
shell commands. Um we've you know the
codec style loop is this sort of async
shell interaction loop where it can
write a command maybe wait a little bit
if it doesn't finish it can go do
something else and then kind of come
back to those things and it kind of
keeps track of what commands it has
running at any given time. And so we've
kind of brought that into the agent SDK
to allow you know coding agents to work
kind of in any domain in any
infrastructure. And then sandboxes are
super crucial because you know modern
work means working over a lot of files.
If you think about kind of the work that
you and I do every day maybe you're
working in a big codebase with a bunch
of files editing files and and making
your changes. Maybe you're working with
a bunch of PDFs. Maybe you are you know
creating word documents or powerpoints.
a agents are no different and they need
access to those files and a kind of a a
a place to create those files and store
them. And so that's kind of where
sandboxes come into the picture. It's an
isolated environment where the model has
access to some files and can do stuff
and and produce uh meaningful output.
And uh kind of again agent SDK open
source and customizable. It's been open
source since day one. Super, you know,
uh model agnostic. If you want to use
other model providers, you're more than
welcome to. Anything that kind of uses
the responses API format will plug in
really well. Uh it you can make this
multi-tenant super easily. So if you
want to have a system that's running the
agents SDK where you're kind of
processing many requests from users at
once again super easy. Made session
management really simple. So agents SDK
is going to manage the snapshotting uh
rehydration of tasks. You can kind of
start from where you left off. All that
stuff is really simple. Kind of show a
demo of that. And then really
configurable. You can bring your own
tools, bring your own MCPs and and kind
of all that stuff. So I want to switch
gears a little bit and talk about some
other cool things that we've shipped in
the API in the last couple months. Uh
first one is this hosted shell tool in
the responses API. So idea here is I
want to make a just a quick API call. I
want to provide some files there that I
want the model to do something with. Um
and behind the scenes we're going to
spin up a container, load that up with
the files that you've asked us to. The
model can write code, do some stuff in
that container and then return the
results back to you. So think about this
as like a lightweight version of the
agents SDK. It's kind of this ephemeral
container. You can put a few files in
there. The model can work in that
container and then it can finish and you
can you know create these out of band.
We have a containers endpoint where you
can go and create a container, upload
files to it, spin it up, attach it to a
responses API request or you can kind of
use it in auto mode where you just add
some files and then we'll spin up the
container for you and then spin it back
down when you're done. Really easy way
to kind of get started with this, you
know, sandbox using paradigm.
We've also added the ability for you to
kind of control network access in the
containers you spin up. So if you want
to do things like domain allow lists,
lock it down entirely so you have no
egress or ingress into that container.
Uh these are kind of like all things
we've added so you have a little bit
more security control over your
containers. And we've also added the
skills API. So if anybody has used
skills in the past, there's kind of like
a pretty common problem that comes into
play which is that you kind of need a
central source of truth for these
skills. whether that's maybe a GitHub or
a bucket or now we have a skills API
where you can kind of upload your skills
too. Uh and a skill is kind of this
bundle for for those who haven't used
it. It's kind of a bundle of files that
has sort of one skill MD file that has
all the top level stuff in it, but they
can also contain scripts or other
resources that the model can use to kind
of do a pretty specific task. So, the
example I always use is like maybe you
have a tax prep skill that kind of
defines like, okay, here's all the
things that I need to know as an agent
in order to do somebody's taxes for like
tax year 2025. Has all the like IRS
rules and maybe some scripts to like
process documents and like fill in a
1040, something like that. Um, in the
with the skills API now, you could
upload that, you can iterate on it over
time and create versions. You can set up
default version and then reference those
versions with the hosted shell and all
that stuff works together super
seamlessly.
Uh yeah, so kind of a couple curl
snippets here of how you might use this
thing. Pretty easy. If you already have
zip files of your skills, you can just
kind of upload them directly to the API.
And then the last thing is we've also
shipped uh the sandbox agent stuff that
we will be talking about today in
Typescript. This is a big ask from a lot
of folks after we ship the Python
version in April. Uh sand the TypeScript
version is now available. So if you are
a TypeScript user, uh knock yourself
out. So, uh, want to switch gears a
little bit and kind of go over to the
demo. Uh, we're going to be kind of like
building a task tracker with the agents
SDK and kind of making it aic and then
offloading all the work so I don't have
to do anything anymore.
So, we're going to
Oops.
Flip over here. And uh I kind of put
this kind of like vibe coded this uh
kind of like sort of like a linear
board. Doesn't really do anything right
now, but I can kind of create tasks and
uh assign them to folks. So the sort of
idea here is that we are going to be
doing some conference planning. So you
can kind of imagine this is a tool that
I might use as a conference planner to
put together some talks. So I might have
tasks for you know creating program
assets or refining somebody's talk or
you know uploading stuff to the website
things like that. Um my goal here is not
to do any work myself. So we're going to
kind of automate a lot of this with the
agents SDK. So I'm going to create a
fake task here and I'm going to say uh
you know create program assets and we'll
say we can leave this blank and we have
a couple of things down here. where we
have we can have a assignment. Uh I need
to refresh this.
Sorry guys, I need to reset my demo.
Cool.
Okay, there we go. So I can uh I can
create a blank task here that's we can
say create program
assets
and uh I can assign this. Right now it's
Right now it's just me, but just got a
sneak peek at what might be coming in a
few minutes. Um, and I can upload files
to this. So, if I just go ahead and grab
my fake files here,
can drag these in. I can create an
issue. Uh, and in a pre-agentic world, I
would assign this to somebody. Maybe I'd
assign it to myself and I would I would
do this work, but that's not really that
much fun. So, we're going to go ahead
and create an agent to kind of help us
automate this. So, if I hop back here,
uh, we are just going to comment in this
agent definition. So, I want to take a
sec to kind of walk through what we're
looking at here. So, if you've been an
agent SDK user in the past, this will
look pretty familiar to you. Uh, the
sandbox agent is a new type of agent.
Uh, it actually is just a subclass of
the agent class that you might have been
using before. So, a lot of the same
things will still apply. A lot of the
same parameters are still going to be
there. We are having we have a couple
new things that we'll kind of talk about
in a sec, but suffice to say that we are
adding some instructions that are kind
of defined up here. Uh, and then we'll
be adding some other stuff later. So, if
I go back to my demo, refresh,
should be able to assign this to the
program editor agent. And what I'll say
is like, please take a look at these
files and uh, edit them for clarity.
I'll hit send
and then we should be able to see that
kickoff. So the first thing that's going
to happen is uh because I've configured
this first agent to use docker as its
sandbox can actually see that docker
container running and uh the agents SDK
is going to handle uploading these files
to that container. So if I pop this
might be a little hard to see but uh we
can see that a lot of these files have
now made it into the container. So the
agent is going to be sandboxed to just
this folder and we'll have access to all
of these files that it can then use to
look at. We can as we kind of look over
here, we can see that it's executing
commands. It's kind of like looking
through the files and figuring out
what's what. And then it's going to kind
of produce some output. And so it just
finished and it gave us it just kind of
gave us a little log of of what it did
and kind of told us that it's missing
some data here. That's kind of okay for
now. So,
uh, this is great, but we kind of want
to be able to customize this agent and,
uh, kind of give it some like really
specific context that is pretty
particular to what it needs to do. And a
really good way to do that is through
skills. So, prior to this demo, I kind
of put together this skill that's for a
conference, the conference program
editor. It's got some stuff in here.
Again, this is kind of entirely vibe
coded, so the actual content doesn't
matter. But what I want to point out is
that GitHub is actually a really great
place to store skills and kind of
addition to the skills API. The reason
GitHub is so great is because you can
have a lot of this a lot of these uh
kind of concerns of producing and
editing skills are kind of taken care of
for you. So you know if you want to have
version control over your skills, if you
want to have pull requests so that you
can review changes to skills, this is
all really great and a pretty common way
that people are starting to create
repositories of their skills is kind of
in Git. And we have first class support
for that in the agent SDK. So if I come
back here and I comment this stuff back
in
and hop down, we can see that we have
this uh skills capability. And I want to
pause for a sec to kind of talk about
what a capability is in the new SDK.
Capability is kind of uh this object
that is able to combine a bunch of
different concepts all into one. So if
you want to bring in uh tools, if you
want to bring in additional
instructions, maybe add some stuff to
the manifest so that a capability can
put things into the computer when it
starts up. These are kind of all things
capability can do. The sort of default
set is file system. So this is the
ability for the model to view images and
run the apply patch tool. So kind of
like edit files in line. And then we
also have the shell tool which is
actually a combination of of two tools.
This is as I mentioned earlier exact
command and write standard in. This is
this async bash command flow. And then
we have compaction which is allows the
model to compact its context and keep
going even when it exceeds the context
window. And this allows the model to
keep working for a super long period of
time, hours, weeks, technically uh
there's no limit to to how long it can
work. So these this is kind of the
default set. We're also adding the
skills capability. And then what we're
saying here is that we want to load our
skills from this git repo. And so I'm
pointing it at this git repo that I just
created. Uh we're saying we want to pull
from main. If you're not using GitHub,
you can actually change the host here,
but uh it's we're kind of like using
GitHub by default. So I'll go ahead and
save this. We'll flip back to our demo.
I'm going to ask the question, tell me
the most
interesting
thing from your skill file.
Will it send? This is going to kind of
spin back up, but I want to take a sec
now to actually talk about how we do
resume uh like pausing and resuming
behavior between sessions at the agents
SDK. When you start a sandbox agent, the
agents SDK will handle spinning up that
sandbox for you based on how you define
it. So, if I hop back to the code here,
we can go and say for by default, we're
using Docker. And so, I'm kind of just
providing these two classes here to kind
of define where I want to run the agent.
In this case, I'm saying I want to use
the Docker sandbox client. the and I can
provide some options. Different backends
might have different options. Docker, I
can provide an image. In this case, it's
just uh the Python 312 image. And uh you
can kind of easily swap this out for any
of the backends that that I mentioned
before. And we'll kind of show an
example of that in a little bit. Um but
what's cool is that when that when this
task spins down, when it stops, the
agent will automatically stop that
container and then we'll take a snapshot
of the file system and put it somewhere
you define. Right now that place is just
my laptop. So if I go over here
uh we can see some of the snapshots. So
this is kind of the a couple tasks that
I recently created. And if I untar one
of these and look, it's basically just
the exact file system that the model was
using. So when I resume that task, it is
grabbing that tarball from the place
that I defined, spinning up a new
container if the last one doesn't exist
anymore and then rehydrating that file
system onto that new container so that
it to the model looks exactly the same
as it did when it left off and the model
is actually unaware of the fact that
it's running on a new container is
totally oblivious to that. And this
makes resuming really easily easy and
seamless. And if you don't want to use
your local file system or tarballs, we
have sort of provider specific ways that
you can do this. So if you're a modal
customer and you use volumes or their
snapshot concept, this stuff plugs right
in super easily. But kind of by default,
we're using this sort of like naive
tarball approach to kind of help make
this easy across any provider. So uh we
can kind of go back and if I refresh
should be done. And it was able to kind
of like look at the skill file. If I
look at the output here, we can see that
it probably was kind of graping over
this skill file. Uh, it did quite a bit
quite a bit of stuff. So, um, but it was
able to kind of give me some output
here, kind of, uh, telling me what was
the most interesting thing here. So, now
we have an agent. It can use a skill. We
want to kind of hoist this thing up into
the cloud. We don't want to be running
this in Docker forever. Docker's great
for testing, especially on your laptop,
because just kind of works out of the
box. It uh is a really great place to
kind of do local development, but it's a
bit hard to deploy in production.
There's a bunch of companies that have
built really great first class agent
sandbox products that you can use that
we've added first class support for. In
this demo, I'm going to be using modal,
but you could also be using Cloudflare,
ETB, Verscell, Daytona, Blackel, uh or
you know any of the other you can kind
of bring your own implementation if if
you choose to. So I've already set up a
uh modal kind of app here for us. So I'm
going to go back to my
uh code here and we'll go to step three
and we will enable modal and here we're
kind of returning this sandbox provider.
Um but the kind of the key thing here is
just this client and options which are
the things that the the agents SDK
expects and here we're just kind of
providing this modal sandbox client. If
I go look at the definition of this
thing, uh this just inherits from base
sandbox client just like the docker
version and we are saying here that uh
we are going to use the same image. So
that same Python 312 image and then uh
in addition we're going to be using R2
for snapshots. So that both our
snapshots are now off my laptop and into
the cloud. Additionally, the modal
sandbox is going to be moving into the
cloud as well. So we'll kind of show an
example of what this looks like. So save
all this. Go back to my
here and we'll say uh edit these files
for clarity.
Go back,
load this,
and we'll assign it to our program
editor and we'll say get started,
please.
And so now we're starting it. We can see
that we're starting a modal sandbox
instead. And if I hop over to modal, I
can see that now we have a sandbox uh
that's starting. Zoom in a little bit if
that's hard to see. And this will take
uh usually they start up in just a few
seconds or around a second. So now it's
running. What's happening behind the
scenes is the agent SDK is handling sort
of the applying all of the files that we
have defined onto that sandbox. And so
want to take a sec now to actually go
and look at this manifest object that we
talked about. And a manifest is is
pretty simple in concept. You're
basically saying this is what I want the
file system to look like when the agent
spins up. Here's how I kind of attach
files to this task in a way that I can
share across tasks or share across
agents. So manifest is just kind of a
simple class. Um we're saying the
entries are this tree here basically
kind of defining a directory structure
and this this is a pretty simple
example. So we're just saying we want to
kind of establish this base directory
structure and then we'll upload files to
the sandbox directly. But you could also
here plug in things like files if you
want to copy things from, you know,
wherever the harness is running. Or if
you want to plug in an R2 bucket, S3
bucket, an Azure blob store account, all
these things are fair game, a GitHub
repo, kind of a whole flexible set of
primitives here that allow you to put
the files on the container that the
model needs in order to kind of get
started. And what happens is that we
render a version of the file tree to the
model when it starts up so that it can
kind of see an example of its workspace.
Uh so that it doesn't really have to do
too much grepping or lsing to understand
what's on the file system. And so you
can kind of add descriptions to these
things too, but we won't get too far
into that. Um so now if I go back to the
demo, we can see that this finished. Uh
and what I want to show here is that now
if I go to my Cloudflare account and
refresh, we can see that we now have
snapshots that are stored in the cloud.
So if I want to resume this task and I
dep have this app deployed in
production, it's really easy for the the
agent to just kind of like pull its
context for the agent to pull its
context and the snapshot from R2 instead
of you know anywhere else. Uh it just
provides like a fast pretty clear
interface for this and so just want to
give a quick demo of that. So now uh we
kind of want to we have this agent it's
deployed in the cloud. It's pulling its
snapshots from another cloud provider
and we want to go ahead and add our own
tools into it because you know bringing
our own context and our own
functionality into agents is is pretty
critical. So if I go back to my program
editor agent, I'm going to comment this
back in
and we have some predefined tools that
we built sort of before this demo.
There's there's three of them. We have
one for updating the status of a task.
We have one for updating the assenee of
a task. And then we have one for
searching all of the assignees. So, uh,
if you've used the agents SDK before,
function tools work exactly the same way
as they always have. They're essentially
just Python functions or in TypeScript,
TypeScript functions, and you decorate
them using this function tool decorator.
And what this does is it allows us to
translate this function call into sort
of the API representation of the
function tool with all the parameters
uh, clearly defined so the model knows
what to call. And then when the agents
SDK receives from the API an instruction
to call a specific tool, it routes it
automatically to this function, calls
this function and then sends the
response right back. So you don't have
to sort of like be implementing this
function calling loop yourself. Uh and
so we will uh just show a quick demo of
this. I'll go back here and I'll say um
please assign this
task to Steve and mark it as ready for
review.
Let's send. While this is spinning up,
I'll kind of cover some of the other
cool things you can do with the function
tool. Um kind of a bunch of things you
can do here. So if you want to kind of
override the name, like what the model
sees is different from what the name of
the function is, you can do that. You
can provide a description to kind of
help steer the model to how to use this
tool. You can provide guard rails around
uh whether you know if the model's call
trying to call the tool with specific
arguments. You can add hooks around that
to kind of you know control the
execution flow. You can add timeouts and
things like this which is which is
pretty cool. Um and you can kind of like
set whether it's enabled kind of
dynamically maybe based on a feature
flag or something else. Uh so you don't
have to like comment and comment out
these these lists of tools which is
which is pretty cool. So go back to our
demo here
and we should see it uh assign the task
to me which is cool. So we can kind of
see like using these tools and this is a
great way to sort of bring your own
context or your own functionality from
your app into your agent whether maybe
you have want to give it access to your
database or want to have it be able to
you know uh use Slack through an MCP or
something like this. These are kind of
like great ways to bring in your own
functionality into the SDK and into your
agent. So, uh, now we are going to talk
a little bit about like tool call
approvals. So, ideally, um, there are
certain tools that maybe kind of require
a human of the loop step in order for
you to have confidence that it's kind of
doing what you ask and not going off the
rails. If you have been a Codex user in
the past, you're pretty familiar with
this. It tends to ask you a lot, do you
want me to run this bash command or am I
allowed to access the internet? This
sort of thing. You can do this in the
agent SDK pretty easily by just
providing a function. Uh you can I can
either say you know true directly or I
can provide a function that returns true
or false dynamically based on the
parameters. And so in this example we've
said if you're trying to set the status
to done then it needs my approval first.
I'm not going to let you mark something
as done without taking a quick look at
it first. And so if I refresh this and
go back and we'll go back to this other
task we have and I'll say please mark
this task as done.
We'll send that and this is going to
spin up a sandbox rehydrate the state.
Uh it will kind of like examine its
current state and then it should call
the tool and then we should kind of see
a tool call approval um pop over and
over here. Uh while this is working
another thing we can kind of show is the
ability for the agents to kind of hand
off to each other. So let's say I want
to, you know, I kind of have this
built-in mechanism here where we have
the ability to search and update the
assigne. And this is great for agents to
kind of work together. So I have two
agents now in my system. I have one for
producing assets. Um, and then I have
one for just like the program editor
agent that we looked at earlier. And so
I can go ahead and create an issue new
issue here and say like uh please
um
refine and uh
this content
and produce some assets. We'll hit uh
drop some files in our same set and then
we'll assign this to the program editor
agent and we'll say like please uh
refine this content
and then assign to the what's the other
one called asset producer agent
and we'll create that. Then we'll kick
this task off. the program editor agent
will be the first one to pick it up. It
will kind of do its thing and then it
will uh reassign this to the asset
producer agent and then that agent will
kind of kick off and then you know be
able to create some assets and then put
them back in the sandbox. And so kind of
coming back to my uh previous task here
it looks like this marked as done. If I
go to activity here I can see that we
now have this request uh for approval.
And if I go ahead and approve this, then
this will kind of kick the flow back off
and then it will eventually kind of get
marked as as done here.
So, uh let's see. I think the last thing
I want to show here is sort of the
ability to
to mount uh external buckets to your
manifests in a way that uh allows you to
kind of bring in external data. So, what
are the reasons you might want to do
this? Let's say you have either a ton of
files, let's say you have like hundreds
of PDFs that you need the model to be
able to work through and it's kind of
annoying to just, you know, have to copy
all of those to a sandbox every time you
want to spin it up and have it do some
work. Or let's say you have data with a
a strong freshness constraint, so the
data is changing all the time and you
want the agents to be able to work over
data that is like super live. In either
one of these cases, mounting data from a
centralized source of truth instead of
copying it in where it might already be
out of date by the time the agent starts
working is pretty critical. And we've
built that natively into the agents SDK.
So if I go to
uh
if I go here, what I'm going to do is
kind of comment this in. And we've built
this uh R2 attachment store. And so this
is kind of doing a couple things. This
is a taking our uploaded files and then
instead of just putting them on my
laptop and then copying them to the
container, we're actually now uploading
them straight to R2. So that's going to
be our source of truth just for our
application layer. And then back in the
manifest, oops,
uh we're saying if we are using this R2
attachment store, then we're actually
going to use a different manifest here.
And we have a similar layout. So we have
task and then these four folders. But
instead of the input being an empty
directory that we then copy files into,
we're instead saying I want to use this
S3 mount type. And we're technically
using an R2 bucket, but we're using the
S3 mount. The reason is is because R2
and S3 are API compatible. So we have
one object for them both. But as we look
through the parameters here, we can see
we're using this R2 bucket name passing
passing we're passing the access key and
secret key. And then we're using this uh
modal cloud bucket strategy. A lot of
these sandbox providers, modal included,
have native ways to mount external
volumes to their sandboxes. And so we're
using the sort of built-in one for
modal. But if you're using a Docker
container or sort of any generic
container, we provide a couple out of
the box. You can either use our clone or
fuse. And these both work equally well.
And it will depend a little bit on your
systems which one will work better for
you, but they're both provided out of
the box. So pretty easy to just plug
this stuff in. And so now uh I will go
and create another issue here. We'll say
edit these assets again please. And
we'll grab our files.
Drop these in.
Hit create.
And
we should see so this has been I didn't
assign this but we'll sign in a sec. Uh
what I want to show here though is now
we have if we go back to our bucket we
now have tasks uh and we have a a
reference to our task here that has all
the files in it. So now if I go back and
then assign it to our program editor,
program editor and say get started,
hit send, it's going to be able to we'll
watch this thing spin up and then it
will be able to read and write to this
volume just like it would if it were on
the file system local to the container.
So you can kind of bring in these places
of context of you know file storage from
kind of anywhere across the internet and
the model will be able to access it as
if it were on the local file system.
It's completely opaque to the model that
it's actually over operating over the
network. So there are some trade-offs
here. You might incur some additional
latency as the model is kind of like
lsing maybe a large bucket that takes
you know a long time to list all the
files through. Uh, but there's sort of
this, you know, really nice trade-off
that you get where you get to, you know,
maintain your freshness constraint or
have it work over a ton of files without
having to sort of like copy all them all
onto the container where you might be
sort of like limited in space. Um, so
cool. We can wait a sec for it to see
the spin up, but that kind of brings us
to the end of the demo here.
Uh, you know, we kind of covered a bunch
of cool things. Uh we showed how we can
build agents that both inspect files,
spin up sandboxes, and kind of any
environment that you choose, maybe one
that you're already using. Uh agents
that can edit code and kind of like work
on these long horizon tasks for maybe
minutes or hours or longer uh over, you
know, whatever period of time that you
want and whatever context you want. So
yeah, with that, I think we'll do some
questions.
>> Yeah.
>> Cool.
>> Okay. Um next slide. I'm really excited
to welcome Nish on stage with us. So
Nish is product for agents. Nish, do you
want to give a quick intro?
>> Hey guys. Uh yes. So product manager on
agents. Uh really excited about the
stuff we're doing with agents SDK. I've
been answering some of the questions in
the channel. Really great questions and
uh yeah, excited to get into it.
>> Awesome. Okay. Um next slide here.
Perfect. Um, so the first question, when
are harnessed based agents best versus
building with just responses API?
>> Cool. Yeah. So I think that if you the
responses API is great for a huge
variety of tasks, you know, I think the
places where it really shines is and
it's an agentic model by default and or
agentic API by default and we kind of
talked about that in the responses API
build hour from October. um it really
shines I think if you are using if maybe
you're using a language where you can't
find a really good uh sort of agentic
framework that you like. So maybe you're
building an elixir, maybe you're
building in HASLL or something a little
bit more say esoteric, you can always
just build on the API. You can kind of
recreate a lot of these concepts
yourself. It might be a little bit more
work. Um the you know sort of like
harnesses are becoming uh sort of really
key to how the models operate. So in
particular, the codeex harness with
openAI models is sort of like becoming
more and more coupled as time moves on.
And so you know in the future, it's very
very possible that you will be able to
get the best performance from the
harnesses that the models are trained
with. And uh so that's where kind of the
harnesses are are really going to shine.
Um and also if you you are using uh an
environment where you can kind of drop
these things in, it just is making if
you're building agents, it's going to
make a lot of this stuff super easy. Um,
however, if you're doing, you know, kind
of oneshot tasks like maybe a
translation use case where you just need
to like just oneshot translate a
document or, you know, taking
unstructured data and structuring it
into JSON or some other format,
sponsor's API just out of the box is
going to be really good at that stuff.
And you can use all the hosted tools to
kind of have these like really light
agentic loops that are uh pretty easy to
to orchestrate in just one API call.
>> Cool.
>> Okay, next question. Does the agents SDK
feature any built-in out ofthe-box
persistent state management to handle
this pause and resume behavior natively?
>> Yeah, totally. So, I showcase a little
bit that snapshot mechanism. That's
totally out of the box. That works if
you just start using it. Um, the only
thing you have to define is kind of
where you want to store the snapshots,
whether it's in R2 as I mentioned or
just kind of local to the to disk which
is the default. Um, and so this is kind
of that, you know, really out of the box
thing where it will take one sandbox's
file system and then zip it up and then
put it somewhere else. Uh, so that
another sandbox can resume from the
point where it left off. Um, but also
the run state is sort of managed. So you
can think about two pieces of state
here. It's kind of like the file system
and then all of the messages in that
conversation so far that the agent will
need to continue from agents SDK manages
handling kind of like both of those
pieces in in one shot. And so you can
take that thing that has both a pointer
to where the snapshots stored and then
also the full roll out and then you can
store in your database. It's just a JSON
object and then load it back up later
and then you can kind of resume
losslessly maybe if you have sort of a
multi-node system or something like this
that just has the database to share
state.
>> Nish anything to add there or we can
move on to next question. Yeah. No, I
think that covers it. Like essentially
we store the roll out and we also
snapshot the file system which is uh a
little bit unique compared to other
SDKs. Um so you should be able to uh
resume very like long running uh dayong
agents pretty easily.
>> Cool. Okay. Next question. How would you
compare the code executing harness in
the agents SDK via the sandbox agent
versus the actual codeex harness? Any
capability? Uh Nish, you want to take
this one or?
>> Yeah. Yeah. Um so there are some sort of
small uh differences between the sort of
like actual codeex binary and the agents
SDK. Agents SDK does come with a lot of
those core capabilities. Uh some of
those are like the bash tool. Uh the way
codeex edits files, skills, memories. Um
there are some things in the codeex
binary that uh we are still going to
like build into agents SDK eventually.
uh the codex binary is like this sort of
like multi-threaded thing that you can
run uh multiple different sub aents in
agents SDK we have this concept of
handoffs so you can do something like
very similar in agents SDK you should be
able to do like a lot of the things that
you can do in the codex binary and
agents SDK um but yeah ultimately like
agents SDK is like very much uh in
distribution uh through the Codex
harness today.
>> Awesome. Thanks Nish. Um, next question.
Can agents persist and restore
longrunning workflows automatically?
>> Yeah, definitely. This is kind of what I
showed in in the demo. It's kind of like
very easy with very little. We didn't
look at a lot of my orchestration code,
but there's not really that much to it.
Just kind of kicks off the SDK, stores
the the roll out, that JSON file in a
file just on the laptop, and then is
able to kind of resume uh pretty easily.
So yeah, I think mostly most of the sort
of engineering that you'll do is
figuring out what tools to bring in,
what what skills should I add to the
agent and uh the sort of orchestration
of starting and stopping and resuming
tasks is is sort of uh just out of the
box.
>> Okay. Can the sandbox get auto destroyed
after the task is completed or is it a
better pattern to keep the sandbox
running?
>> Yeah, this is a good this is a great
question. Uh so I I actually only demoed
one mode of the ages SDK. But there's
two ways that you can use sandboxes. You
can do the sort of what we call um you
know SDKowned sandbox where when you
spin the task up if you just provide it
the recipe for creating the sandbox. So
you provide it one of the clients like a
docker client or an ETB client it will
handle spinning up that sandbox for you
and then when the when the turn is done
it will spin that sandbox down. But you
can also eagerly load a sandbox. So you
can kind of create a sandbox out of band
and then pass the reference of the
sandbox to the agent and then that's
what we call like user own sandbox and
so you would handle managing the
lifetime of that thing. It the SDK won't
automatically spin it down at the end of
the turn. So you can kind of keep it
warm and then use it across like
multiple turns instead of spinning it up
and spinning it down. Um, but the out
of- thebox behavior is that we'll handle
spinning it up and and and down so that
you don't end up in a situation where
you've built your demo and then all of a
sudden you have 100 sandboxes running in
the cloud that you're paying for and
it's kind of providing a bad outcome. So
there's a bunch of different ways that
you can use this. Um, but yeah, by
default we the HSDK will will handle the
lifetime.
>> Hey, how should I populate and persist
my file system? Yeah.
>> Yeah. Happy to take this one. Um, so
essentially we have this concept called
the manifest uh in agents SDK. Uh, the
thing that I really like about the
manifest concept is you can kind of
choose how you want to populate and
persist uh the file system into the
sandbox. Uh, so you can essentially copy
over all of your files from cloud
storage into the sandbox or you can
mount uh, you know, all of those files
uh, onto the sandbox. And there's sort
of like different benefits of of either
if you copy over everything like during
run the runtime the the model will be
able to read the files a lot more
quickly but you know startup will be
slower because you are copying like sort
of this big blob uh into the sandbox and
there's sort of the second world where
you can optionally mount something on
the sandbox and uh startup will be like
uh very very fast but like over the
runtime it has to like read these files
over the network. Um, but the nice thing
about agents SDK is we kind of like let
you do multiple different ways to kind
kind of like mount or copy over things
in the sandbox and then persist that uh
after the sandbox spins down.
>> Awesome. Thanks, Nish.
>> Okay, last question here. Can a
supervisor agent monitor and coordinate
hundreds of specialized agents
simultaneously?
>> I think the answer is yes. It would
definitely it would take a little bit of
work to kind of spin this up. And we
have some uh cool stuff around sort of
like multi- aent frameworks that's kind
of coming in the near like weeks and
months. Um so it will get a lot easier
to do this in the future, but uh there's
nothing kind of stopping you from doing
this now. You can kind of, you know, go
crazy with agents and spin up a ton and
then have one that can kind of check on
the status between them. I think like
ways we've seen good coordination
patterns is either by through messages.
You can give the parent agent a tool
that allows it to understand what sub
aents are running and then check in on
them. So kind of like check in on their
progress or you can sort of like have
them all communicate via the same file
system or like a database or something
like this with kind of a variety of
communication mechanisms you can use to
coordinate them. But yeah, nothing
stopping you from from doing this. And
um you know, I think that this sort of
massively parallel work will become sort
of the default in the near future.
>> Love that. Okay, so let's wrap up with
some resources. So on the screen here
are everything you need to know to kind
of get started with agents SDK. Um we
had a ton of questions also in the chat
about your demo and where they can where
they can find this. So on the very
bottom is our build hour repo. Um we
have all the code from the last build
hours and we'll be we'll be posting that
um very shortly right after right after
we're off camera. Um then for any
upcoming build hours, feel free to check
back on our homepage. Um so if you hit
the next slide, we show you just the
link again. Um and after this session,
we'll email out a survey. Let us know
what topics you want to hear about next.
And we'll also include the link for this
recording so you can catch up if you
missed the beginning. Um you can catch
get up to speed um there. And that's all
for now. Thanks so much for joining us
and we'll see you next time.
Ask follow-up questions or revisit key timestamps.
In this Build Hour session, the OpenAI team introduces the latest updates to the Agents SDK, highlighting its ability to handle long-running tasks using a Codex-inspired harness. The session features a live demonstration of building an agentic task tracker, showcasing features like sandbox isolation, persistent state management through snapshots, and the use of tools, skills, and cloud-based deployments like Modal to run production-grade agents.
Videos recently processed by our community