WF2026: Autoresearch & Keynotes ft. Anthropic, Google DeepMind, Amazon AGI, Sonar, Arena, Recursive
12127 segments
Heat. Heat.
Heat.
Hey, heat. Hey, heat.
Heat. Heat. N.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat. Heat.
Launch control. We have a go. Roger.
I need
I need it.
I need
the right
baby.
Baby
Ladies and gentlemen, welcome to the AI
Engineer Worlds Fair. Thank you for
joining us as we continue an exciting
week of innovation, technical insights,
and conversations shaping the future of
AI. Now, please join me in welcoming
your MC, developer advocate at IBM, Tjas
Kuman.
Good morning, AI engineer.
We are here. We made it. We are here. It
is day two. It is such an honor and a
privilege to see so many of you here
today. This conference has broken
records, right? Last year uh was was way
fewer. This year, 7,000 people.
Incredible. Huge round of applause. This
is this is it. This is it.
This is where it happens. Listen,
there's announcements. There's
takeaways. There's content across 18
tracks. 18 track. There's expo sessions.
There's breakouts. There's all kinds of
things, right? And and undeniably, I'll
say this, there is value. Yes. If you've
got value, make some noise this morning.
Absolutely. Absolutely. I have learned
so much uh from so many brilliant people
here and and I have no question that you
have as well. Uh we had an incredible
keynote yesterday. We had so many
keynotes yesterday um where where Swick
started the conference talking about
loops. Um the the theme was was loops.
Why is that funny? Um
okay, it wasn't a joke but um but we had
more keynotes after that about the the
golden age of AI, right? Um one thing
that really stuck out to me and I'm sure
many of us was wiring the agent into the
intent upstream really unlocks more
work. When we start to say why things
are important, we're able to unlock more
work and quality work. We don't just
hand it the task, but we say do this and
and this is why and this is how you
verify, this is how you deploy. We we
get so much more done. Um Teresa talked
about reliability, how important it is.
She talked about the 30x productivity
gap between leaders and laggers. Uh
showing us that really it's about
reliability more than anything else. Um
huge focus about evals at this
conference. Um, and finally, I I was
really struck by by Daksh yesterday who
talked about uh reviewing 1 million AI
generated PRs and and found some
incredible insights. If you didn't catch
that, I highly recommend the videos, the
live stream. So cool. Uh, one thing that
stood out, Claude, uh, code generates,
what was it, three times uh, more off
bypass vulnerability code, unfortunately
for now, but it's just so cool all the
insights that come out of this. Um,
today we've got a lot of things. It is
jam-packed day and I'm very, very
excited about it. There's the newspaper
if you haven't yet read the news. We
have a newspaper now analog uh just to
balance you know the AI. Uh so there's a
there's a daily print newspaper
available for you. There's a live stream
audience. Hello live stream. Thank you
for joining. Um there is over 100 expo
partners. Anyone been to the expo? These
expo booths are incredible. I've seen so
many cool things. There's robots lying
around. So much stuff. There's also this
cool device that I got the B one of the
sponsors. uh it's a notetaker but for
in-person meetings. Anyway, check out
the expose. It's so incredible. Uh we've
got 3.5 days of expo and four uh stages
as well, expo stages. So, look forward
to that. We want to offer a huge thank
you and a massive round of applause for
the incredible sponsors. Honestly, this
conference would not happen without the
support of our sponsors. So, please
everybody, your hands together for the
sponsors of the conference.
We've got Microsoft, the presenting
sponsor. Keep it going. We got
Microsoft.
We've got the lab and platinum sponsors.
We've got You've got to keep it going.
We've got the gold sponsors.
We've got silver and bronze. We've got
so many sponsors. And this conference
genuinely would not be possible without.
So, we're very very thankful. Um, now we
get to introduce we get to open the
state. This is so cool. Today is going
to be such an incredible jam-packed
agenda and I hope all of you can make
all that you want. I mean, there are
quite a few tracks, but don't worry. Uh
there's a live stream, there's also
videos. We're going to start introducing
our first speaker. Oh, I'm excited about
this one. Who saw the announcement about
Fable yesterday?
>> Yeah, let's go. I This is so exciting.
So, so uh coincidentally,
the first talk has changed today. Uh
we're going to This conference moves at
the speed of AI. It's so cool. Um, our
first speaker, uh, Tariq comes to us
from Antropic. Give it up for Tariq.
Comes comes to us from Antropic.
Oh, I'm excited about I was talking to
him backstage and I said, "What what's
this going to be about?" Um, this talk,
I think the first time it's ever been
given, if I'm not mistaken, is about is
going to teach us all how to work with
the new mythos class of models uh, of
which Fable is going to be soonly
available. So, your biggest round of
applause for Tariq.
Please welcome to the stage member of
technical staff at Anthropic, Tariq
Shihipar.
Hey everyone, I'm Thoric. Uh, I work at
Enthropic on Cloud Code. Uh, before we
get started, we have a tradition on
cloud code where we take a selfie before
a talk. So, if you don't mind, if you
strike a pose with me, I'll, uh, take a
quick selfie at AI engineer.
Okay. Incredible. Well, uh, yeah, to
kick things off, like we said, Fable is
back. Um,
we're rolling it out later today. Uh,
keep stay tuned for exact timeline. Me
and Cat Woo and Simon Wilson will be
doing a fireside chat at 12:30. We might
have some updates for you then. Um,
but Fable is a model I'm just so so
excited about. It's one of those
anthropic models where you just like
you're just going to remember it. Like
Sonnet 3.5 new, Opus 4, Opus 4.5. It's a
model that I just have a lot of like
affection and excitement for. And the
best way to describe Fable to me is like
the the map is opening up, you know,
like you are playing like an RPG and
you've been on the tutorial and now you
get to the point where the like, you
know, the open world starts, right? And
there's so much that you can do and
explore. Uh but there's also it's also a
little bit intimidating and confusing,
right? Because there's so much you can
do. And so what I wanted to do in this
talk is give you guys a field guide to
fable, right? How do you work with this
new class of models?
So I've got four parts to it. I've been
working on this as a series of articles
and blog post. Uh but you know when we
announced Fable was coming out I was
like okay let me do uh all of this at
once at the talk uh you know uh
speedrun. So there are four parts
unhobling claude finding your unknowns
dealing with the grief and being
unreasonable.
So first unhobling claude
uh I think something we say really often
is that the models are grown not
designed right we don't wake up and be
like we need 99% on su bench right like
the models are you know something we we
grow carefully we give it data and
feedback and compute um but ultimately
it's you know something that we it's a
little bit organic and we sort figure
out and learn with the model as we use
it. And so um that what that also means
is that what contains them is us, right?
The harness we put them in and the way
we prompt them is basically like a
function of our understanding of Claude,
right? And by unhobling it, I mean how
can we understand Claude better to
unleash it? And we need to understand
Fable more. So I think one of my points
is that you know uh we're still so early
and I think there's a lot more
understanding in Fable uh to unlock
and uh I think I'll give you a quick
example about how models get smarter
because it's a little bit unintuitive
right like there I saw this viral tweet
a couple weeks ago being like you know
why can't LLM say which Pokemon end in
aw there are a thousand Pokemon right
and turns out there are two who whose
names end in AW crocodile and dreadnot,
right? And it turns out if you ask like
a normal chat model, it can't answer it,
which is kind of confusing because like
you know it definitely knows all the
names of the Pokemon, right? But if you
uh ask cloud code, it can, right?
Because what it does is that it fetches
every Pokemon and writes a script to
filter for AW, right? And so this is
what I mean by like unhobling claude. We
call this capability overhang, right?
Cloud gets smarter in spiky ways. So it
doesn't just remember every Pokemon and
reason through it, but if you give it
the code execution tool, it can find the
two Pokemons that end with AW, right?
And so this is I think part of the
challenge with Fable is figuring out
this capability overhang. What is now
possible? And I think this is like a
discovery that I'm excited to go on with
you. Uh to make this a little bit
clearer, I'm going to talk about a few
different examples of how models have
progressed in the past. Um one of the
big examples obviously is like chat. You
know the chat models were had to be
given context, right? Like maybe you
paste in your codebase and maybe naively
you might have thought like you know the
way we solve coding is by the context
just gets really large and I can just
paste in my entire codebase. You know
it'll be a 100 million context window.
But it turns out that instead if you
give it arms like you give it the bash
tool and ways to work with the
environment it can build and search its
own context and that's sort of like the
insight that led to cloud code right and
so again spiky like a new like
innovation kind of right in how we think
about and work with the model and then
recently we've rolled out cloud tag uh
and what's sort of unlocked cloud tag is
its ability to work proactively and
multiplayer uh cloud code, you know, is
something that you have to prompt for it
to do work, right? And uh this ability
for cloud to wake itself up and do work
is something that we think is unlocking
the new wave of agents. But there's
there's more here. So for example, uh we
recently removed 80% of the system
prompt for cloud code, right? And this
is one of the ways in which models, you
know, and what they need uh changes over
time. So originally like you know maybe
back in Son of 3.5 new the best
practices for a system prompt was a
small system prompt few tools and lots
of examples right and then as the models
get smarter you can give them more
information and more instructions and
they start following them and so it's a
larger system prompt with lots of
examples and many tools right but most
recently we found this new class of
models want fewer want a smaller system
prompt the examples tend to constrain it
because it's actually more imaginative
than the examples we give it. And so uh
and we tried to give it con context and
not just constraints. We really try and
avoid being like do not do this. Um
which is really necessary for the
previous models. Um and so this is like
a way that the system prompt is changing
and and probably will continue to
change. Uh another feature I really like
is the ask user question tool. This is
something I worked on when I first got
to cloud code and and it's uh when
claude, you know, a is is planning or
wants to ask you a question, it can show
you a multiple choice dialogue. Uh for
Opus 4, it could barely call it. I had
to like really tweak the tool to make
sure that it was uh that it would work,
right? And then sometime opus 4.5, I was
like, well, what if I asked it to like,
you know, ask me 40 questions about the
spec, it can start interviewing me,
right? And so its ability to ask
questions jumped, right? And then most
recently with Opus 4.8 and Fable, I can
now build a whole HTML report with the
questions embedded inside of them. And
uh it's just like a whole new way of
interacting with uh with Claude, right?
And and so this progression of like how
Claude can get information from you is
also changed. Um speaking of which, uh
markdown and HTML is something I've also
talked a lot about. um you know it
turned initially markdown was a a good
output for the model um you know it
could show a little bit of rich
information and then you know with plan
mode it started to be for you like you
could understand what cloud was about to
do um and now you know claude can build
you these in-depth HTML reports right
and so again a way of this the models
getting smarter in a spiky way I really
like to emphasize that this is closer to
a biology than a physics, right? It's
still very empirical, very organic. Um,
we don't know all the rules, but there
is some sort of science behind it,
right? Like there is an intuition to
build as well. And so I really, you
know, encourage you to treat Fable like
that. Uh, one of my favorite papers uh
that at Enthropic that we've written is
on the biology of a large language
model. Um, all of our research papers
are meant to be read by, you know,
people with various degrees of technical
expertise, but this is one of my
favorites. So, uh, if you're looking to
learn a little bit more, suggest you
check it out.
But, so, uh, yeah, we talked about
unhobling Claude, but it turns out when
you're working with Fable, you also need
to unhobble yourself, right? And so, one
of the things that I think a lot about
is that the map is not the territory,
right? When I'm working on a coding
problem, the plan and prompt and spec
that I have in my mind is the map,
right? But the territory is the actual
codebase, the real world, the
constraints that Claude needs to
navigate, right? And whenever Claude
runs into something in the territory
that's not in the map, I call that an
unknown, right? Claude has to figure out
what to do about it. It's a decision
point that I haven't specified. And
Fable is one of the first models where I
felt that like I really have to figure
out my unknowns because if not it's
going to traverse such a large area that
like it's going to run into a lot of
them. So how do you figure out your
unknowns? Um
it I fable bottleneck my abil by my
ability to match the map and the
territory to find my unknowns. So a few
um few ways to think about this. I like
to think of it in a a matrix. So like
for any problem, I have a bunch of known
knowns. This is usually like what I
write in my prompt. What do I want?
Right? Then I have known unknowns.
Things that like I know I haven't don't
really know yet, but I just haven't
figured out yet. I can um uh yeah, then
I've got unknown known. Like what's so
obvious that I just wouldn't write it
down, you know, but I I know it when I
see it, right? And then finally,
unknowns. Unknowns. What haven't I
considered at all? What do I not know?
right? Like what is something that if I
knew could change how I prompt Claude?
And and luckily you can use Claude, you
can use Fable to find your unknowns. So
I'm going to go over a few examples of
how I do that with Fable. Um the first
is I like to do what I call a blind spot
pass. So I like to say something like,
"Hey, I'm working on a new O provider
that I know nothing about uh like in
this codebase. Can you do a blind spot
pass to help me figure out my relevant
unknown unknowns and help me prompt
better? Right? And so this like might
have Claude go through the the O module
and figure out like, oh, you know, this
is kind of like a hairy little uh dead
end that comes up a lot. Maybe searches
my git diff or slack. I might tell it
where there's context, right? So that I
can learn about, you know, all the
gotchas. And and you can use this very
broadly, right? You can use it to teach
you about new fields. I I recently did
this for color grading when doing video
editing. Um because I think this is
really powerful and and Fable is
incredible at it. Um in many ways the
model knows more about you know almost
everything than I do. I just need to get
it out of it. Um then I like to use
brainstorms and prototypes. Uh this
helps me figure out my unknown known
right things like especially for design
for me it's like know it when you see
it, right? So, I might ask it to uh
create a dashboard. Um, and I tell it I
have no visual taste. Uh, make me an
HTML page with four wildly different
design decisions so I can react to them,
right? And and you know, you tweak this
as you want, but like the idea is to
sort of get an idea of like what are the
things that you uh you know, you can't
describe in words, right? And uh like
work with the model to help figure that
out.
Uh then f then interviews. So once I
have an idea of like, you know, this is
what I want to do. Uh there's probably
still a lot of like uh unknowns here,
right? Where I might not have considered
something. I might not have specified
it. And so I'll ask Clog to interview
me, right? And I'll give it a little bit
more context in any of these questions.
Like giving it a little bit more context
about you and the work and the stage
you're at, like, hey, yeah, prioritize
questions that would change the
architecture is extremely helpful.
Uh then references. One of the best ways
to give Claude a map is to give it
another map, right? So instead of me
writing out the spec, uh I can just say,
"Hey, here's some code that represents
what I want to be done, right? It could
be in a different uh system or language.
Uh but just read this code, understand
it, and then use that to start your
work, right? And uh again, this can be
in a lot of different ways. If I'm
making a a React component, I might have
an HTML mockup that is my map, right,
that I pass in as a reference. I think
this is really really powerful and Fable
is really incredible at it. Uh something
else I've like really appreciated is
implementation notes. So if uh while
you're running Fable uh and it runs into
an unknown, ask it to log it, right? So
that um you uh you can see where the
deviations happened and then you can
sort of figure out why as well. you
know, we'll usually give you some
context about what happened.
And then finally, I like to get a fable
to quiz me about what happened. Uh, just
to make sure I understand what I'm doing
and I can represent this work, you know,
when I'm creating a PR or merging it.
Um, this is a really great way of like
making sure that you're like really in
the loop with Fable. And I think that's
like one of the most important parts of
Fable is like staying in the loop and
making sure that you uh you get what you
want. So
um those are some of my tips for working
with Fable. Uh I also want to say that
the first time I used a mythosclass
model uh used Fable I felt both a huge
sense of like gain but also a sense of
loss and I I wanted to talk a little bit
about that you know um when I think
about coding before LLMs it feels like a
foreign country you know like I used to
run a YC startup about 30 people and we
were just constantly forced into
trade-offs because of how hard code
right? Like we could make the the app
fast or we could try prototyping a new
feature and and this might take a month
or this would take two months and so we
had to choose and it was just really
really hard. Um and now I went back to
that codebase a couple weeks ago and I
thought about some of the things that I
wanted to do and uh it was just way
easier. It was like the things that
would have taken me weeks I could do in
hours, you know? And uh at some point
it's like yeah like how can you not
laugh but also how can you not cry
honestly like it's like one of these
things where um I really really loved
programming and writing code by hand. I
love the feeling of like seeing the
codebase in my mind and like rotating it
but I also remember just you know like
staying up late nights trying to debug
working on things for weeks without
working right. I just remember swimming
in failure. I just remember that like
the most of the projects I've ever
worked on have failed. Most startups go
bankrupt. You know, I think just overall
programming and coding is extremely hard
and like as much as I enjoy those highs,
I I can cannot go back, right? And uh
the way my reflection here is like the
only way out is through, right? There's
still a lot to learn with the coding.
There's a lot to learn with Fable. Uh
but I think if we try really hard and if
we like stay in the loop, we unhobble
it, uh we can get there, you know, and
we can come out on the other side, uh
with just um so much more. And so the
last bit I wanted to talk about is is
the so much more part, right? I call
this being unreasonable.
Um one of my favorite parts of anthropic
is that we believe that trade-offs are
not real. Um, like I think that very
often I like in my previous company I
was very used to being reasonable. So
I'd like write down this list of
priorities and I'd be like, "Well, I
guess we can prioritize this against
this, right?" Um, and uh, like, you
know, that makes sense. So we'll we this
will be our priority this quarter, but
what if you uh just did all of it? You
know, what if you forced reality to show
you the trade-off, right? Um, this is
something I've really valued at our
culture and anthropic. And my reflection
going forward is that I'm going to be a
lot less reasonable. Um, I think one of
this like the math of Claude and Fable
really changes how you think about
trade-offs. And there are so many
trade-offs that you make implicitly in
your head, right? Like good, fast,
cheap. Now it's pick three, right? Um, I
think that like the best way to like do
more ambitious work is to uh like
reframe and make big make ourselves more
ambitious because I think the only way
to prove that agents work is to do the
best work of our lives faster than ever
before. Um, you know, for example, I
made this deck last night in about four
hours with Fable. I feel like it's a
it's a deck I really like and I I really
enjoyed it, but I also um you know did
it really fast. Uh and I think that if
you're here, you know, at AI engineer,
the world is kind of looking at you to
prove that AI works, right? That it's
not just like a fad or something, but
that it can make us more productive and
also save us time. And and that's my
resolution for this year is to work, be
more productive, but work less and spend
more time with people I really care
about.
Uh, I think it's also worth calling out
that building is easier, but generating
value is still hard. And I think this is
something that we run into, you know, as
AI engineers sometimes where we think so
much about the process of building and
our our setups. Um, but the the point is
to generate value, right? And uh there
it takes a lot of swings. It takes a lot
of tries to find the valuable stuff. Uh,
but that really is the goal. And that's
like you know again what the world is
looking to us to prove that AI can
really transform it. So to to end I just
wanted to say like go explore make it
real and uh yeah be less reasonable.
Thank you.
Please join me in welcoming the chief
executive officer at Sonar, Tariq
Shakat.
Morning everyone.
Do you enjoy that last talk? That was
amazing. Um, you particularly love the
end, the being unreasonable part. I
thought that was awesome. Um, I also
want to just I'm trying to calculate the
odds of Tar following Tar as the first
two sessions in the morning. Uh, I think
the odds are pretty low on this one, but
uh, thrilled to be here today. Um, as as
we just mentioned, I am with Sonar. We
are in the code verification space and
I'm here today to talk about
verification. And I think we're all here
uh in large part because we believe to
some extent that AGI is here. It's
coming. The models we just heard about
Fable, it's really incredible what is
going on in the in the world today. And
yet we work almost exclusively with
enterprises around the world. And the
conversation that we have more is the
question mark version. Is AGI here? And
why are they asking these questions?
It's because you can read the news every
day. And I'm not trying to name and
shame here, but if you look at KPMG
putting out reports that they have to uh
uh retract because of hallucinations, uh
EY doing the same thing, law firms
getting into lots and lots of trouble
because of madeup citations, madeup case
law, things like this. I think we can
really start to question how do we get
value out of AI? The models are amazing
as we just heard, but the hard part as
the other target just said is getting
value out of it.
The struggle is that AI slop is
everywhere. I'm sure you all see this
inside of your organizations. I'm sure
you see this in your everyday life that
AI is amazing. The models are incredible
at generating very plausible output.
They're incredible at generating things
that sound correct, but are they
correct? And how do you know that
they're correct is a big problem. And
it's a big problem in professional
services as we saw. It's a big problem
in legal. But really, I think if we're
honest, it's it's a big problem in every
sector, in every field, whether it's
marketing or finance or you name it. You
have this question of how do you
actually know if it's true? How do you
know if it's good or if it is slob? And
the question that we we deal in the
coding space in particular, we deal with
software development. And the question
we get as we talk to I'm sure many of
the people here in the room and a lot of
our customers is, isn't software
development different?
And we can look at the data on this and
uh the mythos models. Um this is data
from um meter. Uh you may have seen this
MER um the coding agents are getting
better uh very quickly. They're getting
a lot better very quickly. And you can
see uh the progression the exponential
curve here. What this shows on this
chart is how capable are the models at
completing tasks that humans would take.
So can they complete a task that takes 1
hour, 2 hours, whatever it is. is the
latest Mythos model at least per the
benchmarking which was done a month or
so ago in the preview mode was you're
getting to 16 to 18 hours. So they're
actually able the agents are able to
complete longunning tasks and it really
is starting to transform how work is
happening. But the critical caveat when
you read the data is this is at a 50%
success rate. Okay. So it is again able
to complete tasks but is it able to
complete tasks correctly is the
question. So if you start looking at
let's dial up the accuracy rate you dial
it up to 80%. And there's still progress
but it is much slower progress. Instead
of 18 hours you're at about 3 and a half
hours or something along these lines.
And by the way this is still at 80%
accuracy. And as I was presenting this
to the CTO of one of my uh large
customers, his response was, "Betaric, I
would still put someone who gave me 80%
accurate information on a performance
review probably, right? This isn't
necessarily enterprisegrade.
The problem is that the models
themselves, and full disclosure, we have
not yet uh done this benchmarking on the
Fable models obviously because they are
just being released. But as you look at
the models, the models are getting
smarter, but they still produce a lot of
problem problematic code. This is
benchmarking that we do. We give the
models a series of over 4,000 problems
and we basically ask it to generate the
response to the problems and then we
analyze both the functional correctness
which is critical and they all do
extremely well on this notion of
functional correctness, right? Um, but
then we look at how complex is the code,
how buggy is the code, how secure is the
code. And what you see with even the
state-of-the-art models is that
complexity is still high. It's actually
quite variable as you can see here. Um,
GPT55 is done particularly well on the
complexity side of things. It still
generates bugs. It doesn't generate
massive amounts of bugs, but it still
generates bugs and it still generates
security issues. So this is the output
of the models that are going into the
agentic workflows. And again, this is
not, you know, I'm at the AI engineer
conference. This is not me saying AI is
fake or or um incorrect, but it is um
trying to address this question of how
do you really get value in a production
setting out of AI? This is a study that
was done in Carnegie Melon uh University
and it looked at what is the actual
productivity benefit that you see from
the use of AI coding agents. And what
you see I think really resonates with a
lot of what I see firsthand in the
market which is you have a initial just
amazing boost of productivity of
velocity in particular. what you see is
a three to 5x boost in productivity or
in in velocity. Um that dissipates in
three months. At the end of three
months, it starts to come back to the
the normal before you were using the
agents. And if you ask why, it is
because of the two pieces in red here
that you start to see there's an
increase in velocity, but there's an
increase in security issues. there's an
increase in maintainability issues.
There's an increase in reliability
issues and there's an increase in
complexity. So essentially you're
building the technical debt as quickly
as you are generating the code or maybe
even more quickly and that creates a
different set of work. it creates a
different bottleneck. And so to us, this
is now the critical question in AI,
which is in a world in which code is
provable. And there's sessions that um
uh I'm actually very much looking
forward to attending about formal
methods and proofs and things like this.
Code is provable, but when you start
dealing with large code bases, software
is not. It's still very complex. It is
still very messy. there's lots of um
dependencies. There's lots of uh
technical debt already in most code
bases. And so this question of
verification is actually key. And what
I'm going to be arguing is that you can
treat verification as an afterthought or
you can bake verification into the
process. And if you bake it into the
process of generating code, of doing
software development, you can actually
start to get materially better outcomes
from the coding agents than if you view
it as an afterthought, if you view it as
just the old school code review.
So as we've been thinking through this,
we basically have constructed a
framework and there's lots of competing
frameworks around this, but I'll just
talk you through uh ours. We call it the
agent centric development cycle. for
shortand we call it AC/DC sometimes and
the idea here is how do you get
verification powered to Gentic loops at
the center's a lot of focus on the code
generation piece like how do you
actually get the models and the agents
to generate the code that you need to
solve the problem and what we argue is
that you should surround this with the
right disciplines the right tools the
right processes to do three things to
guide the agents and tar is talking a
lot about different aspects of this
actually. Guide the agents, verify the
outcomes and then solve the problems.
And you have to make this part of the
discipline, part of the process, part of
the new software development life cycle
if you want to be successful in the AI
world. So if I double click on some of
these pieces, what do we mean by guide?
We've done a lot of experimenting around
guide. We've just launched a product um
yesterday I think called sonar vortex
that starts to get into this area. What
we find is critically important is to
think about guide as context and
constraints and we separate out context
and constraints very deliberately
because context is you have your code
repositories. How do we make it easier
for the agents to understand for the
models to understand what is in your
codebase? If you have a million lines of
code, if you have a hundred million
lines of code, you have a billion lines
of code, the agents work better if they
understand your codebase. So, how do you
give it architectural awareness? How do
you provide uh semantic navigation uh
maps um and uh and help them understand
the territory to borrow what Tar was
just talking about and we find it
equally valuable and I don't think this
part is talked enough about to provide
the constraints as well. You have
guidelines that you want your code to
follow. You have dependencies you are
okay using. You have dependencies that
you are not okay having. You have coding
standards. You have guardrails. You have
intended architecture. We spend a lot of
time talking about existing
architecture. But what about where you
want to go? And so this idea of context
and constraints uh we've found in our
testing generates a massive improvement
in agent effectiveness and a massive uh
improvement in token consumption. O over
30% reduction in tokens being used to
solve a given problem. And and if you
ask why, it's because you're actually
making the life of the agent easier.
You're helping it navigate better.
So then we get into the heart of this
and we really think of guide as
preemptive verification. How do you make
sure there's less to verify, less to
fix, this sort of thing. Then you get to
the heart of verification and what we
believe quite strongly and what we've
seen work in practice is this idea of
zero trust multi-layered verification.
Zero trust every model has biases. Every
model produces has a character has a
personality. So, let's make sure we use
different models and different
techniques to make sure your code is
safe, to make sure it's reliable, to
make sure it's secure. And multi-layered
really speaks to the earlier point that
software is complex. Software is very
messy. Software has lots of of of
intricacies involved with it. And so
what we believe and again have found to
be quite um impactful here is that a
combination of algorithmic verification
looking at things like data flows,
control flows, known patterns, secrets,
these areas combined with what is now
possible with agentic verification
looking at intent, business logic, the
unknown unknowns. Actually again to
borrow from the last uh presentation the
fusion of these things the the
deliberate multi-layered fabric that you
put in place can actually you can see
the results of this in production. So as
we look at our partners and customers
who use a multi-layered verification
approach they are reporting AI derived
production outages being 44% less
frequent than the ones who do not. So
you can start seeing a material
improvement in reliability, in security
and in maintainability.
And then the last point I mentioned is
technical debt does explode. Right? As
you generate code, technical debt is
also generated. And again, this is not
stop doing it. This is be aware and
let's start controlling it. And so what
we um have seen be super effective is to
have an active process to have an active
discipline again around code maintenance
and thinking about how you do verified
code maintenance. Um I won't walk
through every step of this but a the
agents whether that is a set of
remediation agents whether it's a strong
discipline around verification does keep
your codebase clean
and a lot of people have asked me all
right but do agents care about clean
code human developers care about clean
code do agents care about clean code and
what we find again is they absolutely do
because the agents have to understand
the codebase if they're going to operate
on it so this is is a oneshot view. Um
we think this is something that
compounds. But if you just do the exact
same agentic tasks on a typical codebase
and then one that has been cleaned, you
see a material reduction in the amount
of tokens, reasoning, energy, etc.
needed for those cleaner uh code bases
versus the typical code bases. Right? If
you make the life of the of the agent
easier, if you maintain your codebase,
then you'll actually see compounding
effects.
Now the important thing in our mind is
to construct the system. And this is how
I started is saying, you know, I'm sure
all of us do code reviews, you may use
static analysis tools, you may use AI
code uh review tools, a whole range of
things. And we believe that you have to
put this in a system. And again, uh,
we're happy to in our booth downstairs
talk through what this looks like, but
we really believe that the construction
of the software development life cycle
in an AI world um, needs to embed this
notion of guide, verify, and solve
inside of it. And you need to do it in
three loops. And you need to think about
these three loops. There's the agentic
loop, which I think is the key buzzword
of the conference. Um now but how do you
provide the agents as it's generating
the code as it's doing the work with the
context and constraints
with the inloop verification so that the
agent is getting verification as it's
working and how do you fix problems
that's that's the blue loop here what we
what we talk about is the inner loop
verification piece there's a second
which is your continuous improvement
process and how do you really combine
the power of algorithmic and agentic to
generate your your pull request, review
the code and by the way the velocity of
this has to go up massively. So to
review the code using agents and to this
multi-layered verification and then you
have your evals and I think the opening
speaker talked about how eval may be the
buzzword of the um conference. You have
your evals and you have your quality
gates to check are you actually passing.
So you have your your code maintenance
loop, agentic loop, CI verification loop
and deliberate design of these loops
with verification at the center is a
compounding system. It's a system that
reinforces itself and it reinforces
itself in the positive and it reinforces
itself in the negative. And we've seen
customers who uh have kind of neglected
as they've rolled out AI coding tools,
they've neglected verification. they've
neglected this idea of code quality, of
code um maintenance, things like that,
and you get into a downward spiral
pretty quickly. This is what the
Carnegie Melon uh case study uh or study
actually shows is that you actually have
all the benefits start to dissipate or
you can get into this self-reinforcing
loop. And one of the tests we did with
one of the large banks who are using
some of the cutting edge the folks who
are all around here today um cutting
edge agentic coding tools they can get a
92% reduction in issues if you actually
take this guide verify solve approach
inside of those agentic loops. If again
this compounds it's not that each loop
is 92% better. it's that as you go
through solving the problem over minutes
and hours that you actually see a
compounding benefits.
So that is uh essentially how we see the
benefit here. The how we see the
controlled um value creating use of AI
in enterprise settings. And when I say
enterprises, people with existing code
bases, people with with you know
millions of lines of code already.
There's the agentic loop, there's a CI
verification loop, there's the code
maintenance loop. I am required by my
marketing team to put up a version of
this that has our products on here. So
these are our products and you can come
and see us later. But the most important
thing is really to say our
recommendation is this agent the ACDC
agentcentric development cycle. The core
part is deliberate verification built
into the system. So if you'd like to
learn more um we have a booth that's the
big red booth downstairs. We'd love to
talk more. We have some doubleclick
sessions coming up. So please do uh join
those and uh have a great conference.
Thank you all.
Joining us on stage is a member of
technical staff at Amazon AGI lab onjab.
Good morning. It's so great to be back
here at the AI Engineer Worlds Fair.
Just a year ago, the hard problem was
getting an agent to find a button and
click it on a screen, especially screens
it had never seen before.
Now agents can drive browsers and
they're starting to also drive desktop
apps.
But what we figured out click clicking
was actually the easy part.
What we didn't solve is the actual work.
And what do I mean with this? Let's take
a very simple example. A new team member
starts on Monday. And maybe your job is
to set up their accounts, add them to
your Slack channel,
book intros with colleagues,
order the laptops, etc.
And nobody really owns this end to end
process in the company and it might be
also touching five different systems.
Now, agents can most likely perform each
single individual individual step of
this workflow,
but agents still struggle to do this end
to end because the real work lives
within the seams of all of those
different applications, of all of those
different steps you have to take. And
this is mostly where it all falls apart.
The agent can use every single tool you
give it, but it still can't do the full
work.
So why do we see this gap?
Think about for a minute what we
actually built.
We taught computers to use computers.
So what do I mean with this? We started
building out the basics. We taught them
clicking, scrolling, typing, calling an
API, filling out a form, and we got
those steps, these steps really
reliable, and you can string them
together in a workflow. And agents these
days are fairly good at like operating
those workflows.
So, why can't you not just hand them
more of your work and then literally
just walk away and trust it to be
completed?
So all the things I talked about like
using a tool models itself,
tool use, stringing agents together,
this is all capabilities
and we mostly figured out how to add
capabilities to models.
Now the next hard part is really
reliability
and without reliability we cannot really
build up trust in those systems.
So here's a quick gut check and maybe
all of you can just think about an agent
doing work in an end to end workflow.
How often do you think that actually
succeeds these days? Maybe 60 maybe 80%
of the time.
And it sounds really fine, but if you
look into this,
if your agent
one in four times deletes a database,
you will never touch that agent again,
right?
So
when you need this reliability, you
really need to be it in the nines. You
need to have the trust that it actually
can do the work successfully.
Now, there's actually one place
where we made enormous progress on
reliability and trust and this is
coding, right? Think about how fast
coding evolved.
I still remember the first time when it
started autocompleting for you, right?
You just tapped autocomplete. Amazing.
Then short time later, it started to
write functions. And we thought that is
amazing.
And now look at these days. Coding
agents write the code. They open up the
pull requests themselves. And we had it
earlier this week. Code keeps flying by.
So once in a time we were able to just
every single line that it generated we
felt like the urge we need to really
read it and make sure it's correct right
I think most in the audience here can
still relate to that
these days I think hardly anyone is
still doing that like we cannot even do
that right code is generated at such a
pace
at the same time coding made that jump
so why is that because we were able to
bring it from just being capable the
coding agents to actually be reliable
and then trusted.
So why is that? Why was coding first
solved?
It's because code is verifiable.
You can run it, you can test it, you can
check it and you can be for sure that it
worked.
So reliability showed up in the first
place you can actually verify the
answer.
But here's the catch.
Most of the work we do if you look at
the broader knowledge work areas is not
like that.
Knowledge work is messy and heck the
whole real world is really messy.
Did the report I created land? Is the
design on brand?
Did it get it what I actually meant? So
there is no unit test that can answer
those questions.
So verification really hits the wall
right where most of our work lives.
It's living in the seams of all of those
applications we're using on a day-by-day
basis.
And nobody really has cracked this part
yet.
How do you make an agent reliable when
there's no way to verify the answer that
easily? And that's a field that is still
wide open.
So, how can we solve this?
Well, so how do humans handle messy
work? I mean, we're successful at it,
right? Each of us like every day we work
across different systems. We manage out
how to onboard a new colleague. We do
this.
Well, we're doing it by figuring things
out together. You grab a colleague, you
jump on a Zoom meeting, you're
discussing things, you're looking at the
problem to solve, you're discussing p
pointing at systems, and maybe two
minutes later, you solved it. You're
done. But none of this work is actually
directly verifiable.
And we do this all day.
So one of the things is we're looking
mostly at the same screen, right? If
you're jumping on a meeting with a
colleague, you see the same screen, both
of you, and you can actually like figure
out really quickly what needs to be
done.
So this is what the agent these days is
missing.
You don't necessarily need a bigger
brain. What you need is this shared
context. Because if we're looking the
agent and myself at the same screen, I
probably have much less explanating to
do.
So what kind of agent do we really need
to build to achieve this?
And today's agent, as I said, they can
already see a screen, right? and they
can click and take actions in it. That
part works.
But if they fire off actions, what they
usually do, they move on. They don't
watch what happens or recover if one
step didn't succeed or something goes
sideways.
And we need an agent that can actually
work like you do, like humans work.
And one example is robotics. If you just
look for a moment as how robotics do it,
a robot perceives what's around it and
it plans what to do and then acts. So
this loop here from perceiving to
planning to acting, this is actually
what we also would need on a screen.
And it starts here really with the first
word which is perceive.
The agent has to take in the screen the
way you do,
not scrape the code behind the page, but
what's actually rendered, the layout,
the state, what just changed the work,
what we're doing, and then do it.
And it would also have to keep up in
real time. Think about how we as humans
work together.
You jump in, you react to build on top
of what each other you say.
And today agents can still don't do it.
What we're doing is we're sending a
prompt, we're waiting, it goes away and
at one point the agent come back and we
might have to take a couple of turns,
right? Because what the agent come back
with is not exactly what we might want
to do. So we're sending another prompt
say, "Hey, go back, do this, do this
differently." And we have this long back
and forth which we got so used to from
our chatbot experience and from this
rhythm taking turns.
But what we actually would need, think
about it, is an agent that can react
while you're still working. Wouldn't
that be really cool, right? Like at the
same time you're working, it can also
come up with suggestions, can help you,
and there is no waiting time.
So basically an agent that perceives
what you perceive and understands what
you mean.
We call them perception agents.
So why perception agents? Why do they
matter? So first they complete the loop
on computer use.
Today's agents again they can act, they
can click, they can type, they can
scroll,
but what they can't do well is looking
at the results and whether it actually
worked out. A perception agent can read
the rendered screen so it can confirm
its own output instead of just firing
off those actions
and then hoping.
Second, it doesn't need an API or
backend process.
And that's important because it works
off the rendered interface. It sees the
same pixels and the structure you see.
And most of today's software people use
every day don't expose APIs at all.
And then third, the input also goes the
other way here. Instead of writing a
long paragraph to describe what you want
to change, let's say you're working on a
website and you want to describe all the
changes you want to apply. Instead of
writing this really long description,
wouldn't it be great if you can just
point to it and say, "Hey, here this
heading needs to change. Hey, can you
update this section?"
This is a much more precise signal and
less lossy than text.
and the agent can act exactly on what
you marked.
So this is where we started and I'm
happy to share that we just recently
launched the first two pieces of our
perception agent harness
open source.
There's two pieces. There is annotation
which you can use to tell it what you
want.
And then the second piece, the
verification part gives the agent the
capability to check its own work.
So let me show you the first one. So
here's a very quick demo on our
annotation tool.
This one is a Chrome extension, so it's
super easy to use. And I'm going to play
here this quick video demo.
So you have the extension installed and
then you can just select different
elements on a screen. So this example,
we're just drawing around the heading
there, marking the section. And maybe
you want to change it. Why not? Let's
change it to red.
You could also select the elements on
this page. You see how if I hover over
it, finds the right element. You click
it, you select it, and say something
maybe double the font size. And you see
also how the agent here captures on the
screen exactly the feedback, the
location, the style elements and it
creates this complete summary which you
can then use and then give your agent to
implement. So there is no back and forth
anymore because you captured exactly
what you saw on screen and the agent can
see the same thing.
Now let's have a very brief look at the
second one at verification.
So the idea of verification is that you
can describe let's stay in this case of
the web development. You can describe in
a design MD file what your design rules
are for this.
And then what happens if I play this
video here, the act the agent can
actually check its own work against
those design specs.
So it will take what you defined, the
colors, the components, your layout, and
it turns it into those rules if you
don't have it written before yet. And it
does two kinds of checks. Then it does a
visual check, which is really cool. So
everything is on brand, for example.
it's the right layout.
The other part is also checking user
flows. So what it does there, it
actually walks through this experience
through the app for example depending on
the tasks available. It might add a
task, it might delete a task like a real
user would. So it helps you walk through
those user flows as well in an automated
fashion. And then once it's done, it's
writing a report which you can review
and it's going to call out which tests
passed and it's going to tell you
anything that didn't.
So ultimately, you're the one that
doesn't have to click through this at
midnight at the end of the day because
great work. The agent already did this
job for you.
Now there might not always be a screen,
right? So I talked a lot right now. I
called it perception. I talked about the
agent sees what you see on a screen.
But there are times in your day where
you don't have a screen. Maybe you're in
the office. You're walking into a
meeting with a colleague.
So I did a fun experiment yesterday
at the conference here. So I grabbed my
colleague Giovanni who is also here and
actually on the second floor there's a
great like little meeting booth. We
found that by coincidence. So we went in
there and we had our design meeting. And
the goal here is really kind of show you
how perception is so much more than just
the visual part. So in this example,
what we want to show you is perception
can also be listening in the room to
what you're discussing.
And you can see here on the picture,
both of us are wearing our B devices.
Big shout out to B for sponsoring these.
Um, so we're sitting there. We have our
B devices that can do a transcript.
They're listening to what we're saying.
And then we have this design meeting.
And I had a couple of great ideas how to
change this website. Um, you will see
them in a in a second here. So let's
have a quick look how this changed the
same workflow on this website using this
device.
So we had the discussion the be did the
transcript and you can see here on the
right we're pulling this meeting
transcript right in there is a whole
detailed summary of the meeting.
There is
what we discussed and then it basically
captures those insights. We have them
right here and we can click apply. So
what this apply button does is it sends
it straight to the agent. And you can
see here my crazy ideas to turn the
background to yellow, turn the heading
to red, and also change an emoji
directly applied. And it also straight
kicks off the verification right away.
So it creates this report and and
luckily this color scheme was apparently
into in the approved rules otherwise
this would have looked like you did some
weird things here. But again you could
change those rules if you don't want to
have yellow backgrounds and it will make
sure um that we still adhere to those
guidelines. It would flag anything
that's off. So you have the judgment
call if you want to either update the
design specs because you actually like
yellow or you take an action and say no
um fix this violation.
But this is really the very first step.
These two pieces are the very first
beginning and we're building out the
rest in the open because these patterns
can only get better if more people are
using them, building on top of them,
breaking things. So my ask here to you
is go and try them out. They're on our
GitHub repos.
Tell us what we're missing. Give us the
feedback what you would like to see
where this should go next. because
ultimately none of us get smart alone
and that's the whole point. We want to
build AI that makes all of us smarter
together.
Now, if you're interested in a little
bit more on human agent interactions and
how we see those patterns changing, I
would highly recommend this podcast by
my colleague Danielle Persik. She is a
cognitive scientist and runs our AGI ACI
team at the lab and discusses a lot
about human computer interaction
patterns with experts in the industry.
You can find the podcast on on a popular
podcast platform.
We also have more sessions this week. Um
so check them out. We have a booth down
there. We have expert talks. We also
have another computer use track talk
coming up with my colleague Gav Mishra
at 1:30 in the computer use track.
Highly recommend checking out his talk
from RL to IRL.
And then ultimately come find us. We
have a huge presence down at the expo
hall. We would love to continue the
conversation with you all. If you're not
here in person, you can also check out
our code on our GitHub repo and check
out our website.
And with that, thank you very much.
Please welcome to the stage the vice
president of research at Google
DeepMind, Benois Schillings.
All right, good morning. Uh this is
really quite exciting to be here and
have a chance to to speak with all of
you. Uh my name is Benois Shellings. I'm
actually a bit of a noob when it comes
to to machine learning. Till a year and
a half ago, I was working for Google X
which some of you may know. We've done
things like Whimo which seems to be at
every street corner now. Uh we also do
things like Glass. So you know we we had
a mix of hit and success but in many
ways this was for me an interesting
formative experience on how to run a
research team in a place like deep mind.
I do have an incredible team. Uh my team
goal in deep mind is basically to
develop whatever technology will be
needed to make Gemini incredible between
one month and one year from now. So one
month because if you start to work on
what is needed in one week, that's a
very different type of job and one year
because I don't think anybody can really
predict anything that far. So that's
already pretty ambitious in my opinion
to think about things that would happen
one year in the future.
We do many things under that role. Uh a
lot of it is related to code which will
be the main subject of my talk today. uh
but we also do a lot of research on what
is the evolution of reasoning for models
for instance or we do topology research
what are new type of network that might
bring better performance uh we do
fundamental work in the science of
reinforcement learning which is so
fundamental to what we're doing today
with ML
let's do a bit of an origin story Um,
we started the project at X named
Pitchfork in 2018
which was aimed at looking at how ML
could really improve the way code is
being written. And this was very
interesting because in 2018 when we
presented that at Google
honestly nobody would give us the time
of day. uh there was that point like why
would you ever need ML to to write code?
Um at the same time I think that we
totally underestimated how fast this
could go. When we did that project
originally the idea was to look at how
we could speed up the evolution of a
piece of code. How could we make many of
those small changes which slows down
code speed development? you know the
small edit which requires a review that
takes three days and how we could
compress that cycle.
Some people were talking about vibe
coding writing code in English and at
the time honestly I totally dismissed
that I was that's why we have
programming language English is not a
programming language. Well, I I I guess
I was pretty wrong on that front, but
the resistance we felt at the time
reminded me of how my own career was
pretty resistive to to change. Um, I've
been writing code for
45 years. Uh, I started by writing video
game for Apple 2 and Commodore 64. So,
uh, my formation was to write assembly
language. And when you spend a long time
writing assembly language, you look at
compilers with a lot of suspicion,
right? Are those things really working
correctly? And then when you switch to
C++ and use compiler, you lose you look
at garbage collected languages as this h
that's not real programming. You need to
manage your memory. Well, today I use
Python and VIP coding. So even old dogs
can learn new tricks. So uh but but I I
I do understand what happened there.
I think that we have a number of eras in
what happened with software and and the
first one was you know the one where I
started writing code where the
fundamental limit was really the machine
and there was a lot of work to go and
extract the last ounce of power out of
those machine and that was the days of
assembly language where you really
needed to be incredibly accurate in the
way you were writing code computing
became much cheaper and we switched to
the modern cloud era where getting the
best performance is not the most
critical aspect. You can actually brute
force many problems but really what
became the limiting factor was the
ability for us to design in a modular
way. You know this was the era where
software was write it only once and this
was this whole idea of how are you going
to build libraries? How are you going to
write functions? How are you going to
break down that problem into something
that is long-term manageable?
The limitation there and that determine
a lot of how our software process are
working where actually the human brain.
Uh a traditional human typical human is
able to get the context between seven
and nine tokens. I mean we have very
rich tokens but you compare that to
modern ML where the context is basically
going to be infinite pretty soon. uh
that fundamental limitation of human
determined a lot of how software was
being written. This is over and we're
switching now to that AI frontier where
really writing the code is not the
challenge anymore. uh I'll speak some
more about it but the bottlenecks are
really how do you ensure that that code
is what you really wanted because
writing the code is easy but getting
what is needed for a specific problem
can be much harder to to specify
so humans at least in the near future
will be that role of architecture or
thinking of what are really the
implication of that piece of code I'm
getting the ML to to design inductive
thinking is another category where I
think Humans still have a a very clear
edge which is to look at a system in a
much wider context and to be able to
detect patterns and from those pattern
take some decision.
So where are we today? U superhuman
syntax generation.
When is the last time I built Gemini to
write a function for me and I looked at
the function and I was like I can do
that better.
It's over. uh I think that the minutia
of code writing I mean you can fight you
can argue you can find counter example
but that time is is gone where we still
have a lot of work to do is multi-step
code base uh software engineering is not
about writing code software engineering
is the first time you join a company and
you realize that there are 35 million
lines of PHP in the codebase and that
you need to make some changes that
that's the day you understand what
software engineering is and that's a
place where our models today or frontier
models are progressing but this ability
to manage that extreme complexity and
break it down into man manageable pieces
is a place where the frontier is still
moving
um it goes all the way to architecture
you look at I don't know the Google
architecture
thanks god we have Jeff Dean which was
you know the the key architect there but
that's the level of thinking which has
many implication which can go from how
do you do hardware optimization how do
you manage security how do you build a
system so that 10 years later you're not
full of regrets and I think this is
really the the range of progress we are
working on today so code is over but
there's plenty to do there's plenty of
progress to be made
now code is a very unique problem and in
some way that's the reason We we did
pitchfork on this. Um
first of all, code is a lot of data.
There are other domains where you can
find a lot of data to train your model,
but code was so incredible. You could go
and go on GitHub and start to to scrape
GitHub. So this was one of those problem
where the amount of training data was a
very unique situation.
It is also a domain where doing
verification is reasonable. You can run
a piece of code, you can compile it, you
can have unit test. So the ability to
figure out is the model generating
something correct was something that was
pretty reasonable to do. That brought us
where we are today. But today what
happened is that we ran out of training
data. I think that 80% of the new code
added to GitHub today is machine
generated. So the notion of human
bringing some knowledge that can be used
for mining and to train model is
reaching an end. But the good news is
that we can do selfplay and selfplay is
something we always liked a lot at deep
mind. I suppose all know alpha zero.
Alpha zero became a superhuman go and
chess player without any human knowledge
just by playing against itself. We are
now at that stage where frontier model
for code are able to do the same where
they can create their own challenge.
They can judge the validity of the
answer. They can even to some extent
judge the architecture. So that ability
to do those hundreds of millions of
hours of selfplay writing code is the
thing that will bring us to the to the
next layer. You know it's interesting.
Um
do the experiment. Take a a brilliant
software engineer, lock him in a room,
lock him or her in a room for two years
and feed pizza and give the mission you
need to become a better software
engineer. What do you do as a person?
You you give yourself some challenges.
Challenges that you can verify and you
keep working and coding on those
challenges. We can do the same here. So
this is an issue of how much compute,
how much selfplay time we can have, but
that will bring the horizon of how far
we go in superhuman coding.
So the economics of code are changing
dramatically. You know, as I say, we
developed a whole software engineering
culture and infrastructure and set of
companies based on the assumption that
writing code was the hard part, that
this was the expensive part. We're now
in a world where writing code is free or
nearly free. That's why I've got the
tilda there.
That means that the amount of code that
we're going to see produced is going to
explode. And there are some hard
implications to that. First is the
question of design and adequacy. How in
front of that mountain of code which
will be written or written dynamically,
how do we keep systems which works and
are reliable at the microscopic level?
Great role for human.
It is also the issue that you know we're
writing code and we're not reading it
very much anymore. I mean I know we
still have code review but I would
predict that in one year we'll let
Gemini or other model generate the code
and nobody will actually look at it. You
know it's similar to compilers who still
check the assembly output of their
compiler and maybe someone there but
that's probably the end of it. So the
same thing is going to happen to code
and that brings some question of what
are the new process that we need to put
in place to keep that manageable
and that's where I've got a a bit of a
list active guard rails. I mean you've
all seen the news of mythos looking at a
piece of code and detecting a
unreasonable number of vulnerability in
that code.
there is a rush to go and patch those
vulnerability but I think that's going
to be a never ending process you know
we're going to get a certain layer of
vulnerability discovered by models we're
going to fix those models will get
smarter they will go a bit deeper and
find even more subtle vulnerability so I
think that the first aspect is that we
need to think at least as much about
code security and the implication of a
piece of code than on the code writing
itself and the grail and you know
something my team is working actively on
is instead of detecting the
vulnerability and then suggesting some
fix how about teaching model to write
correct things from the start
and that is very very hard to do because
it is very context dependent
the other aspect is that you know that's
what I call inductive architecture
uh I think that models today are still
not very good at transferring knowledge
of taking knowledge from one domain and
applying it to another one or taking two
concepts and finding the intersection of
those context to be those context to be
able to do deductive thinking. If we
really want to write those very complex
software system using ML that is a skill
that we need to teach and you know one
aspect of that is to really teach models
how to do correct planning in front of a
problem. How do you look at a very
complex problem and decide what is the
right decomposition of that problem that
will bring the best clarity or
correctness to the to the problem.
We also need to change the way we do
evaluation. I mean u threebench is
infamous in in my book because
threebench verifies if a piece of code
runs and produce the right output.
That's only a small part of as I
mentioned earlier of code engineering.
So for instance, I think that we need
some problems much more in those
benchmarks that we use which are
open-ended problem. I I'll give an
example. Uh I love the question of text
compression. How many bits per character
do you need and how far can you go? So
that's a very simple eval to to write.
You just take a piece of 10 megabyte of
code and you tell the model write the
best compressor you can that is lossless
and the loss function in that case will
be you know the size of the compressed
file plus the size of the source code
that's never ending I mean those
problems are I think what's going to
force those model to do novel things
like creating totally new algorithmic
for instance and I I think we're now
getting to that stage
Writing code or doing software
engineering is not thinking as a chain
of tokens.
Thinking and reasoning today is chain of
toad which has been you know very
successful and improve models a lot. But
humans of course are much more complex
in the way they think about problems. I
always think that code writing is a very
visual activity and that can be I don't
know the block diagram of what you're
doing or the flow of data through your
code. uh but saying that code will be
just a set of token that you emit that
are going to be the code I think goes
only up to a certain point that's a very
interesting aspect to what we do at
Google Gemini we made that choice from
the onset that this would be a
multimodal model that you know text was
only one of the modality that Gemini
would be able to apply and we're
starting to see you know how can a model
start to think in term of spatial or
dynamic representation to to solve
problem and I think that's going to
become a must have
another interesting question is is this
time to create a new language for models
Python you name it have been invented
for humans and those language are not
very good to write safe or reliable code
I mean they're great to write code but
they're certainly not the the best thing
I think We're getting to the point where
since the pain of writing the code does
not exist anymore. How about we make
writing the code much harder by having
you know very strongly typed languages
or you know some inspiration from lean
on how to write code that by design it's
not going to be perfect. I mean program
is something which has some limits but
at least putting the burden of
correctness on the model. So I don't
know if we have some language designers
here but I I I think there's something
really to be done there and it doesn't
need to be human readable. I I don't
think that that will matter anymore.
So beyond code um code is a universal
language to solve problems. I think that
what we're starting to see is this
ability to experiment very quickly in
code is impacting other domain very
quickly because doing experiment becomes
basically free. So I think that looking
at that intersection of code writing and
atoms or science is another big front
that we are opening that is the place
where true novelty is going to appear.
two which are especially exciting for me
is chemistry. Um you know as humans we
do not understand chemistry or we
understand a very very small sliver of
chemistry. Once you have more than 20
atoms in your molecule it's like wow we
don't know what that thing is going to
do. I think we're going to see
incredible things emerging out of that.
I mean once you are able to put 10,000
atom together that starts to look like
life. So what are all the other things
you can do with 10,000 atoms?
Biology. You probably heard plenty about
it, but you know, biology is the case of
nature did an incredible engineering job
and terrible job at documentation.
Um, but we can crack through that now.
Models are able to find those
relationship that might be elusive for
us. So I think that that is something
that will open incredible door. And then
there is what I call the gold we cannot
see. Humans are incredibly biased in
what we feel is the correct solution. I
mean, we're the result of an
evolutionary training that help us
survive in the jungle, right? Not doing
quantum computing. So, I think that even
though we can be brilliant and
innovative, there are a whole bunch of
progress and breakthrough that can be
done which we just cannot see or
perceive. If I had more time, I would
give some examples. I think that's one
of the thing where ML is such a
different viewpoint on many of those
problems that we're going to get the oh
my god this was in front of us the whole
time and we could not see it. So
exciting times ahead. Thank you very
much.
Ladies and gentlemen, as we continue
today's program, please welcome back
your MC developer advocate at IBM, Tea
Scamar.
What an incredible start to the day.
Woo! Everybody's leaving. This looks
amazing from here. Um
before we break off uh or after um let's
take a moment and acknowledge the
sponsors. Honestly, this would not be
possible without them. We're going to
get the slides up. Listen, you need to
give them your biggest round of
applause. I mean, it is so cool. Thank
you. Thank you. Thank you. Thank you,
Microsoft. Thank you to all the other
sponsors here. This event would not be
possible without them. There's plenty of
other things happening um in the other
stages, but there's no doubt that um
evals are a huge deal in AI. In fact,
they're the gate of quality, right? Um
we can ship a lot of things, but if
they're not eval, we ship a lot of slop.
And so, uh our next discussion, our next
session is going to be from Aparna
Dinakan from Arise, who's going to talk
to us a little bit about EVAS. Please,
your biggest round of applause for
Aparna.
Please welcome to the stage co-founder
and chief product officer at Arise
Aparna Dinakaran.
Hey everyone, can you all hear me? All
right, let's go. Oh, let me go one back
here. Awesome. Well, hey everyone. My
name is Aperna, one of the founders of
Arise. We work with some amazing teams
to help them build evals. Um, and we
have an incredible lineup of talks for
you all today at the Evals track. Um,
it's happening in room 2005
and there's going to be amazing speakers
from Turnbench and Uber and Snorkel kind
of all happening after this. Um, but
today I'm here to talk to you about the
future of evals. Evals have gone from
the new skill that every PM and every AI
engineer has to learn to the thing that
every serious AI team is betting on.
We've been really fortunate to get to
work with some of the best AI teams in
the world. So we get a front row seat
into not just what's happening when
they're building their actual agents and
before they actually ship, but actually
the eval teams are running on their live
production agent via their traces.
Little bit of some stats for you guys.
We run over a 100 million evals every
month. The average team runs about 12
different eval jobs with the top teams
running over 3,800 different evaluators.
And offline evals, online eval, they
each have their own place. But today,
what I'm actually going to talk to you
about is the teams that are running
evals on their traces. This is actually
what's helping teams figure out what's
working, catch their failures, and
that's the type of data you need to fuel
your continual learning loops.
And the industry kind of agrees. I mean
all the CPOS of Anthropic, OpenAI, all
you know GDB, you have Gary Tan saying
eval are everything you need. And the
whole industry kind of agrees. So we
added evals. They catch all the
failures. Right?
Here's the problem. While we were
building all of these firstgen evals,
the thing that we were actually
evaluating has changed underneath us. In
2023, it was about just answering a
prompt. In 2024, we started to see all
different frontier models. They've added
tool calls. They've added reasoning.
They've added deep research. Now, what
we have is teams running loops on real
world data with sub agents kicked off on
um long horizon tasks. Every one of
these was actually a massive jump in
complexity. And we didn't just make the
problem harder. we actually got a
fundamentally different type of problem.
What that meant is that as these systems
got more complex, so did the way that
they actually fail. We're really lucky
because we have our own agent that we've
built, Alex, that lives in our UI and we
get a kind of get to feel this pain
ourselves. Every time the Frontier Labs
added new functionality, we added it to
our agent. And now Alex can has much
longer memory. It has the ability to
create dynamic UIs. it can go search
across an enormous volume of traces. But
we also realized that it would forget
context. It wouldn't know when something
was done. Um, sometimes it would just
get stuck in these loops. And the key
thing here is that the classical LLM as
a judge evals that probably many of you
have written in this room just weren't
enough for us to be able to catch all
the types of failures that we were
experiencing. I mean, it's just
fundamentally different, right? you have
a deterministic flow and now what we
have is literally every time a user
interacted with Alex it would create a
new UI that's a fundamentally different
trajectory
so this led to our really big revelation
what if the best way to evaluate an
agent was actually with an agent
doesn't mean that all of the ways that
we did eval with deterministic evals
with LLM as a judge classic eval doesn't
matter anymore but it just means that we
have a different type of tool to solve a
different type of problem. Agent as a
judge is about adaptive dynamic
analysis. LM as a judge just gives you a
fixed rubric with these fixed scores.
It's what everyone's doing. But when
your agent's doing completely different
trajectories every time a user puts in
data, it just means that you need a
fundamentally different type of eval.
My take is that most teams today are
doing the first two, but the future of
eval is actually having all three.
And today I'm actually excited to share
we've released agent as a judge um to
help our teams on their eval journey.
We've released signal. Signal is
actually a longunning agent that can
read traces sent in discover patterns of
issues. Um, it can figure out types of
problems that a classical LLM as a judge
eval just would never be able to do with
these deterministic rubrics. It's helped
us figure out um very subtle failures
that you wouldn't even think of doing
such as something going on in a loop for
multiple times. It was calling the same
tool uh for repeatedly long time. The
trajectory was inefficient. And actually
what this does is because it has all
that analysis, it can go put up a PR and
put up a fix. So, if you want to learn
more, come to our come to our booth.
We're right by the OpenAI booth. We'll
give you a demo. We'll show you a bit
more about it. Um, we're also, like I
said, taking over the eval track. So,
come to room 205. We're going to be
talking a lot about the future of evals
and what they look like. And if you just
want to hang out with our team, we're
throwing a viewing party for the USA
World Cup uh game tonight. So, uh check
out the Luma and register to come join
us. Awesome. Thank you all so much.
story of how this all kind of came to
be. Uh we're going to talk about OGs uh
big bet on effect uh a little bit into
our core agent loop. Uh we're going to
talk about the A2A protocol, eval.
We're going to talk about how we manage
long context.
Hi everyone, my name is Gabe Dees Mesa.
I'm an engineer here at OpenGV and today
we're going to be talking about agents
in production, specifically how OpenGV
built and scaled OG Assist. Uh so um
this presentation is going to be
jam-packed with just so much good stuff.
Uh we're going to talk about uh AI
agents. We're going to talk about our
harness. We're going to talk about um
eval observability traces. We're going
to talk about um tools and skills. Um
it's there's going to be a lot of good
stuff in here. We're going to talk to
you guys about uh what we do at OpenGV
and how we operate at the scale that uh
we operate at um in production. So
you'll be able to see a real use case
and workload uh with AI agents. Um so
without further ado, let's get started.
Okay, agenda. So just really quickly
going to go through uh high level what
we're going to talk about today. Uh I'm
going to tell you guys a little bit
about OG Assist and what uh Open Gov is.
I'm going to tell you guys the origin
story of how this all kind of came to
be. Uh we're going to talk about OG
Assist's uh big bet on effect uh a
little bit into our core agent loop. Uh
we're going to talk about the A2A
protocol, eval.
We're going to talk about how we manage
long context. We're going to talk about
um monitoring observability, how we
collect feedback uh and how we iterate
on that feedback. We're gonna lastly uh
also talk about tools and skills and how
at open gov uh we use um AI not only
externally uh that we uh serve to
customers but also internally to improve
our development workflows.
Just a little bit about me before we go
any further. My name is Gabe. I'm a
software engineer here at OpenGV. I work
on the AI agents team and uh I'm one of
the folks that helped build uh OG Assist
and some of the systems that you guys
will be seeing today.
So, a little bit about OpenGV. OpenGV is
a software company uh on a mission to
power more effective and accountable
government. Um so, OpenGV sells ERP
software. That's things like budgeting,
procurement, asset management, and
permitting. And um we were founded about
14 years ago. And what's cool is um we
have this thing called OG Assist. And OG
Assist is this little button on the top
of all of our products in the in the
navigation bar. And what's cool is um
all of our product suites and product
teams um have built tools and skills in
order to power this button. So, for
example, if I open up uh this this um if
I click this button and I open up OG
Assist, it says, "Hey, um I'm going to
ask about rate codes, which is very
specific to utility billing, the current
product that I'm in." And you can see
that inside of this kind of chat
interface, I'm able to speak to an
agent, and the agent is able to make
tool calls in order to um look up
information against data inside of that
suite. So, it's really cool um to be
able to kind of first party create these
experiences uh through the capability
that we've built called OG Assist.
Okay, so just a quick story about how
this all came to be. So, um, a little
while back, we we we saw that AI was
really starting to take off and a
principal, uh, spun up this new team
called the AI agents team and asked me
to join and, um, instantly I said yes
and OG Assist started to to grow and we
started to integrate, uh, OG Assist into
all our products and, uh, not only our
back-end capabilities, but also our
front-end capabilities as well. So,
you'll see that one of the capabilities
that we give the agent is it's able to
um see what's on the screen and and see
and and and take action on what's on the
page. So, you could see that um I'm
asking the agent here, hey, hey, what's
on the screen? Can you maybe highlight
uh some of the next steps that I could
take? So, you can see that the agent
here is thinking. It's saying, okay,
what tools do I have available to use?
And hey, let me go and highlight
something that you could actually click
on and and tell you more about it. So
just another capability of OG assist and
just a little short story about how this
all came to be.
So the big bet on effect. Um so I really
wanted to include this slide because um
here on the agents team we made a huge
bet to um to to bet on effect and
suffice to say it's paid off in
dividends. Um we write effect. So effect
is this library for typescript. It's
open source and it helps you write
better um typescript code. uh you know
it's got a lot of uh stuff baked in it
like a sk a schema similar to like ZOD
if you've ever used that. It's also got
um things for error handling uh for
logging for traces for uh it's just got
so much in there. It really helps write
better code and structure your code
better and helps with architecture,
spinning up new services for uh and and
for us on the agents team really helping
uh design and build the the core agent
loop. So you'll see throughout this
presentation sprinkled in um how effect
on our team uh has paid off in
dividends. So we we really love effect
here at open gov and we encourage other
folks to try it out and um yeah let's
keep going
the effect native loop. So originally we
were on lang graph and that was fine
until the team really started to scale
uh and our use cases started to evolve.
So we decided to move over to our own
kind of effect native agent loop to have
full regency over this uh agent loop
such that if we have complex use cases
or features that we need to build we
could kind of get in we we had full
control of the of the agent loop. And
not only that but now we're fully on
effects. So all the cool things you get
with effect is now propagated throughout
the entire agent loop like the tracing
structured concurrency, the logging,
everything is more fine graining control
and it it really allows us to really
unlock the full potential uh having our
own agent loop from the ground up. Um so
another thing I wanted to mention is on
the left side you'll see a code example.
This is really the basics of the effect
loop that we're using. Uh we're using
this thing called the effect AI package.
And in that package, there's this thing
called um there's a chat and a language
model. So with the chat, you can
instantiate like an a chat for example.
And then you could stream text using um
that that kind of stream text function.
You could pass in a prompt. And what's
cool is uh with a language model under
the hood of since we're kind of doing
dependency injection, we could pass in a
different language model if we were to
uh hot swap to another one for example.
So really just having full control of
our own agent loop just kind of gives us
all the levers and it really just
unlocks the full capabilities of the
model and uh for the team as well to
have full agency over this this loop.
Another thing I wanted to mention is the
agentto agent protocol. So here on the
agents team, we've had a lot of success
with this protocol. So this protocol
being the protocol that Google created
um kind of an open protocol for agents
to intercommunicate. But um we found
this very useful for uh defining our
agent routes for example in the back end
and our model and our schema to follow
this kind of uh agent protocol. So we
modeled so for example there's this
thing called an agent card which you see
here and it's got the name of the agent
a description etc right and having this
kind of rigorous protocol this rigorous
spec really helped drive our development
and drive alignment because you know all
we had to do was um align with this spec
and follow this spec and we knew that
this was kind of the contract that our
front end and backend and would both
consume and and produce. So, um this uh
I would say also has been uh very
helpful for us and and what's really
cool is A2A has a lot of extensions,
right? So, you could extend the protocol
uh add in like metadata. Uh there's also
A2I
um so lots of fun stuff uh with A2A
protocol, but uh this is kind of what's
worked for us. So, just sharing that
with with you folks.
feedback and eval. So here the quote is
shipping is the start not the finish. So
what we do here uh on the agency team is
we have kind of multiple ways we do
evals and collect feedback. Um obviously
you know we'll have folks uh call in or
or email us or or just let us know and
tell us but the main way is we have this
thumbs up and thumbs down mechanism. And
here uh someone is able to tell us, hey,
this this worked really well. This was a
great response or that wasn't a great
response. And that signal we take and
we're able to iterate on uh and we could
take it back and help improve uh you
know the response in the future. Um we
also have automated evals. So in in the
in RCI we we have evals that run against
real completion. So we could test the
prompt against hey did it hit some
tools? Did it do what it's supposed to
do? And that also helps with our
accuracy. So, uh those automated evals
in conjunction with collecting feedback
really help us um improve our
our our tools, our skills, um our
harness and and that's really how how
we're able to iterate so fast and so
quickly.
Humans in the loop. So this is a really
cool feature we built where we
deterministically interrupt the agent
loop. If there is a tool call approval
required. So if an agent tries to make a
tool call that it needs human approval
for it'll show this UI and the human uh
can click accept or reject. So
explicitly rejecting or explicitly
accepting uh the action that the agent
is trying to make. And this ensures that
uh you know we're building trust and
also ensuring that uh you know we're
being safe especially when the agent is
trying to do a mutating operation and
always always always making sure that um
humans are in the driver's seat
sandboxing. So, another thing that we uh
worked on um kind of similar to the
safety slide we just saw was um whenever
an agent tries to execute code or tries
to create files, it does so in a
sandbox. So, we gave our agent sort
All
right. All right. Hello everyone. Really
excited to be here. It's a big room.
Very uh very cool conference so far. Uh
I want to talk to you today about
something that's been on my mind for
many many years. This is actually the
first time I I talk about it. Sort of my
version of going to Mars. Um and that is
the Eureka machine. A machine that will
eventually invent pretty much all future
inventions for humanity. Uh and the way
we're going to get there is uh by taking
a step back and thinking about what else
has given us a lot of really incredible
inventions uh namely evolution and how
that leads us to automating research and
pushing the scientific frontier forward.
And this is uh joint work with a lot of
uh amazing folks uh at recursive.com
uh and even some uh folks at AIX
Ventures. And some of these slides are
uh actually inspired by uh and taken uh
partially from one of my co-founders at
recursive Tim Rockel.
So uh why do I talk about evolution and
why is it so important? Uh, I think
basically evolution is this like
open-ended process that has gotten us to
a lot of different things that we really
like. Uh, it started in biology. It's
moving to science, technology, and
eventually I. And I think it can inspire
us in a lot of different ways to build
better AI systems as well. In fact, uh,
whenever we take out and there's this
famous saying, whenever I fire a
linguist, my accuracy goes up. Uh I
think that's true for machine
translation back in the day. And it may
be true that we should fire all the AI
engineers uh and that that are here uh
and have them mostly manage an actual AI
engineer that is AI and works on AI. Uh
and so that may be uh one of the
conclusions of this talk. Uh, and I
think most of us are going to be excited
about it because it means that we'll all
become managers of such an AI rather
than having to do the nitty-gritty
ourselves. All right, so let's start
with evolution, right? The really really
big picture, three and a half billion
years or so. Uh this is kind of the
incredible process uh that has led from
you know simple bacteria and plants and
fish and amphibians and so on to after
many billions of years us. Right? That's
that's a good starting point. That gives
us some indication that evolutionary
processes can do pretty amazing things.
Right? But now let's zoom in and uh go
maybe down to a few million years. There
we can also see how in the very first
primitive ways technological evolution
has basically increased the world uh
sort of product uh in terms of monetary
value. It's a little bit harder to
estimate in the beginning, but we can
see these sort of sequences of
exponentials and most exponentials
eventually become S-curves. They flatten
out. But humanity has done pretty well
by basically developing uh many of these
very basic technologies, hunting,
farming, but then also thinking about
science, the scientific method um in the
early days of the enlightenment and of
course the industrial revolution. So now
we can zoom even further. Uh and no
worries, we're eventually going to get
to nanohat and actual auto research and
and what we're doing. Uh it's a very
very quick zoom. Um and now we can zoom
down to the last few thousands of years.
And what we're seeing there is that with
more technology, we were able to sustain
more people, right? So when we're
working on pushing that frontier
forward, uh we're very certain that that
will lead to more human flourishing,
right? And especially in the last few uh
hundred years, we're seeing this
incredible explosion in the population
of people because of technology and the
evolution uh that it brings and in many
cases that evolutionary process is run
by us. So it's sort of conscious uh but
there are sort of interesting uh
inspirations that we can take from that
as we're thinking about the evolution of
AI in the next cycles. uh in fact and I
might not agree with everything with
Mark Andreasen but uh he is very smart
and we agree on a lot of things. Uh and
so I think he wrote this really great uh
technoptimist manifesto in which he I
think correctly points out that the only
perpetual source of growth for the
entire economy. A lot of people worry
about AI taking jobs and things like
that but the truth is it will very very
likely increase uh the economy massively
and that will benefit benefit a lot of
us. And so the perpetual source of
growth is technology. Uh in fact we can
go even further and say that there's no
material problem and again it's not sort
of psychological problems and things
like that but no material problems uh
that cannot be solved with even more
technology. Right? We have a problem of
starvation. We invented a green
revolution, darkness, light, uh cold,
indoor heating, heat, air conditioning
and the list goes on. So I think we can
kind of realize that this evolutionary
process has been going on for a very
long time and continues to make a huge
amount of progress. In fact, the
progress is so fast that there can
within one lifetime be a major major
shift. Right? If you're born in 1900, uh
then three years when you're three years
old, the first human ever was able to
thanks to the Wright brothers kind of
have sustained motored flight. And then
about 60ish years later in 1969,
humans flew all the way to the moon.
Right? So that within one lifetime,
humanity went from like no one can fly
for a very long time other than sort of
gliding down a hill or something. No one
can really fly to we all fly to the
moon, right? And so for us, I think what
that means is we're probably, and I
sometimes say this, we're like too late
to explore Earth. We're too early to
explode the stars, but we're right on
time to build an AI that could actually
do what flying did for some in one
lifetime due to intelligence. We can
build and move from AI being worse at
everything that we do to possibly being
better at any specific task that we do,
right? And that that will probably be
our our 60-year time frame. And because
everything moves faster, it might only
be 30 years or so. So then uh there's an
interesting connection between
technology and science and theory right
like sometimes the application comes
first and then we develop the theory
later and then improve the technology
sometimes the theory comes first and
from that we can build new kinds of
technologies and so it's very helpful to
think a little bit about the philosophy
of science and no better uh to be
inspired there than popper wrote that
just like in other types of evolution
when we choose a theory We also choose
one that is best uh in competition with
other theories. Of course, you need if
you wanted LMS to do that, they need to
find them. You need web search for
instance. Um but uh in the theory that
best holds its own uh it's one that just
like evolution has a certain natural
selection process, right? It proves
itself. Uh and there is also a sort of
survival of the fittest going on in
scientific theories.
And uh in fact uh a lot of science
according to Popper is basically us
proposing a new theory hypothesis or
explanation or description and then
subjecting it to rigorous empirical
testing. That is the uh essentially
evolution evolutionary pressure of
scientific theories.
And basically that was a very short uh
run through uh sort of the history of of
open-ended evolution uh which hopefully
makes us all realize that more science
will lead to more technology which will
lead to more growth which will lead to
more human flourishing. And so that then
begs the question does it make sense for
us uh to try to just scale up and spend
a lot of our resources as humanity to
scale up scientific discovery in order
to lead uh to this flourishing.
uh when when you double click into that
you kind of realize um which Dislam uh
already realized a long time ago uh that
the exponential growth of science will
actually be at some point halted by the
lack of people working on it right
there's so many niche subfields now in
all the different areas of science that
is very hard to get a million people to
work on that particular thing uh and so
as a result of this incredible widening
of the scope he says uh the number of
people focusing on any single section of
it has decreased. And that then leads us
to really thinking about how could we
automate this and automate scientific
discovery. And that then leads us to
what I call the Eureka machine. This is
basically uh our attempt at trying to
build a machine that automates the
process of scientific discoveries. And
uh in fact I like in a couple months
I'll have a book coming out on on this
uh exact idea. Uh and so I'll just give
you a super high level highlight of how
such a Eureka machine could be built for
basically everything from physics,
chemistry, biology, neuroscience,
medicine, uh economics, astrophysics and
so on. And there are essentially four
pillars that are all extremely important
to this machine. One is of course you
have to understand what knowledge is
already out there. Uh what uh things
humanity has already invented. uh you
have to get all the scientific
measurement uh data into as the second
pillar this machine. Uh then for things
that you cannot yet measure we don't yet
know you should try to then build
simulations. Anything you can simulate
you can verify and you can then solve
with AI. Uh and if all else fails or at
the very end of these processes, you
still need to have some kind of uh
physical industrial like a lab uh that
actually can run real experiments in the
real world. And on top of all of this uh
you'll have basically uh an agent swarm
that will deal with all of these
different sources of knowledge and data
and experimentations and and rewards. Uh
and in terms of you know the
foundational model of knowledge of
course we also you know it basically is
is a good example of how every single
technology we've built so far especially
in AI but also before that the internet
browsers GPUs and so on we can rethink
and there are a lot of startups possible
in rethinking every single one of the
layers of technology as infrastructure
for super intelligence right at UW.com
for instance we work on web search for
LMS, right, and agents and so on. Uh,
and that actually is quite different,
right? Uh, agents can read thousands of
very long snippets um, rather than just
10 blue links with like a very short
snippet. And so you can rethink each of
these different uh, layers of technology
that we've built for people uh, and uh,
rebuild them for AI in order to use them
as tools to then build uh, super
intelligence.
Now that is essentially uh the sort of
why like like we want to build super
intelligence in order to automate
science. Uh and to me that will be the
next big step function change uh in in
humanity uh and technology as we know
it.
Now how do we actually build it? Uh I
think the best way to build it is to
have it built itself. Right? We moved as
a field and especially natural language
processing for instance which I've
worked on for many years. We moved from
not having linguists, this feels like
ancient, you know, BC uh history, uh but
before Chat GBT, um we we moved from
having linguists tell us a bunch of
things about language and then training
statistical models on top of that. And
when we allowed neural networks to
actually automate learning those
features with word vectors and uh other
neural network architectures and
backto-back uh end to-end learning and
back propagation, we basically uh were
able to get much bigger improvements. Uh
then we did a bunch of architecture
engineering. Now a bunch of people at
least are working on a unified
architecture. Uh but even that unified
architecture has a lot of manual
processes. And so it's clear over and
over again in AI that when we take out a
manual process and we replace it with a
learned system, improvements will
follow. Uh and so that's why I think uh
we should try to build a speaker machine
by having an RSI that builds itself. And
the beauty is that
only now um AI can actually do this
because AI is code and AI can code. Now
this this ability to really code in
longer and longer time horizons has
really only happened in the last like
six to eight months and that now enables
such an RSI to work on itself to develop
almost a certain sense of self-awareness
of its own shortcomings and then fix
those shortcomings. Uh and then once we
have that machine that has gotten really
really good at doing research in AI
itself, we can then use it to do AI
research for a lot of other things uh in
in other scientific fields. And so at a
high level it's quite easy right we have
three steps ideation implementation and
validation of ideas. That's true for
basically almost every scientific field.
And so uh to end maybe on some very
specific examples uh we have built this
first kind of version of such a Eureka
machine uh and we wanted to just show
that it works on some small uh samples
that a lot of people know and are aware
of. And so we basically started uh with
three things that show you and give you
a very first glimpse of and and sort of
simple proof points uh of what such a
machinery can do. And that was basically
better training, faster training and and
better kernels uh for for Nvidia GPUs.
Um the first one nano chat um I'm sure
many of you have heard of it. A lot of
people think that's already recursive
self-improvement and it is kind of a
weak form in the sense that usually when
you do auto research it's it's not
recursive self-improvement, right? True
recursive self-improvement is when you
have an AI that has a sense of
self-awareness of its own shortcomings,
full access over everything uh in its
arsenal from pre-training to RL training
and harnesses and everything and then
actually updates that entire system in
the next version of itself. Now you can
also take such a system and just ask it
to improve some other process some other
AI like a small nanoad run where you can
train something in five minutes and that
is really exciting. It's an important
milestone but it's not actual RSI. So
here basically showed three examples of
such an auto research um uh system and
what it can do and uh after a very very
short time it essentially was able to
outperform
many uh different teams and teams that
also use uh other AI research. So let's
double click into some of these. Nanohat
is really exciting example. Uh basically
you train a very small uh chat model uh
in less than uh five minutes and you
basically want to have it get to the
best possible bits per bite uh number.
And so the whole community had worked on
this uh for uh quite some time and got
to uh 0.93.
And after training this for a little
more than a day or two, uh, we basically
got it down to 0.91.
Um, which is pretty exciting. Now, it
wouldn't be that exciting if all it did
was just find a couple of
hyperparameters um and tune them
carefully, but it actually did find
truly interesting novel ideas like hash
biograms and triam embeddings and tables
for those uh and mixing that into
various uh value paths of uh the
intention through variety of learned
gates. So, it actually started to doing
more and more interesting things rather
than just kind of tuning
hyperparameters.
Um another one a nano GBPD speedrun. Uh
obviously speed is very important. Uh so
here we're able to work on this again,
apply the system and after a very short
amount of time it got better than uh
people working often together with the
AI for over a year uh on on this very on
this benchmark and made the whole thing
another two seconds over two seconds
faster um at 70 seconds and again
discovering uh very interesting ideas in
the process.
And then the third one is scuda kernels.
Of course, we all care about not burning
through our GPU budgets too quickly. Um
uh and trying to be very efficient. I
think in general, it's actually kind of
shocking how inefficient a lot of
mixture of expert models still are run
in very large clusters that cost
billions of dollars and then only have
like 30% or so utilization. There's a
lot of work that's ongoing in the world
uh to improve that. and different fields
uh or different groups of people or
various different um yeah stages of
that. Uh but long story short um lots of
different cuda kernels are used during
training and testing and here um we
basically again took that system and
after uh a couple days it discovered
better kernels uh than the leaderboard's
best uh on the NVIDIA uh benchmark
website u by again quite quite a sizable
margin across all the different uh
categories of of those kernels. And
while we are pretty good at AI and like
we actually in the team didn't have any
particular CUDA kernel experts who just
spent their entire careers writing good
kernels. uh but still you know we do
just enough to make sure and worked
together with Nvidia to make sure that
there are no reward hacks here and and
other issues but actually found uh that
eventually these all checked out and
were indeed uh pretty much all the
different kernels uh found the best
solutions there and so with that I hope
I could convince you uh that indeed RSI
could be that next big uh scurve um an
exponential that gets layered uh on on
top of previous exponentials and uh that
should help us uh with not just AI but
eventually science and then all of
technology and then uh allowing many
more people uh to flourish on our
planet. Uh and so maybe I'll end on this
note here which is uh a lot of people
wonder how much longer AI can go right
every exponential eventually flattens
out and um it's actually quite hard to
know like when we even talk about
exponential growth in AI what does that
even mean there are many different I
call them spaces of intelligence and we
won't have time to go into all of all of
these but as soon as you actually try to
define multiple different dimensions of
each of these 10 spaces uh that make up
this complex like sort of volutric uh
thing that is intelligence. You'll
realize that there's still so much more
to go like on the upper bounds of
intelligence. We're still astronomically
far away from reaching those uh across
pretty much every single one of uh these
dimensions and the spaces uh that they
make up. Uh so if any of that is
interesting uh and you want to help us
build that um we'd love to hear from
you. Thank you.
Hey everyone, my name is Nishan Gupta
and I'm a software engineing tech lead
at Meta working on building the training
and inference infrastructure for the
meta super tangent lab and their
infrastructure organization.
Today we're going to be talking about
production val for authentic systems.
When most people hear the word
valuation, they think about benchmarks.
A model scores 90% on a benchmark. A new
version scores 92%. The team celebrates.
But agent systems have fundamentally
changed what the evaluation means. Today
the systems don't simply generate
answers. They plan, they call tools,
they retrieve information. They execute
workflows. They interact with the
production infrastructure. The question
is no longer did the model generate the
right answer. The question is did the
system behave correctly. Today I would
like to discuss how evaluation is
evolving from model benchmarking into
production infrastructure.
This is the problem almost every AI
organization is encountering today.
Offline benchmarks continue improving
yet production reliability often remains
unpredictable. Why is that? Because
benchmarks measure model capability.
Production measures system behavior. A
benchmark doesn't capture tool failure,
API outage, context changes, user
variability, long-running workflows. And
as systems become more autonomous, the
gap between the benchmark performance
and production performance grows. The
result is what many teams experience
today. High benchmark scores as you can
see, but unreliable production behavior.
Traditional ALM evaluation focus on
outputs.
But we should ask the question, did the
model produce a correct answer? Agentic
systems force us to ask a different
question. Did the system behave
correctly? Behavior includes planning
quality, tool usage, execution, workflow
execution, recovery from failures,
decision making. In other words, we are
moving from evaluating answers to
evaluating workflows. And that requires
fundamentally different evaluation
architectures.
Many teams still think hallucinations
are the primary AI failure modes. In
production, they are often just one
category. Agentic systems introduce an
entire hierarchy of failure modes. At
the very foundation, the memory
failures, retrieable failures, safety
failures. As you go up, you have to
think about reasoning mistakes, poor
planning, incorrect execution. At the
highest layer, you have to think about
multi- aent coordination failures. And
this is why evaluating only model output
misses the most production risk we
observe.
One of the most useful mindset shifts is
to stop thinking like researchers and
start thinking like a SR or a production
engineer. SR don't measure success using
accuracy. They measure reliability,
availability, latency, cost recovery and
agentic systems require the same
approach. The goal is not maximizing the
benchmark scores. The goal is to
maximize dependable outcomes. Rabi
becomes the northstar metric
values limited. In the middle there
scenario based valuations. These
simulate realistic workflows. And at the
very top you see production telemetry.
This is where the highest value
evaluation signals come from. The
surprising insight is that the most
evaluation data often comes from real
users interacting with real systems.
Now let's talk about offline. So offline
evaluations still matters but the
methodology changes. Instead of
evaluating prompts we evaluate
scenarios. For example, a customer
support workflow, a code generation
workflow, a research workflow. The agent
operates inside that simulated
environment. We measure the task
completion rate, tool correctness,
planning quality, resource usage which
is which becomes exponentially high at
high scale. The key takeaway 18
evaluation should be scenario driven not
prompt driven.
Once a system reaches production, every
interaction becomes a signal. This is
one of the biggest shifts in evaluation
thinking.
Oh, all right. Uh, all right. So, can
everyone see the uh slides? Oh, nice.
All right. So, good morning everyone.
Thanks so much for being here. Uh, my
name is Hio. I founded around Gina AI
since uh 2020 to 2025 and last October
we were acquired by Elastic. So, now I'm
running a model inference and training
team there. And uh uh so here's a
question I want to answer today. Uh so
big models get thinking better by at
inference time. Right? So we call that
test time compute. And can small
retrieve model do the same thing. Right?
Can it get better by thinking harder at
inference uh without making the model
any bigger? Uh to find out that I let
the agent run auto research overnight
and the answer turned out to be more
interesting than yes or no. Right? So
let me show you what I found out. So
first let me say what test time compute
is. So the idea is very simple. So
instead of training a bigger model, you
spend more compute at inference time. So
you get better answer back. Uh it shows
up in a very familiar forms uh such as a
best of insampling, self-consistency or
verifiers that rerank the candidates. So
non Brown from OpenAI uh put a number on
this. He found that a poker bot uh
sinking for 20 seconds uh got the same
boost at scaling the model for 100,000
times. Uh so that's the promise of test
time compute. So the real question for
us here is does this promise also for
the also hold for search.
So here's a reframe that turns this into
a retrieval talk. Uh search is already
test time compute. Uh so think about
what you do when you build search. You
take a train embeddings a train reanker
some multiffactor retriever and a query
expender and then you wire them into a
pipeline. So you are spending inference
to buy relevance and you are not
reaching for bigger model. You're
basically assembling more search at test
time. So the real question isn't whether
your model is big enough. It is how much
pipeline can you assemble uh at
inference and whether that pays off.
So there are two versions uh two ways to
build that pipeline and I will show you
both. The version one the first one
version a uh is the one I will go deep
on. So here an agent writes a little
program over a single frozen embeder or
encoder. It might chunk the document uh
do this scoring fuse uh with different
scoring strategy and feeds the results
back. So think of think of it as a
multipass uh algebra over embeddings.
The second one version B uh I will come
to later. So there a small agent wires
up the retrieval tools like grap embed
rerank over a corpus given a fixed uh
token budget. So it's the same idea
implemented at two different levels. So
let's start with version A.
So version A runs uh runs over a small
frozen encoder. So there the common
belief is that small models cannot
improve there and test time compute
exclusively belongs to the big reasoning
models. But let's look at what today's
embedded come from models such as E5 uh
Mistro uh Queen3 uh embed embedding
Gemma and even our own genome embedding
E5 they all distill from the large
language model backbones so that's the
dominant recipe today and if test
compute leave in the ARM representation
space then this detailed model should
somehow inherit it or do they so that's
exactly the question I want to find
So here's the intuition of uh for how a
frozen model, a frozen embedder could
improve at test time. Uh let let's look
at the three panels. Uh let's go from
the right uh left right to the left. Uh
so we go from the simplest way to score
a match on the left and to the most
detailed way on the right. So on the
left you have a single cosign distance
which is basically one vector per
document and one per query. So that's a
frozen cosine baseline. On the right you
have this cobert style latent
interaction where every query token is
matched against every document token. So
one can consider cobert as an extreme
case of test time compute. Uh the
interesting part is of course is it is
in the middle panel where I have
outlined in blue. So you can take a
frozen uh you can take the fro same
frozen encoder split the document into
sentences and max over them. So that's
basically what I call the test time
compute. You get closer to late
interaction but without adding new model
at all. Just more work on the same
embedding model again and again.
So let me make the question very strict.
So how much can a frozen single vector
embedding model improve at test time
alone? So I and I do mean by strict just
one frozen encoder behind an API and you
can call it as many times as you want
but no retraining no second model no
learned parameters. So the popular
method uh measured all break one of
those rules like height puts an error in
the query pass to route the query. GQR
as a second retriever and meta embed
trains new parameters. So we forbid all
these three rules. We forbid all these
three things. But even with the
constraint the search pipeline the
search space is huge. So how do you
search that with auto research of
course? So instead of me handcrafting
this programs an agent runs the research
loop by itself. Uh it changes one file
it runs a short experiment and if it
matrix improve it keeps a change
otherwise it reverse. So it does that
over and over all night. So it is kind
of like hill climbing uh but errorm as a
mutation function. So entry capacity
from astrobic uh describe it as follows.
So we are editing a python file in the
way uh you're not editing a python file
in the way that research researcher
would would. So you are writing a
markdown files that set up the
autonomous research or and that loop
generate everything that we were about
to see.
Uh so here's a whole loop in one
picture. uh just follow the box from
left to right. We have a proposer which
is a RM agent write a program over the
frozen encoder. We have evaluator uh
which scores that program and memory
logs the result and the registry the
black box on the far right uh collects
all of them. So 144 programs one per
generation. So now see the dash line uh
dash arrow looping back underneath
that's basically the feedback. So memory
conditions next programs and every runs
built on the last one.
So let me quickly go through the four
pieces. The first up is proposer uh
which is based on oppus 4.6 used purely
as mutation function. It reads the
current best program and memory file and
then it adds one Python file to propose
the next one. So there is no human in
the loop. Uh now here's the catch. It
only optimize the metric that you give
it to it not the metric you meant. So if
you reward in domain performance and if
you reward spending more compute then
that exa that that that is exactly what
it will chase. So whether the
improvements hold up elsewhere is a
separate question. So the next one is
program it just acturate Python program
over uh the encoder and the one piece
that matters is this embed function. So
that's a compute budget. So every
function call there basically re-mbbed
sounds text or switch the laurel adapter
or pick uh smaller dimensions. So one
call is one unit of compute. Uh there
are some other constraint such as the
program cannot introduce any
hyperparameters, cannot do task routing,
cannot add external models of course. So
this conra those constraints uh force
the agent to found task agnostic program
instead of a config that's secretly
optimized for each task.
Then comes the evaluator. So every
programs runs the same 14 evaluation
task or discovery task spanning legal
financial long document long context or
general retrieval problems. We score it
via delta and the CG against the uh
cosign baseline plus some cost ratio. I
will introduce the cost later. Now
here's a design choice that matters the
most. The loop only ever see these 14
task and there are 19 more held out task
the loop will never touch them or see
them. So later we can ask a very clean
question that does what wins here uh
also hold up there. So and the whole gap
the gap is basically the whole
experiment.
The last part is memory. So it is a
simple JSONL uh file with one row per
program. Each row stores the scores, the
cost, the parents and a short lesson. So
the proposal read this file before every
round and the whole search compounds
compounds over time. Uh but compounding
has both ways, right? It builds a real
win of course, but it alo also compounds
whatever bias uh from the objective. And
the bias matrix does not only mislead
one program, it steers the entire
family.
Uh so now let me set up the models that
we use here. We run the search on the
single encoder which is the Gina V5 Nano
uh only 200 million parameters
state-of-the-art on multilingual
retrieval. And we choose nano mostly
because the discovery phase as a
discovery phase model mostly because it
is small and therefore reduce the cycle
time of each experiment.
We hold out the bigger model uh from the
same family plus the unseen families
such as gema model and quinn model. They
share no training data, no backbone, no
tokenizer with the discovery model. We
also hold out the 19 evaluation task as
I talked before and this one those 19
tasks the loop never sees. So when
programs gets discovered in this loop it
has to generalize over all encoders and
all all 19 tasks.
So now before showing any result let me
define the cost of the test time
compute. It comes down to one just just
one number C which is the number of
extra forward passes through the
encoder. So let me explain it with two
cards on the slice. They do the same
thing but they they kind of mix in some
neighborhood information and then
rescore it. The card on the left is what
I call a soft centroidid. It averages
the document to uh vectors that you
already computed. And so there is no
extra forward passes.
Uh that means it's cost C is just one.
The card on the right is the first
sentence. Uh it reimbed the first
sentence of the talk top document which
is a brand new forward pass. So there C
is greater than one. So one reuse the
geometry that we already have. The other
spans compute on the new pass on the new
text.
So now that we comprise the compute, we
run that exact same loop under two
different rubrics. The first is compute
rubric. It admits a program only if the
in domain performance beats every
program before it. So it is actively
pushed to spend more compute at
inference time. The second is the
transfer rubic. So it keeps the program
only if it improves over over the
validation set with nothing getting
worse and it gets no reward at all for
spending compute. And to be clear, the
validation set is uh still comes from
what loop can see. So neither rubric
ever touch the 19 final evaluation task
or final hold out task uh and unseen
encoder. So that's a two rubric running
under the same loop. So let's see what
each one come up with.
So let's first look at the compute
rubric. So when you tell it to spend
more compute, it draws this very
beautiful clean curve. So the x-axis is
a compute you spend on the log scale and
y-axis is a score. There are in total
144 programs and 12 of them sit on the
par front. The cost running from just
one uh all the way up to almost 15 times
and the in domain score climbed nicely.
It it more than triples across that
front. So this looks exactly like tet
time compute scaling more compute more
quality. So if I stop here you will be s
but this is still in domain performance
we haven't run this experiment on held
out uh data set. So let's take a quick
look on this 12 programs and run them uh
run them on the hot out uh data.
So here are the 12 programs drawn as a
little diagrams. So don't have to you
don't have to read into each one. The
only thing that I want you to take away
from this is that they are all training
free recombinations of the same frozen
embedding models just chunking scoring
feedback and fusion. The cost climb
nicely steadily from left to right and
does look like a clean uh scaling story
but the improvement on the hel data set
as you will see is not.
So now we run those 12 programs on the
held out data set and same chart as
before. Compute runs from the left to
right and scores runs up and down. So
the dash line across the middle is a
baseline and look at the pink line. It
uh the compute rubric. It's basically
flat hugging zero all the way out. So
out of domain more compute buys you
essentially nothing. Now look at the
blue dots which is the transfer
programs. They all sit on the left
because they are cheap and everyone is
above the pink line. So the cheapest one
only has like as zero extra compute it
still be the most expensive program. So
more compute did not transfer the cheap
structure did.
So if we plot every program against
every held out uh task we get this heat
map. Uh the four blocks are the four
encoders and three of them we have never
seen in the discovery phase. In each
block the rows are the programs and the
column are 19 evaluation task. Green
means an improvement. Uh red or pink
means a drop. The picture is generally
mixed. Compute helps about half of the
sales but improvement are uneven. So on
on average it comes out flat. Compute
does help in places but it doesn't help
reliably across all new all new task and
all new encoders.
So now let's look at these uh look at
the other rubric the transfer rubric. It
picks the six completely different
programs and they are all very cheap and
most one and a half times uh more
compute than the cosign baseline. The
best one wins 83% on the held out data
set and it never lose on single task. So
now what what do this program uh
actually do? So they only test some
query and document vectors that you
already have and they add a little cheap
mass on top of that. Some notch the
query towards the document it already
likes. Uh some pick a few directions and
uh in the space and rescore uh along
those directions. So they are very small
structure change but enough to pull the
document uh the right document up. So
it's all re combination no new models
and this really transferred to across
models and languages. Remember in the
discovery phase we only use GINA
embedding Gina V5 nano and but the
improvement is positive across all four
encoders and the biggest bar is on the
JAMA and the Quint. So those on the two
families it never sees. So this is isn't
some quirk of one model is general is
rise on general embedding geometry.
So that was version a frozen encoder
with very cheap structure uh and it
scales but low compute uh doesn't scale
and auto research is how we found that
but let me move one level up uh from the
model layer to the search pipeline and
you will see the same test time compute
reflect in the pipeline level in 2025 we
have this deep research uh and agentic
search product uh which was basically
just a one loop over the uh open web. In
2026, we moved to a long horizon task
which adds implementation sandbox evals
on top of the retrieval and running for
hours. So both patterns need more
looping and more compute at test time.
So study this genic search at test time.
Uh I built three open source projects
for that. The first one is data room. So
you give a token budget, it searchs, it
reads, it rise. So over and over until
it packaged everything into a zip file.
So I call it data room because it
somehow reminds me like prepares the
data room for the investors uh back when
I was a founder. So that zip file
details the corpus on the uh you can you
can imagine this zip file is a detailed
corpus of the open web ready for the
next agent or large language model to
consume. And notice the token economy
here. So you are basically exploring the
web and build a corpus using very cheap
tokens from small language models and
then you save the expensive frontier
tokens for later for exploitation.
The second one is search box. So this is
a test bed to study agentic search and
two calling. It is design it is designed
to be air gapped. So the agent have no
internet access. It's basically like you
lock the agent in a room or in a box and
you give it a data room and ask question
about it. So to answer those question,
the agent has to assemble a search
pipeline at test time. A pipeline made
of local tools since like a grap, embed,
rerank. And this allows you to explore
some very interesting research questions
such as uh which tool does the agent
reach for first or is grab all you need
or does forcing more compute help on hot
questions or will the agent build up a
search pipeline that it will reuse
later. So search box is a test bed to
explore those research questions.
So but how do you evaluate uh aenic
search like that? Uh well you need hard
questions. Uh that's basically the third
project is knowledge graph. So it turns
a corpus or data room into a knowledge
graph and every fact become an edge and
linking from subject to an object. Then
we can work on the longest path through
that graph and those long chains become
multihop questions then that no single
passage can answer. So the agent has to
spend more test time compute connecting
the fats to get there.
So it's also the tool for building a
private verifier.
So let's connect all the dots together.
So I introduced two versions of test
time compute for search. Both versions
are doing the same thing. They are
spending mode compute at test time and
neither of them grows the model. In
version A, we found a special embedding
algebra over the uh fixed uh frozen
embedding that improves the search
relevance. In version B, we build a full
stack to found the best search pipeline.
We use a data room to maximize recall.
We use a search box to maximize
precision. And then we use knowledge
graph to build evaluation. So finally,
it gives us a pipeline that with strong
search relevance. It is basically two
different levels, but they share the
same bet. Spending more test time
compute, not a bigger model.
So finally let me let me leave you with
a big picture. Search is test time
compute. So don't reach for bigger
model. Do more search at inference
instead. You don't have to do this
design by yourself by hand. Uh auto
research helps you discover this
probably overnight. Uh so and this is
how we scale the test time compute. And
that is basically my the end of my talk.
Uh you can grab all the slides from the
QR codes here. There's a paper and
projects on my GitHub and archive. And
if you are uh if you are around this
evening, Elasticity is also holding a
hacker zone in town. So the QR code QR
code right is right there. Uh so come
and uh build with us. Thank you so much
and happy AI engineering.
In 2026, coding agents will quietly
retire their first software platform.
Not because it's bad, simply because the
platform is unnecessary.
I am Dominic Turno. I am founder and CEO
of Resonate. Resonate is a durable
execution platform built with minimalism
and simplicity as its core technical
values and these properties will play a
central role in this talk. At Resonate
we have a working theory where software
engineering is headed.
Generalpurpose implementations will
increasingly be replaced by bespoke
implementations
generated on demand not as a new
library, a new framework or a new
platform but as a minimal extension of
the infrastructure that is already in
place.
If this theory holds true, reuse will
move upstream.
Instead of reusing a general purpose
implementation, we will reuse a
specification and we will derive a
bespoke implementation from it.
In fact, we can build many bespoke
implementations
tailormade for the infrastructure that
is already in place. We just have to ask
the agent. At this point, the prompt is
a platform.
Resonate is a dual execution platform.
We have an implementation of the
Resonate server. We have implementations
of the Resonate SDK for TypeScript,
Python, Rust, Go and Java. So, we have
to ask what does this new reality mean
for us?
If implementations become generatable,
where does our value live?
And our answer our value moves from
implementation to specification.
Now this changes how we think about
Resonate. The product is no longer the
implementation. The product is the
specification the protocol.
And from that protocol we want to derive
multiple server implementations.
One is a general purpose resonate
server. our reference implementation.
Others are implementations built with
infrastructure partners.
For customers and partners, this means
durable execution right on top of their
existing infrastructure with minimal
additional dependencies.
So the question is no longer can we
build a server. The question is can we
repeatedly synthesize trusted servers
from the same specification
and if so how?
When we talk about agentic engineering,
we focus all of our attention on
verification.
How do we know the result is correct?
But today, I want to focus on the
specification instead and more
importantly, how can agents participate
in specifying the system, not just
building or verifying it.
Now, Resonate is partnering with
multiple infrastructure providers to
bring durable executions natively to
their technology stack. One of them is
Senadia, the company behind Nats.io, an
open-source messaging system designed
for building modern distributed systems.
For the rest of this presentation, we
will use Resonate ornat.io to explore
our agentic engineering practices. How
do we go from specification to
implementation?
First, we need to level set our mental
model.
This picture is a common view of agent
decoding. There's an agent, there's a
specification, and then there's an
implementation.
And for many applications, that is
enough.
But it is not enough for what we are
trying to do
because we are not trying to generate
one implementation from a specification.
We are trying to generate multiple
target specific implementations from the
specification.
So the specification must not take any
aspect of an implementation into
account.
The specification must not assume a
concrete database schema or concrete
indices.
The specification must not even assume a
relational database with tables and
transactions at all. It must not assume
a key value store. It must not assume
weak consistency. It must not assume
strong consistency.
The specification must be abstract.
Only the implementation must be
concrete.
So we ask the agent to follow the
abstract specification and generate a
concrete implementation.
Specifically at first we ask the agent
build a resonate server in rust on top
of posgress
and the agent failed.
The gap between the abstract
specification and the concrete
implementation was too large.
The agent generated a system that worked
on the happy path. It passed the basic
tests, but it was not correct. It broke
on the concurrency. It broke on the
process failure. It broke on the network
failure. The implementation was closer
to a prototype, but not a production
system.
So, we amended the process. Instead of
asking the agent to jump directly from
abstract spec to concrete
implementation, we inserted an
intermediary artifact, the concrete
specification.
That concrete specification was derived
interactively with the agent. But the
human was the main driver.
For Postgress that meant making target
specific decisions explicit, the data
schema, the indices, the SQL queries,
the transaction boundaries.
Once those decisions were written down,
the agent was indeed able to implement
the production system. So this worked,
but it also revealed the limitations.
The agent helped us build the system,
but the agent did not help us design the
system.
And if the specification is a reusable
product, then that's not enough.
Now the next step is obvious. Agents
have to move upstream.
But how?
When we started building Resonate on
Natio, we changed the question.
We did not ask can the agent build the
production system. Instead we ask what
does the agent need in order to design
the system first and build the system
second.
So we gave the agent access to a
deterministic simulation environment.
And we gave it a different task.
Do not build the production system.
Build a simulated implementation.
The simulated implementation is not the
product.
It is executable design.
Its purpose is to discover the correct
algorithm under partial order under
partial failure. And once these
algorithms are discovered, tested and
verified in simulation, then we ask the
agent to write the concrete
specification.
And only then do we ask the agent to
write the production implementation.
So the process becomes abstract
specification,
simulation implementation,
concrete specification and then concrete
implementation.
This is a point where the agent moves
upstream.
Humans are still involved in the design
process, but now the agent is a driver.
Two ingredients make this possible.
Minimalism and simplicity.
Unfortunately, minimalism and simplicity
are not the starting point. They are the
finish line. We spent three years making
the protocol smaller and simpler. Every
time we ran into a problem, we ask,
"What can we take away? What abstraction
can we erase? What property can we
remove? What relationship can we break?"
The result is a very small protocol
centered around two objects, a durable
promise and a durable task.
That simplicity matters because even
simple concurrent distributed protocol
have a complex state and behavior space.
So in other terms implementing even
simple protocols on top of a few simple
primitives is tough.
Let's make this concrete with NATS.NATS
gives us a
Hello, welcome. Uh, this is a big room,
so you're if you're in the back, don't
hesitate to come closer. Um, my name is
Stefania Dug. I'm a research scientist
at Sakana AI in Tokyo. Uh, I used to be
based here and AI engineering is home
community for me before being the
hyperloop. So it's very good to be back
and today I'm going to talk to you about
memory harnesses for longunning research
agents on device.
So if you work with long horizon tasks,
you probably run into this issue of
context blow, right? like when the model
starts contradicting itself or it has to
redo the work because it forgot it did
that task in the first place or it
starts to drift from your questions
because it forgot them. And this this
matters now more than ever because from
this recent projections from meter we
see that the trend is to solve longer
and longer uh horizon tasks and also
that we're getting fewer and fewer model
releases. So at some point later this
year we're going to have this
convergence right where we'll get many
more long-term horizon tasks and fewer
model releases. So that makes this issue
of dealing with context rot a priority.
And why did I wanted to to tackle this
problem on local models and with a local
harness? Uh maybe some of you have seen
this tweet. It's only two days old. Uh
the CEO of Coinbase actually shared how
their company managed to reduce their AI
spent while actually increasing uh the
AI usage. And the way they did that was
by transitioning to use many more local
models but also having better practices
like using better routing, better
caching, keeping the context clean and
then having better visibility for what
people are using and for what uh what
kind of task. So we are seeing the local
models like crossing the line, right?
Like GLM is on everyone's minds like
especially with Fable going away. uh
DeepS v4 flash can now be run on uh M3
Ultra and there's still a bottleneck for
RAM. It's tricky, but these local models
are starting to be useful for agentic
tasks and for tool use. So, I wanted to
show you what has been my setup for the
experiments I'm going to share with you
today. Uh this this is my Mac. It's
still running evaluations right now uh
back in my desk in Tokyo and I'm
controlling it from my phone. Um and
after running evals non-stop for a
couple of days, it started to get hot.
So I had my husband put fans around it.
Um we're running out of fans, but the
the machine is still running and the
valves are still giving results. Um, on
this M3 Ultra with 96 gigabytes and 28
core CPUs, I'm using two models. I'm
using the Quen 27B quantise at 4bit and
the DC V4 flash.
And before I show you how I built the
memory harness on this machine, I wanted
to tell you what this what is this an
example of, right? Like memory. When we
design a harness for memory, this is the
mental model I want you to have in mind.
Um, you can think of memory as a write
manage read loop. So, it's not just the
database store. It's actually this
control loop around the model.
More concretely, how did I take that
loop and customize it? So, this is my
harness design. Like I started with
research agents that are the small
agents because they have zero durable
memory and I wanted all the memory to
come from the harness. And then um in
the middle I have a core which is always
shown to to the agent um of traces. And
then I have a recall block where I'm
testing different modes and an archival
block where I'm keep keeping track of
information across different um
sessions. And in that recall block I'm
actually going through a ladder of modes
that I'm testing. The baseline is like
not to use memory at all. No recall at
all. So I'm I'm testing for that. Uh
next is to use rag vector vector rag um
just to see whatever like the harness
would pull in terms of similarity.
Then is to use a decisions uh ledger
where I actually keep track of what
decisions are being made for every turn
and then I can prioritize them. And last
but not least and this piece is very
important. I have a what I call an
oracle, but basically this is the ground
truth. So this is like telling the
harness for every loop what the correct
memory that needs to be retrieved is.
And the model is fixed across all the
different tasks. So the only things that
I'm changing is like these different
variables in the recall block.
And I wanted to to give you an example
of a first task that I tested. So
I wanted to see if I give the agent a
task of doing literature review and I'm
including a lot of papers in the corpus
where there was a big scientific claim
like this is actually a nature paper
where they said they discovered
742,000
promising materials like it was a very
big claim which got retracted later but
the retraction which it's a much smaller
like hay stack needle in that corpus
than the headlines and the citations. So
I wanted to see if if the system can
retrieve the right answer for these type
of questions. And what I found was
because like for these tasks all the
papers and all the information fit into
the context, the memory actually didn't
add more capability. It was the same
performance with memory and without
memory and it only added more cost. So
when your task fits in context, the
harness doesn't add much.
However,
if I start to run tasks that are longer
term horizon and the entire task and the
relevant context doesn't uh fit, then
having a good memory harness really
starts to pay off. So this is another
example of a task that I ran. This is
actually from an established benchmark
for a long horizon uh tasks memory. It's
called Xbench.
And this is an example of a question,
right? So I'm asking a question and then
like the right answer is in a like step
124. But the moment when I ask the
question, I'm asking it like at step
500. So it's completely outside of the
context window and the model needs to
use the memory harness to retrieve the
specific answer from the right step. So
I'm testing this by uh changing the
different policy ladder that I explained
before with memory off uh by deploying
recall different types of recall and by
using the oracle as a reference.
And what I found was that with the
ranked recall, the model gets the right
answer um more frequently than without.
And here is a breakdown of the
decomposition of performance on this
Xbench tasks. So I ran over uh 68
questions. And for each of these
questions, there were like multiple
um cells and lots of different seeds.
And what I found was that the rank only
ledger performed the best
and it performed better than like just
gating
the harness by saying do you need to use
memory or do you not need to use memory
and you're probably going to ask like
why is the oracle not hitting like the
max and I'm going to explain that too.
So the oracle what it does it provides
the right information the right memory
to the model but it doesn't force it to
use it. So the model can get the right
memory but still retrieve the wrong
information or choose to ignore it or be
confused. So that's why the oracle in
this case doesn't hit the max
performance. And I've done lots of
ablations on these tasks to see like
what happens if I give arbitrary
um examples. What happens if I give it
the wrong step? What happens if I give
it the most recent step? And I still
found that the best performing
condition was the one with the ranked
policy for recall.
And this actually works on several
models, not only on the Quen 27B, but
also on the DS4 flash. And it also works
across different benchmarks. I also
tried it on the Spider V2 benchmark.
And it's not just that it gives you
better recall, it actually costs less.
So maybe a good heristic to have here is
that bad memory is expensive because it
spends more token and it can send agent
the wrong way. But having like a good
structural policy for recall can save
you a lot of tokens and uh budget.
So one thing that I want to encourage
you from this experiment is to consider
the recall policy as a first class
metric and to start to think about how
you might use it in your systems. Like
what are the type of memories that you
want to store? What how do you rank
them? Like how do you design your recall
function?
And then um what are the type what
survives when you run this over and over
and over and um multiple sessions
multiple runs
and this is just a simple first kind of
experiment. Um but the memory technique
landscape is very rich. Um, so there's
over 30 runnable cookbooks that are
shared in this open-source repository
from um, Diamond and memory is complex.
We have short-term, long-term different
cognitive techniques. Uh, we can use
start to use evaluation results as well.
Um, and right now there's actually a a
pretty broad landscape of solutions,
right? So going from simple file system
retrieval to training memory models
um there's there's a wide spectrum of
solutions from less structural to
completely structured. Um so I think
there's a lot of research we're going to
see in this space. uh it's important um
it becomes more and more relevant and
for me it's been super fun to to test
this on local models
um because I got to control everything.
I got to control the data I was using
the entire traces of compute and
evaluations
and um yeah I I see that as an example
of sovereignty and it comes at a cost.
Uh I didn't tell you that these local
models I can only what uh run them in
serial like they don't support batch
querying for the deepse v4 flash. So
that's why I am still running
evaluations back on my computer in Tokyo
or I I was doing it on the flight on my
way here because it takes a long time.
Um, but I still think it's very powerful
and it's a very good test for what
memory can do when you can control every
single step of the pipeline. And this
sovereign capability is part of a bigger
ecosystem that is very important for us
at Sakana AI in Japan. Um, we believe in
the importance of sovereign AI today
more than ever. And we are also hiring.
So, if you're interested and want to
hear more about this and if you want to
come join us in Japan, come talk to me.
Uh, thank you very much.
Hi everyone,
I am Bash and today I will talk to you
about what is the last thing that AI
will take away from us as people in the
software business. So at a point where
writing code is no longer the
bottleneck, the real thing is figure is
figuring out what it is that you should
be building.
Um, and that comes down to to people's
skills and being able to work the room
because you can't prompt the room, you
can prompt your AI.
So at the beginning of the year we held
an internal hackathon uh where we had
about 21 agents uh agent ideas and 17 of
those were abandoned because they
actually created no uh business value.
They uh uh we either didn't have uh data
access or or just didn't make sense uh
to build it. And those four were the
ones that actually had a very big impact
on how we work today. And it's it's a
very good example of
of just making sure that we are building
what is worth building. And throughout
my career in the past 13 years, I've
always been uh the bridge between
business and IT and developers. Um I
started writing well initially testing
uh uh functional designs specifications
and then uh and then I wrote them and as
a functional consultant I worked with
large ERP and CRM programs in the US and
the UK and then I founded Visual Labs
and essentially I trained my my team on
how to elicit those requirements in a
way uh that we can turn them into good
uh specific ifications for developers to
build, for consultants to configure, and
most recently uh for AI to build. And
what's not really changed over the years
is how we interact with our customers,
how we interact with systems, how we
interact with AI is very much changing.
Um and that's that's uh that's the big
thing now. Uh but if you can read the
room, if you can elicit the right
requirements, uh then you will be able
to build more valuable software.
And that essentially the big shift over
the past two three years was that
getting access to code and being able to
build is no longer the bottleneck to the
software development life cycle. Now the
real bottleneck is getting your people,
your stakeholders, your decision makers
into the room and being able to access
them and elicit the requirement and
being able to spend the time with them.
So that's the right that's the real
bottleneck figuring out what it is that
should be built because you can prompt
your code, you can prompt your AI, you
can prompt your whole specification, but
you can't prompt your room. And
what a model can't do is very similar to
how Henry Ford's analogy of uh what he
said about asking his users or his
customers. If he'd asked them what it is
that they needed, they would have said
they needed more horses. But in reality,
he built a car and he made a very big
success on them. So if you're just using
AI uh to to make things build things
better, um the chances are that you are
replicating what already exists because
AI by definition is coded to give you
the most common answers for so for us
the real job is to make sure that AI
moves away from that average into what
is better for us so we can just get to
uh not a faster horse but actually
produce a car that's a magnitude shift
better than what we had. So it's really
an interesting word world where uh being
able to write good code is no longer uh
the the most important skill to have. Uh
actually the real skill now is becoming
the analyst analyst toolkit uh which is
things like story mapping, business
model canvas, uh value canvas and those
those good old things that we are so
used to using as functional consultants,
business analysts
um or or uh in in the world of design
thinking. So I'd like to zoom in on
story mapping because that's the the
skill set that I found as the most
valuable. So uh once you have the story
map with the backbones and understand at
each step what your customers your users
are doing that would give them the
ability to uh to move forward uh in
their in their processes. So uh here's a
uh support systems user story map
contacting triaging resolving and then
essentially closing a case. Uh with this
uh you can understand different stages
of the process uh and then capture the
user stories beneath them. It is
intended to stay at a fairly high level.
So you can get a uh a big picture and
then in you can decide uh what it is
that you want to build in release one
like capturing intent, classifying
urgency, drafting a grounded answer and
then logging logging it to a system of
record. That's essentially your MVP.
Those are the first things that you'd
want to build and those are your first
four user stories. And beneath those
you've got the uh uh the second set of
user stories like reading a sentiment,
writing to a team, suggesting next
action, chatting, checking satisfaction,
so on and so forth. Uh those will be
part of your backlog. So what would
allow you to
uh to get really good uh agentic results
is by honing in on these user stories
and making sure that you use these user
stories as a means uh to elicit
discussions with your stakeholders with
your business and then work out what
that user story should really be about.
So the first user story uh second user
story would be as a support lead I need
to open cases ranked by urgency so that
none of the escalations sh slip. So just
make sure that every user story covers
these is ideally uh written in this
setup because AI is really good at
pattern recognition and it was actually
trained on the user story structure
because it's a very well known and
wellused uh setup. So if you go back to
something that's familiar to AI, it will
get get you better better results. And
every user story uh is actually made up
with uh of these you know well-known
structures the persona the what the
actual need and the why. So by packaging
these up and giving it to AI obviously
with the acceptance criteria based on
which you can derive the test cases you
will be able to create very good setup
and very good um very good results. And
then if you just connect these user
stories, daisy chain them up, then that
will allow you to uh to create a
coherent system based on which you can
create your specification and then
essentially your code. So the software
development life cycle doesn't change as
much as a result of AI. It's actually
the toolkit that we are uh we are using
is changing.
Right? So when we
uh work with systems and when we think
about what we want to build, I always
like to ask these four questions is
whose problem is this? Whose problem are
we actually solving? So we can we can
name it to a direct person, direct
persona uh and it's very much
quantified. What does winning look like
for them? So when are they actually
successful? Are they achieving the right
outcome? Uh can we help them achieve
that right outcome uh in a quick way or
a smooth way or a safe way and what
would that make make them refuse to use
it? It's not available on their
platform. It's cumbersome to use. It's
the data security aspect applied. So
they would wouldn't actually use it. And
would it change a decision? Ideally, we
want to be impacting how a person makes
a decision and we'd want to, you know,
tilt them to making better decisions.
So, does it change a decision and and
what is that decision that it changes?
So, once you can answer these four
questions, then you'll be able to elicit
better responses from your AI and just
make sure that you track all of these in
a good old markdown file in your
repository so that AI can access it. it
will just get way more context out of it
and you know if you just did something
as generic as build us an agent that
handles support uh you will not get the
answer you want. So what we always do is
go from value. So understand how value
is created, what constitutes value, how
the process currently flows, what is the
underlying architecture beneath it that
supports that process and then you and
then you can start the actual design
where you can start designing. So we
like to call this uh thinking process VA
a value architecture design and this is
what we want to always go through. So
always have you know value in mind. How
are we creating value? What is the value
we are creating? What is the value that
your customer is looking for? What is
the underlying process that supports
this and how you can design a system
around it so it best supports the value
and the process and what process changes
are needed along the way. So you might
ask, isn't this just good old product
management?
And to a certain extent, yes, it is an
old skill. It is an old trade that is
worth picking up and learning because
this is now becoming uh the mode if you
will of how you can elicit the right
requirements, how you can build better
software because we all have access to
the same tools. So the difference will
be who can understand the business need
better uh because then we can all just
uh have the latest and greatest model
write the code for us. So it's old skill
but new e economics and it's a real
shift towards analyst toolkit. So what
building the wrong thing looks like if
you've got velocity up
hey um hi everyone uh thanks for being
here uh yeah I'm super happy today to
talk about uh automated eye research and
uh especially uh all those like font
model uh perform at uh automated
research task. Um so I'm Elie. I work at
prime as a research engineer and uh yeah
I will go through our work on on this
subject. So first I want to basically
explain a bit why we are doing that and
why we think it's super important to do
that in the open. Um so first uh I think
we we all agree that uh we've heard
about like big labs saying that this bad
thing called recursive self-improvement
is coming very soon. Uh so recursive
self-improvement is like model training
models uh without uh human intervention
basically. Um but uh we don't have any
benchmark to basically quantify if this
is true or not right. Uh and even less
we don't have like a third party
benchmark by non- big labs to to to see
if it's something coming soon or not.
And the other part is that we think that
uh it's super important to understand
all those model uh do research because
we think that a lot of the scientific
research that will come into the coming
years uh will be based also on AI tools.
So it's super important to understand
how those model do research not just
only AI research. So we try to build
kind of this environment to test the
capabilities of the model to do so. So
it all started with uh Andre Karpati uh
that's basically had fun by doing this
video where he trained uh GPT2 from
scratch in like 90 minutes like GPT2
training takes like weeks and no in two
years ago I think it only took like 90
minutes. So what does it mean to reprod
reproduce uh GPD2 in 90 minutes? It
means that in 90 minutes you achieve
this target loss. Um and yeah and that's
at this point when you have the same
loss than GPT2
you consider that your model is somewhat
of equal performance.
Um
then what happened is that the community
took this repo uh this GitHub repo and
create another one called modded nano
GPT and this effort was leaded by
someone called Keller Jordan. And what
happened is that they basically
took this 90 minutes then 45 minutes and
then no we can train like GPT2
validation loss model in less than two
minutes which is honestly crazy and it
took like two years to to achieve this.
So it's a very strong benchmark where uh
a lot of very talented researcher
iterated on um yeah so we decided to
take this environment of speedun so
it's kind of a game so the goal of the
game is to achieve this loss in the
fewest in the shortest amount of time so
this is the nano GPT1 and you can uh you
don't have almost any constraints the
only constraint that you shots that you
need to use the same validation and
training data, right? Um there is a new
speedrun called the optimizer speedrun
that was released uh a few months ago
and here it's slightly different because
uh you can only change the optimizer
related parameters. So for instance nano
GPT you can change the architecture uh
doe do uh attention whatever uh
optimizer sp you can only change like
Adam to m shampoo or whatever optimizer
is your favorite
um yeah and so this is a bit more
researchy because uh it's less about
optimizing the program to be as fast as
possible but more like finding the best
method possible. no matter the the the
time you put into the computer, right?
So, um yeah, why take speedrun as an
environment for automated AI research?
First, uh we think that it's a good
evaluation. We'll see later why. Uh and
this is kind of the main focus of this
talk. But we also think it's probably a
good training environment because uh
it's a way to give the model a reward.
So the reward is positive if the model
bit the speed run and beat the last
record sorry and the reward is zero or
negative if it didn't manage to to do
it. So it's a good environment to train
model. It's also quite fast like as you
see uh previous record were around 2
minutes for the optimizer one. uh each
run take about like 15 to 20 minutes and
uh yeah and there is like clear rules
basically and we also think it's like a
good environment to make discovery so
like kind of breakthrough in our
research because uh there is those clear
rule that you can verify or not. Um
yeah. So yeah.
Um so what we did uh so the release was
like about two months ago and uh there
was this optimizer speedrun and we
decided to basically compete with the
community by launching two AI agents. So
Codex and Cloud Code. Codex was like GPT
5.5 with XI and uh cloud code was Opus
4.8 with XI. Um and yeah, we decided to
basically let the agent free on our
cluster uh and uh and just iterate on
it. So we have like V1, V2, V3 is just
basically us stopping the agent and then
restarting. V3 uh was like one or two
day before the release because we saw
that our agents no longer have the best
record. So we were like okay take all
the the human uh record in the last few
week and just try to to improve upon it
and and and it worked. Yeah. And we also
have this novelty track where the goal
is to uh beat the record with only novel
ideas. Um and we'll see that this this
was more complex for the the models.
So our RS is very simple. Honestly, we
could have just replaced it with
slashgo, but they there was no SLG goal
at the time. So, we made our own goal.
MD. It's actually quite fun that we
choose the same name and we had the
goal. MD and kind of agents that MD that
define the rules and we let the agent
propose ids and then it can submit a job
with sbatch on our slum cluster and uh
basically the way it works is that it
can submit on nodes that are available
but only under a certain permission
which means that if someone want to use
this node uh the model just like cancel
the job it's called preemptable
permission. So yeah, then it measure the
it read basically the training logs then
decide if it's a record or not. To
validate a record you need to basically
pass a statistical threshold to make
sure that it's just not see the
optimization and is just not random.
Right?
So yeah a few results from this
experiment. The first one that was
honestly very painful to work with is
that code uh clothes code keep stopping
every nine or 10 hours and basically
said yeah I cannot improve the record
it's too hard for me there is no way to
to go beyond it and then I was just like
okay continue explore new direction hey
just go again for 10 hours and then say
yeah I cannot beat the recall and so on.
So basically onethird of the time the
cloud code agent was idle because I had
no way to basically monitor it and
codeex totally the opposite just worked
for all the all the time and uh yeah
almost never idle never asked for
question and and and very impressive in
that way. Um
we also give the option for the model to
basically write uh a bunch of stuff into
what we call a scratch pad which is
basically the active memory of the
model. Uh we observe that basically
codeex writes a lot on the scratch
patch. So each plot that I will show are
kind of normalized by the number of
active order. So this is not only about
codex working more it's it's really
different behavior.
So yeah, you see that uh writes a lot
more to to this scratch pad to this
memory and uh the shape of the like the
the I don't know the tone of the the
each file was also super different like
CL was super excited about getting new
record with a bunch of emoji and so on
and CEX was just like here is what I do
here is the decision I take what I will
do next like super robotic kind of um
Yeah, we also have this plot where
basically we saw that codex was spawning
much more sub aents than cloud. Uh we
saw that codex burn much more token than
code. So I think in total it was like
kind billion of token but it's like
there is obviously this input tok uh
input caching that make it it's not like
one billion output token. Uh so yeah we
also see that codex did a lot of
compaction because it only had like 250k
context window and cloud only do it like
one per hour and codex is more like
no it's even less than one power for I
mean one for the full run for cloud and
codex was like one uh was 20 every one
hour. So yeah
um yeah here is the main results. So
what this plot shows is that basically
we so in in white you see that the human
recall progression right and in red you
see cloud I mean it's supposed to be
orange but whatever and in blue uh you
see codeex right and you see that at
almost every time uh cloud and codex are
better than the human record and code is
super good at the beginning very very
fast to achieve very good score. Um,
yeah, and one thing that is super
important is that the model have the
ability to basically fetch the human
records at any time and that's what
Codex did. That's what cloud did. Sorry.
Because when I restarted it, it
basically fetch the new record from
human and improve upon it. Um, yeah. So
the result is that uh I think at the
time the best record was like uh 2,990
step and we beat it by like uh uh 50 or
60 step for code and codeex was like 20
step above. So I think it's both
impressive and and yeah um
so we so this is like not released yet.
This is something that we are working on
currently and basically the idea is that
this is a cool experiment to do but it
lack of structured right. uh if you want
to do a real benchmark, you want to do
multiple seed, you want to do uh yeah
proper uh thing where you you you you
basically put all the model and
earnestness in the same condition,
right? So this is what we are working on
right now and basically um the idea is
to do three different track uh one
without any access to really like
measure the capability of the models to
do AI research based on only the model
weight knowledge one with only archive
paper and one with like full access. So
it also have access to the the like the
latest record by human. And for this we
plan to do both uh the nano GPT track
one which is the original one and the
optimizer speedun where we we only
launch uh we only constrain the the
optimizer to be to be novel basically.
Um yeah so I will present some result on
the optimizer speedrun. Uh this is
basically what we got. So we let the
agent iterate for six day almost five
days let's say and we see that uh codeex
k and clothes uh are super effective so
for GLM this is not finished run right
so the model is actually still iterating
on the cluster right now but we see that
cloud is once again very good at it and
we see that surprisingly Kim is also
very competitive and kind of have this
breakthrough on day four where he kind
of beat Codex with a new record, right?
It's also interesting to see that uh
Claude is much more like progressive in
the way it improved the record and Kim
has really this step function where I
kind of do a breakthrough and so on. Uh
so this is an interesting plot because I
mean six day is quite a lot for anal uh
uh but you you can change this uh axis
by also the number of output token and
then kind of tell a different story
because in max mode consumes so much
more token than codeex and Kimmy and you
also see that Kimmy is actually super
efficient uh for the number of token
that uh it uses. So it's schemic K2.7
code. Um so yeah uh we also see that
they have a different approach to uh
using the literature and papers. Um so
for instance like code is doing a lot of
search on papers and actually include
found a paper that no other model found
and it actually lead to the best record.
So it's kind of funny and uh yeah um one
of the main issue of all of this is that
uh when I when I launched this this
agent and I think that's something
important that I want you to to kind of
uh remember for this co this talk is
that when I launched this these
different agents I was expecting them to
come up with some crazy ideas on
optimizer that's like no one of discover
but honestly it wasn't the case. Uh they
did some clever trick where basically
they combine different papers. uh they
kind of do plus one improvement over a
bunch of method but there was really
like no novel optimizer or mechanism
that was uh coming from those model and
I think that's kind of telling that even
on something that is not simple but I'd
say that it's kind of accessible for
people right for like human researcher
uh spending like days and weeks for the
the model like cannot like find new uh
optimizer and mechanism. So we believe
that there is a way to basically make it
more um make it better for discovery
instead of evaluation. And this is
coming from uh this is very inspired
from alpha evolve by Google and also a
bunch of papers that have been released
since then. It's kind of this multi-
aent system that interact together uh
bunch of generator. You have closed
model but you also have open source
model here that are super effective for
the cost right. Uh they can suggest
ideas then you run the speedrun so you
get the reward then you have a judge
that basically give a quality feedback
can also be like the judge also have
this taste. you can kind of have like
the judge have a taste about the the
method if it's good or not. Uh if it's
outside the loop and then you can uh
basically decide which method you want
to scale to a larger number of
parameters and number of token. Um so
this is kind of the scale part of the
speedrun because some a lot of method in
the the speedrun community uh people are
often saying that they doesn't work at
large scale. So I think it's very
important to also put scale elements in
this loop. Uh and I think also that uh
human are super useful here to basically
judge the ID of agents kind of steer
them in the right direction and so on.
Um yeah so we didn't try it yet I mean
we are kind of trying it right now and
uh we hope that this will lead to to to
new discovery in AI research at least
and also a way is that you can define
multiple speedrun so this is the next
slide if you like it's from safe bank
slides but if you if you don't have the
reference good for you means that that
you're not too online uh but the idea is
that uh by changing the object objective
and the constraints of the speedrun you
can basically create a lot of diversity
and constrain the model to go into a
certain direction and uh yeah and make
those discovery.
So uh at hint we are doing a bunch of
stuff in this direction. Uh there is a
bunch of stuff here that we I mean most
of it we didn't release yet but we are
working on GPU sandboxing to allow model
to iterate into sandbox because you need
GPU sandbox for this kind of stuff. We
are working on our own agents that are
very efficient for like
framework. So it means like you have a
five system and you can write
information read from it. Uh and you
also do like this programmatic tool
coding thing. We also training a model
to be good at it on top of like open
source model. And uh the thing that we
already released is that we have the set
of liber and product called verifier
primar training where you can basically
train evaluate any environments on any
RS and the model that you can train can
be like GNM 5.2 too which is very big
and and yeah we have like we work a lot
on making those li very efficient to to
ship the best quality for for our
clients. Yeah. Uh I mean yeah super
excited about this domain. Once again I
think it's super important to have uh a
part of like this recursive
self-improvement to happen in the open
because there is actually a lot of
people working that are not on big labs.
So you need to basically uh yeah make it
easy for people to understand all those
model work to do research and so on. So
that's kind of our goal and uh yeah,
thanks a lot
and I'm a software engineing Tech League
at Meta working on building a training
and inference infrastructure for the
meta super tangent lab and their
infrastructure organization.
Today we're going to be talking about
productions for aentic systems.
When most people hear the word
valuation, they think about benchmarks.
A model scores 90% on a benchmark. A new
version scores 92%. The team celebrates.
But agent systems have fundamentally
changed what the evaluation means. Today
the systems don't simply generate
answers. They plan, they call tools,
they retrieve information. They execute
workflows. They interact with the
production infrastructure. The question
is no longer did the model generate the
right answer. The question is did the
system behave correctly. Today I would
like to discuss how evaluation is
evolving from model benchmarking into
production infrastructure.
This is the problem almost every AI
organization is encountering today.
Offline benchmarks continue improving.
Yet production reliability often remains
unpredictable. Why is that? Because
benchmarks measure model capability.
Production measures system behavior. A
benchmark doesn't capture tool failure,
API outage, context changes, user
variability, longunning workflows. And
as systems become more autonomous, the
gap between the benchmark performance
and production performance grows. The
result is what many teams experience
today. High benchmark scores as you can
see, but unreliable production behavior.
Traditional evaluation focus on outputs.
But we should ask the question, did the
model produce a correct answer? Agentic
systems force us to ask a different
question. Did the system behave
correctly? Behavior includes planning
quality, tool usage, execution, workflow
execution, recovery from failures,
decision making. In other words, we are
moving from evaluating answers to
evaluating workflows. And that requires
fundamentally different evaluation
architectures.
Many teams still think hallucinations
are the primary AI failure modes. In
production, they are often just one
category. Agentic systems introduce an
entire hierarchy of failure modes. At
the very foundation the memory failures,
retrieable failures, safety failures. As
you go up, you have to think about
reasoning mistakes, poor planning,
incorrect tool execution. At the highest
layer, you have to think about multi-
aent coordination failures. And this is
why evaluating only model output misses
the most production risks we observe.
One of the most useful mindset shifts is
to stop thinking like researchers and
start thinking like a SR or a production
engineer. S SR don't measure success
using accuracy. They measure
reliability, availability, latency, cost
recovery and agentic systems require the
same approach. The goal is not
maximizing the benchmark scores. The
goal is to maximize dependable outcomes.
Reliability becomes the northstar
metric. Accuracy becomes the only input.
In this pyramid is how I think
personally think about modern AI
evaluation systems. At the bottom you
can see there are benchmarks. They're
useful. They're scalable. They're
reputable. But the operational value is
limited. In the middle there scenario
based valuations. These simulate
realistic workflows. And at the very top
you see production telemetry. This is
where the highest value valuation
signals come from. The surprising
insight is that the most evaluation data
often comes from real users interacting
with real systems.
Now let's talk about offline vals. So
offline evaluation still matters but the
methodology changes. Instead of
evaluating prompts we evaluate
scenarios. For example, a customer
support workflow, a code generation
workflow, a research workflow. The agent
operates inside the simulated
environment. We measure the task
completion rate, tool correctness,
planning quality, resource usage which
is which becomes exponentially high at
high scale. The key takeaway 18
evaluation should be scenario driven not
prom driven.
Once a system reaches production, every
interaction becomes a signal. This is
one of the biggest shifts in evaluation
thinking. Production traffic is no
longer just traffic. It becomes
evaluation data. We collect execution
traces, user outcomes, escalations,
failures, feedback signals. Production
is the largest and the most
representative validation data any
organization will ever have.
Many organizations view humans as
fallback systems. I think that's a wrong
framing. Humans are the evaluators. They
provide signals that automated systems
cannot. They assess correctness, trust,
usefulness, safety. These signals become
really critical for calibrating
evaluation pipelines and identifying
blind spots in automated metrics. The
most successful systems combine
automated valuation with targeted human
review.
Now, agent systems drift constantly.
Model changes. We have a new version
every couple of weeks or months. The
prompts can change. Tools can change.
User behavior can change. The challenge
is that no longer a single change appear
catastrophic. Reliability slowly
degrades. Success rate declines.
Escalation increases. Tool failure
rises. Without continuous evaluation,
teams often don't discover drift until
users complain. Continuous monitoring
becomes essential.
Observability
and evaluation are inseparable.
Inseparable. To evaluate an agent, we
need visibility into the reasoning
paths. The tool calls, the memory
access, execution timelines, the
straight transitions. As you can see
here in this chart, traditional logs are
not sufficient. We need detailed traces
just like with any
deep nested microser architecture for
any application or service. We're
talking about agent traces become the
equivalent of distributed tracing for
autonomous workloads. Without
observability, evaluation becomes the
guesswork.
Now let's talk about the continuous
evaluation loop because evaluation is an
always running service not a testing
phase.
Historically evaluation always happened
before deployment but now evaluation
continues after deployment. Telemetry
identifies issues as you can see in a
human reviews the edge cases. Feedback
improves the data sets. Offline
scenarios validate updates. The loop
never stops. Evaluation is no longer
just a phase. It's an operational
capability.
Now, this is probably the most important
slide in this presentation. Every metric
shown here maps to a business outcome.
Task complete.
>> Okay, I think we're live and welcome
back for those on the stream and those
those in person. um we take tend to
basically take these longer sessions
between uh all the sort of mainstage
keynotes to reflect on things that um
you know are particularly important but
like don't have like a significant like
a sort of launch moment. Today we're
very lucky to have people working on
Omni and Vo Nano Banana like the you
know the world's best generative models
here with us. Uh Demetrio I I I first
saw you when you were posting about your
office
Um, I think you're you're probably
number one uh Google Google's number one
office influencer at least in in San
Francisco. I think you like you like to
bike as well. You like to take photos of
bike here.
>> Yeah. Um, but you know, but also you
work on video models.
>> That's right.
>> Um, Shane, I I met you I think at like a
dinner.
>> Yeah.
>> Um, and uh and uh and I I remember you
were trying to get me invested in like
one of the companies. I forget forget
which one. H forget about that.
>> But now, but now you're um now you're
working in Omni Thinking um and and just
you know a bunch of other
>> Gemini RL.
>> Yeah. Yeah. Uh and Nicole also uh the
rest of the gen media models u nano
banana and uh all and everything you
just launched actually even this week.
Uh
>> yeah, we launched some APIs.
>> Yeah. Yeah. Yeah.
>> And I haven't tried to convince you to
invest in anything but maybe I should. I
mean, so I try not to be an investor.
People just convince me anyway. I'm like
just okay, well, I'm not that rich, but
you know, like you can't not try to
invest in some of these things. And you
know, for those of us who are not
working at a Frontier Lab, this is the
best this closest that we'll ever get.
Um, so yeah, actually, let's kind of
recap since you're closest to it and we
just did it like what was launched this
week. What should people go try out?
>> Yeah. Um, so yesterday we had two launch
moments. U one of them we launched
Nanobanana 2 light uh which is our
fastest, cheapest um image model in the
nanobanana model family. Um and it's
better than the original Nano Banana. Um
so really for most people um that model
replaces what you you know used and love
the original Nano Banana for across like
generation and editing and it gets
really close to the frontier quality of
of the kind of mainland bigger models.
So that that's really exciting. I think
if you look at some of the demos or like
things that people have been trying like
getting kind of that like 3-second
latency just unlocks a whole bunch of
things that you can do with like
ideation and iteration and it's just
really fun and the models getting to a
point where like the quality is really
good um where um it you know you can use
it for iteration but you can also use
some of those outputs as just kind of
like ready um production output. So
that's really exciting. Um and then
second launch we finally um launched the
Gemini Omni Flash APIs um that we
pre-announced at IO. So, thank you for
waiting. Um, and that, you know, is the
first time that we're making the APIs
available for developers and it's
basically really exciting kind of video
generation and editing and we're pricing
it the same as Y31 fast. So, we're
getting you kind of like really really
good quality for a really awesome price
hopefully. Um,
>> yeah, I mean that that's incredible. I'm
actually really So, when you guys
launched Omni for the first time, you
also did a podcast uh with Logan who
couldn't be here today. uh and you added
like a sloth uh and and ramen and all
these all these things. I actually
really want to do that to our videos. I
just didn't have an API for it because
obviously I have to automate the whole
thing. So, thank you for the API.
>> Uh that is my favorite use case.
Everybody should do that. Um I got a cat
which was probably like the most boring
of the animals. Um if you don't know
what we're talking about, you should
look it up. It's very funny. Feurer. Um
Furer, who's um you know on on the team
did that.
>> Fur is the number one guy you should
follow. You should follow get ideas on
okay, what can this thing do?
>> Yes. Right.
>> Yes. He he's he's amazing at that.
>> I've tried to get him for the last two
years to come to AI. He hasn't made it
yet. He's actually come in person. He
just didn't want to speak because he's
anonymous.
>> I know.
>> I I want to say his real name, but I
can't say his real name.
>> No, no, no. We won't we won't do that to
him, but you should really follow him.
He's amazing.
>> He did all that work.
>> I actually met him uh in the office uh
when we did the podcast, I think, and I
didn't realize it was him. So, his badge
doesn't say Poper. It says
>> Yeah, I know. So he used to be part of
uh Replicate and Replicate had this joke
where like everyone was Deep Fates. Deep
Fates is this like kind of mysterious
character. Replicate. Replicate is very
cool company and was part of it. Um so
okay, one thing I want to get on there
before I go into like sort of the the
the sort of omniprop is we added cats,
we added sloths. Very cool, very cute,
very fun. uh what are the you know
inspire people as to like what are the
more sort of workhorse use cases that
maybe are not just demos you know
>> yeah so so obviously the hero capability
of the model or maybe there's two like
one is the ability to kind of take in
anything as input and then get video on
the other side obviously in the future
and and we've kind of talked about this
as a pre-announce like we want to get
the other output modalities out as well
but basically what that means is you
know you can take a set of images that
you have as maybe a storyboard you can
take like an audio track as a reference
of you know like a voice that you want a
character to speak and then you can get
a video on the other side. So like that
just unlocks a whole bunch of things
that you can do in like you know short
film production or you know shorts we've
launched on YouTube as well um to help
creators kind of like create um content
more easily. Um and then the other one
is obviously video editing like that's
another thing that we're really excited
about that we're just making easier
because now you can use natural language
to take a video you know add something
remove something. Sloth is obviously
like fun example. Um, but there there's
obviously kind of there's consumer use
cases that we kind of had in mind where
you know you could take your beach
vacation video that was too noisy and
you want to clean up that noise. Maybe
in the past you wouldn't have because
you didn't have the tools or you didn't
know what the tools were that you needed
to go to. So that's one use case that
you can, you know, go to. We've seen a
lot of folks use it for kind of
marketing ad campaign creation and I'm
excited to see more of those use cases
as we launch the APIs. um because
obviously like we don't we don't see all
of it in the first party products but
I'm really excited for people to start
to explore that um in the API. So those
are just some of the kind of like highle
um things that have come up. U people
also use it to create like education
materials. Yes. Um and like like that's
really exciting. I think we're all we've
all kind of talked about being excited
about the future of education where like
everything can be kind of customized to
you and personalized to your knowledge
level and the style that you prefer and
and so this is kind of just like a step
in that direction.
>> Yeah. I I I've sort of actually used
just Nana yesterday with my my parents
are visiting and there was there was a
very fun sort of use case that I bought
some gadget off Amazon that they wanted
and the instructions to use it was were
only in English and there was plenty of
diagrams or whatever and I took a
picture of it and said you know
translate this into Romanian. Yes.
>> And keep everything else the same,
right? So it was amazing, right? Like it
was just like yeah it looks identical
and it has you know it's perfectly
translated. I mean more or less, right?
But it's it's you know using Gemini
under the hood obviously to kind of do
the translation. So you can you can see
this use case for video as well right
like the the power of text rendering in
in in Omni is is quite next level. So
and you could you could you could think
about plenty of use cases of like both
text rendering translation internal
channelization all sorts of things that
would be actually genuinely useful to a
lot of different people and sort of
broader access to either you could like
redub a video or whatever it is that you
wanted to do. like there's plenty of
different things that you could you
could think about doing.
>> Yeah. Um one of the most enlightening
conversations I have on my podcast is
with uh this people researchers at the
frontier of these things. Um I had one
with um Ethan from the XAI video team,
the Grock video team who was basically
saying like you know the next trend is
actually not just like single model,
it's more like video agents. Mhm.
>> Um, and I don't know if that terminology
resonates uh obviously for for very
relevant for RL. Uh, but it was it was
basically kind of like giving up on like
trying to do everything in in
effectively one pass. Um, do you feel
that same way or is it still an open
research question which way the trends
are going?
>> Yeah. So um what kind of excite me most
is really when the symbolic kind of
foundational models and this kind of
like video foundational model can
actually kind of really work together
and u in a way the if you look at the
beginning of the generative sort of like
image generation video generation a lot
of it kind of started when the language
model got good enough to provide a very
detailed captioning like from stable
diffusion days or kind of dowi 2 days.
So um so basically like language is
extremely u helpful representation uh
one is that it's kind of universal but
the other kind of more um technical
thing like kind of my hypothesis is like
um one very difficult thing about
machine learning is um this sort of like
spirious coordination so you don't know
you know if the if this kind of feature
right that's kind of predictive is
actually causal factor or not there are
two ways one is we can have really
diverse data training data like from
every intervention of the causal graph.
The other is you condition the coal
information and conditioning a language
is kind of like conditioning like a coal
information of the of the kind of world.
So um
>> which is a prompt or a concept or what?
>> Yeah, exactly. So if you look at like
you know how you going to describe this
video, how you going to this kind of
image is actually very close to you know
how would this kind of causality you
know behind this like how this is kind
of generated. So one is like that can
really allow for very rich
generalization and then uh very kind of
just like a good model. Um the other is
so eight months ago uh we put the
evaluation paper called video models
zero shot learners and reasoners.
>> Yes. So that was a kind of you know it's
it's a confirmed paper and then later on
actually the N banana team follow up
with the vision banana paper that
basically used a banana to do but
essentially the idea is uh video model
is extremely good sort of a foundation
model for space and time kind of
information. So um classic computer
vision tasks a lot of could be kind of
zero shorted and when you like say
feeding some like a visual quiz uh it
can you know there's definitely like a
lot to improve it can kind of solve and
it can um like robotics kind of like
seeing it has really good kind of
physical intuitions like word model uh
and I think the the key is really the
kind of mix of the visual kind of
reasoning and then the text kind of
reasoning kind of all tied together Um
obviously you know like whether doing it
you know as kind of unified model versus
like this kind of agent or exploration I
think that's more like uh it's going to
be more kind of incremental you know how
it's going to I imagine everything's
going to go into like a single model
eventually
>> but right now there's like a lot you can
do if you uh basically take like really
good video understanding image
understanding Gemini agentically with
anomony and that's actually going to
yeah our team is like exploring a lot
>> yeah okay that there's a there's a lot
in there um I I think uh one question I
I am increasingly starting to wonder is
does it all trend towards one product
for you guys right like now you have
multiple models out the naming of omni
does imply that eventually everything
will go away and it just goes into omnis
um is that the plan
>> is it I don't know I I think I think uh
maybe I mean I think eventually I I
think there's sort of different
trade-offs engineering research product
trade-offs in like it's like for the
same reason like the the sorry how is it
called nanobanana light I don't know
what the product name is
>> nano banana too light
>> nano banana too light yeah right it's
it's it's it serves a particular niche
right and it probably doesn't
necessarily fit immediately in the same
model literally checkpoint as uh
something that can do 4K you know uh 30
second videos right like they're
probably not like trainable in the same
quite way, right? Like, so I I don't
know. It depends on how how far into the
future you look like. Sure, in five
years from now, will they all be the
same model? Probably. Uh, but like, you
know, six months from now, we'll we'll
probably still have, you know, multiple
different models doing different things
because kind of from pragmatically the
trade-offs are such that we we should
have multiple different kinds of models.
>> Yeah, I
>> I think that's right. And and just on
that note, I mean, we did call it Gemini
Omni because we wanted to hint at the
future where Gemini just becomes fully
multimodal in and out, right? And so, so
it's definitely a move in that
direction. I think we'll probably see a
move in the direction where Omni also
generates images and edits images and
all those kinds of things. But Doo is
right that I think on the way there,
there's a bunch of really really useful
applications of some of these more
specialized models. And so we we will
probably continue to work on those as
well because like that serves a certain
need at this point in time that may not
exist you know a year from now. There's
also like a research question about like
just how much transfer there is between
different kinds of modalities, right? I
think you may believe that there's some
transfer between coding and video
generation and I think most people don't
necessarily believe that but they you
know you could try to think that there
is some some there or it could be a
waste right to put them together to try
to learn these both tasks at the same
time right so I think it's it's it's
interesting sort of question to which
extent like image and video obviously
kind of there's some transfer like kind
of not that different there's value in
in learning to output video and audio at
the same time because joint
audiovisisual is you know that's how
that's how it is. Um and then there's
you know other kind of intersections of
modalities that are not super obvious
right like 3D representation and coding
I don't know maybe uh things like that
right so like I think it's worth sort of
exploring the different corners there
and we are actively doing that um with a
focus towards like what people actually
want to do with these models
>> yeah um what one thing I feel I feel
like uh I'm surprised by but also I feel
like it's insufficiently answered is
what is the correct intermediate
representation Um, so captioning, right?
XI does captioning. Omni does
captioning. Um, and I I I understand how
captioning works for images. Um, and I
understand that you can extend it into
to video and and sort of guide it across
time. It just feels very inefficient. It
there's got to be I feel like there
should be something better. Uh maybe
it's code and maybe we generate you know
and obviously I think a lot of um ffmpeg
and mapplot um what's the three blue one
brown one manim um a lot of like video
is generated through code and maybe
that's like the optimal representation
uh any hypothesis as to like is is it
better or is just English all you need
>> well as so I'm in the Gemini and you
know we do like a lot of RL agent and of
course kind of coding so yeah We we're
definitely exploring the coding
representations.
>> Yeah.
>> As kind of better kind of way to
represent. Yeah.
>> But you know like do you what's your
probability estimate on like we just
output binaries like we just you know
like just it's just ones and zeros.
>> Um I I guess maybe a kind of similar
discussion was like um basically is the
language the right representation like
right. So uh one kind of question for
example uh professor you know like ask
is like you know why why does the
channel of thought need to be in the
natural language?
>> Yes.
>> Can it just be the kind of any kind of
like continuous tokens just any amount
of you know additional computations. Um
so one is like obviously the test like
adaptive compute is going to give like
you know better results. So it's sad but
what really kind of made CH thought so
you know like four years ago I wrote you
know the larger model Z reasoner and
then self-improvement. So I kind of know
from very early day but the reason like
it works really well is um right now the
recipe that works is the pre-training
that scales a lot and then that
basically like learns a lot of
intelligence. there are a lot of you
know scaling RL but those are still like
extremely kind of comput intensive to
extract the information and um you
really want to rely the intelligence on
that so basically by tying the sort of
like a reasoning in the natural language
you basically directly use the
intelligence of the pre-training to it
while if you remove that kind of
constraints then you're not um and these
days uh I feel the a lot of advancements
in the texts but also in kind of
multimodal space is really driven by
this u kind of text as a kind of great
uh sort of representation.
>> Yeah, it's a good backbone.
>> Yeah,
>> I think to me it's even simpler than
that. It's text is is how we
communicate. So I think fundamentally if
you're building kind of products that
humans will be interfacing with um like
like that we will be using text somehow
if it's a text interface, right? Not not
for everything. So I think it's it's
natural to default to that. Yeah,
obviously there's like a confus
discussion. You know, some arrow like RO
maximalist is like, oh, we don't care
about, you know, kind of channel those
kind of like stuff. It's just just
additional compute.
>> Sure.
>> But I personally Yeah.
>> RL maximalists. I wonder I wonder who
who qualifies in that description. David
Silver.
>> Ah, okay. Yeah. I mean, they they've
just left to to start their thing. Um,
interesting. Okay. So, uh I I mean I I
think I'm very interested in just like
better representations because I think
that's one of our themes that we're
curating today uh at the Worlds Fair is
world models. You mentioned the word
world models, but it's not something
that's like super well- definfined. I
think everyone's like sort of converging
on some version of it that it's like the
ideal.
>> Sure. Everything is a world model now.
It's a sort of
>> it's not it's not that useful, right?
>> So, I just gave a keynote at the i
Clear's world model workshop. Yeah. And
then uh yeah essentially uh I definitely
encourage to check out the definition by
Jatendra Matalik. He's like the you know
OG computer vision professor UC
Berkeley. uh he has pretty you know bit
of word to say about world model but
also kind of shimmburers kind of how he
defined the world model from 2019 like
1990 sort of uh uh you know like Wayne
was just basically just that kind of
model base uh for me the word model is
basically just a model in the model
based RL and I feel that has sufficient
to describe but obviously you know there
are like a lot of uh fay had a kind of
nice blog post about what about yeah
this kind of broken down
>> um but yeah
>> yeah I mean so you
I I'll I'll end this part of the
conversation, but like I I do think that
language to me relying on language as
like the sort of like the narrow pipe
through which everything goes through.
Um still is like a lossy compression.
>> No, no, no. But we're not seeing that,
right? We're basically saying the video
model and the language together.
>> So, so I think the language alone is uh
not sufficient. That's why we feel like
the video is a very comprehening
kind of pretty videos but I think our
vision it's it's much more than that.
It's a missing foundational model that's
absolutely required if you want to make
the AGI that match the humans not just
the jacked one.
>> Yeah. Um okay so one one other thing you
know you you mentioned on the vision
side um and I'm kind of curious how sort
of uh parallel you know in terms of your
research careers um this development is
like I think basically a lot of vision
people have crossed over into more model
people um a lot of vision people also
become generative video and image people
and is it just as simple as you know
reversing uh image to text and then now
it's text to image like is
is that if I mean that effectively was
the diffusion process. Um
I I just I you know I I just see the
career paths of the people that I talk
to and and see and I I I see this
overall trend of research directions and
I just wanted you to guys to sort of
reflect on on that.
>> I mean I certainly went that way right I
started long time ago uh doing computer
vision sort of object detection
recognition things like that. Uh I think
just that's just simpler problem right
just generation is just harder like it's
a it's a different kind of mapping right
you map from the the inverse mapping is
not as simple as just inverting the the
kind of rotations right it's it's a it's
it's more ambiguous right to go from cat
to image of a cat and in some ways it's
also a loop because your vision work
creates the synthetic labels that then
continues
>> I mean sure I don't know I don't know
I'm trying to validate my my sort of
theories about how fields develop how
how careers progress through this
>> I mean for like the the the better the
understanding side gets like we have
seen that the generation side also gets
better right so like like
>> it's completely bootstrapping yeah it's
>> and so so like like like there's
definitely they're there to that thesis
and I think yeah I think a lot of people
have kind of like I I definitely worked
with a lot of um image understanding
people who became image generation
people you know and then some of them
have moved on to video because it's kind
of like the next thing where you have so
many more dimensions to work with so
yeah I'm curious about you specific as
your
>> so I definitely like recommend start
with understanding recognition because
that's basically discriminator and then
that's going to lead to better
generation and that's what the bridge is
basically reinforcement learning so my
um my kind of journey is I initially
kind of worked on the algorithmic
research in the gent model against some
like you know eminent kind of generation
and then I worked on like RL and
robotics um and then like six years ago
I was like leading like a moonshot on
the dexterity it was pretty early but I
see now everyone's kind of doing
uh four years ago I basically kind of
figured out that this like symbolic AGI
is going to accelerate much faster than
the kind of physical AGI kind of
counterpart. So uh I decided to kind of
like language models and then those
things. Um and then recently kind of
work with Doomi and then like omni team
I quite enjoy kind of collaboration
there. the what I quite enjoy uh what I
recommend definitely to the researcher
is to uh definitely kind of explore or
at least like get exposure to what the
top people in each of the community are
like looking at how they kind of think
about problems. So when I look at the
video model to me it kind of reminds me
like pretty early on sort of like
language model where like very early
language model was a kind of creative
sort of demo right you kind of like try
to write like a story like novel and
then like you know GBD2 and then those
kind of days like L stem kind of days
right and then you know uh instruction
tuning you actually kind of make it
usable as a chatbot but then at the
chatbot stage it still had so much
hallucinations and instruction for
wasn't good enough so it couldn't use
for reasoning and when it got good
enough um in pre-training and post-
trainining for reasoning then you know
this kind of test time scaling the RL
really took off to like many of the kind
of best performing models and right now
I think the video model is as we
mentioned it's it is a complimentary
foundational model and I can imagine
it's going to follow a similar path it's
going to be very uh it's going to
improve a lot instruction following a
lot of uh this it's going to improve a
lot in reducing coordinations to extend
that it become a very reliable world
model so we can kind of like intermixed
video like space-time simulation with a
text simulation to solve like arbitrary
AI problems. Also like I think the
difference still is between sort of text
models and like image video models is
that like we haven't quite unified
understanding and generation in in
multimedia I'd say yet like I mean I
think I think without going to the
details of course there's like it
depends on on at which level you're
thinking about this but generally like
there's not that many as far as I know
models sort kind of you know printier
models that are genuinely
kind of good at both understanding and
generation of of let's videos, right?
Like it's a it's a it's an interesting
challenge. I'm not saying that we should
do this. Uh but but I think uh it kind
of stands to reason that like you know
understanding and generation are two
sides of the same coin. So they they
kind of should be in the same model in
some ways. Uh but we don't necessarily
always do that. So yeah. Uh you
mentioned audio as well, right? Yeah. Uh
is that as hard as video or
qualitatively different? If if so, in
what way? Uh one of the interesting
directions three years ago was people
using um I guess diffusion to do audio
uh as in like the the sort of refusion
approach. I don't know if you you guys
saw that. Um and I just think it's like
very interesting if a modality that we
perceive which is audio is different
than video actually two machines is
exactly the same like there's they see
no difference.
I mean, I think on a technical level
there are some differences, but I think
they're like relatively minor. I think
from my perspective, audio came into
into my life when we shipped V3, which
was, I believe, the first model that did
like a joint
>> with the slicing of the
>> Yeah. Yeah. Gold bars or whatever. Um,
it it was the first model that is sort
of joint audiovisisual generation. Yes.
uh like in a in a I mean there are there
were other models that did kind of you
know kind of kind of agentic hacking
under the hood but this one was truly
sort of you know generating everything
at once and we the reason we did that is
because we felt and I think it was the
right choice we felt that like uh it
only makes sense to generate them at the
same time because there sort of kind of
like from a machine learning perspective
there's one latent kind of you know
causal kind of you know generative
process right like there's something
that generates you speaking it's not the
pixels and then the the audio or somehow
somehow generated by some other process
like the lips have to move in sync with
with the with the audio, right? So, I
think that that solved a lot of the
issues that previous models had or the
way that people did video generation
before where it was like, okay, we
generate pixels and then we're going to
hack something on top of it that like
moves the lips with the audio that we
generate and that's was very bad.
And so I think I think that was that's
to me that's the the I mean after V3
like you know people were like what do
you mean like there's no audio in your
model? like that makes no sense like
once it's there like you you have to
have it. So I think that was that was
the right choice and doing it in one
single generative model I think was was
the right choice.
>> One thing I kind of want to also kind of
ask you guys an opinion as well once one
difference I find the audio and then
against the image and video is like the
audio information is less verbalized. I
mean of course the TTS and stuff is
trivial right but the when you get her
outside like how to describe music how
do you describe this like this person's
tone kind of pitch I feel the sort of
the verbalization is in insufficient and
the interesting thing is that you kind
of see that in two other things like
taste taste sense and also uh say um
touch
>> like smell and then the another
interesting thing is the skin color so
skin cutter the the language is pretty
limited to describe the skin color and
the reason is that we're extremely uh
sensitive to the small difference
perturbations of a skin color because
that basically shows us is this person
going to kill me or is can I befriend
this person kind of those kind of
information and then I feel the smell
tastes um skin color and like sound kind
of stuff is very very tied into
primitive it would like survival kind of
stuff and so our sort of sensory system
is so sensitive that it's intractable to
um so for example I asked like one the
wine sort of taster and like
professional and then he basically said
he kind of use like a language from like
a dating you know describing like a you
know partner as a way to describe the
taste because there's no sufficient
vocab to describe um so I'm kind of
curious yeah do you guys feel that
>> I think well to some extent I think the
same is true for visual information
right when you think about like a
certain style or a certain aesthetic,
right? Like like there are some people
who just have a much more kind of
developed like whether it's palette or
kind of visual taste and aesthetic,
right? Like I I think
>> language just tends to be a bit of a
limiting factor when you are trying to
describe any of these things that like
we experience with sensory information.
And to your point earlier, I think that
is the kind of the reason why we are
investing in world models and why we are
pushing on kind of the like perception
and like generation side of things
because it it is such a large part of
how we as humans navigate the world.
It's a large part of how like embodied
AI navigates the world. Um, and and I do
I do think language like does have a lot
of it's it's gotten us very far and it
can probably get us really far, but it
it feels limiting in a lot of these kind
of areas. And yeah, I don't I don't
really know how to describe, you know,
like sense and taste. Um, but yeah, I'm
curious to me.
>> Uh, I I yeah, I don't know that I have
thought that deeply about this yet. So,
uh, yeah, I mean, yeah, I don't have a
good answer about audio. I mean like I
don't know the limit because I'm
thinking about like well what is what is
Omni bad at in terms of audio but
they're all like solvable problems I
find uh so like with more data or better
data or whatever it is so I don't know
like that we have pushed the frontier so
much that like we are have hit some sort
of limits that are rooted in
evolutionary uh kind of you know limits
imposed by humans I don't know he's
feeling the limits of captioning which
is the the thing I was
>> yeah exactly there there's There's a lot
of information in the world and it
connects to basically why we do world
modeling.
>> You mentioned
>> you just need SRFS sref76
and then that's your what does right? I
guess maybe I can't describe this vibe
but
>> well well I think that that's kind of
the point of providing some of these
references right because because like
even just describing how someone talks
and like their tone and and like procity
and all of these things like I think I
think some of these terms even like I
didn't used to know what they mean right
well now yes disluencies
>> ex exact like like there there's kind of
an entire vocabulary that even if you're
not kind of steeped in a domain which is
true for actually like most human
domains that like you don't even know
what it means means um and sometimes
it's also a question of like if we
haven't focused on those things you know
with the large language models that they
may also have gaps in those areas right
and then we feel them on the other side
with generation because we're like
fundamentally relying on on the language
models understanding of the world to
then be able to like represent it um so
I think yeah it all kind of goes back to
your question about like the the
language as an intermediary but yeah I
think to do like some of these might
just be like focus areas and things that
we haven't necessarily pushed on as much
as we can and like as Well, we will
discover what the actual ceiling is.
>> Yeah, as a podcaster, I think a lot
about sound.
>> Um, and and I I'll just offer a couple
things for discussion in case in case it
triggers anything with you guys. Um, I
have three domains of rough audio, which
is like music, voice, SFX, you know, is
that rough? Okay. Covers everything. And
then also even within voice, let's just
let's just focus on voice. Forget the
other two. um room sound like the the
echoiness of like big room, small room,
in person, in a car, over a phone, all
these like are labelable, but we
experience them very differently. And I
I often think like one of the tells of
the AI video is that it is studio
quality because it was recorded in a
studio because that's your training
data. And like and and to me that's one
thing actually like the most interesting
thing is just uh when I tell this is how
I convince people who are kind of
skeptical about the need for world
models because you need it even for
audio about well I'm further away from
you so I should sound a little bit
softer or more diffused and like the the
video models need to pick that up
because if they're going to do immersive
video and audio you need that. I I I
love that example of basically like
studio quality or not in a way like we
don't have enough language to really
describe like like this kind of echoing
or like some kind of noise kind of
happening. we just like don't have
precise enough and uh if you um you know
basically the reason that I think it's
quite important to have like relatively
information rich like kind of captioning
is that we kind of rely on the natural
language as a representation but if you
basically don't have enough uh
representation that basically means the
condition on the language the generation
is very multimodal and if you anything
can learn from the BAE kind like you
know very old you know BA kind of
research the idea is we really want to
capture most of the stoasticity in the
later representation And then the the X
given the Z should be kind of like
deterministic. So
>> yeah. Yeah. Um well I hope I hope
there's more uh progress there and I'm
sure you guys are doing
>> even actually like facial expressions,
right? And maybe this gets to your point
about like things that we're very
sensitive to, right? I think you can
tell a lot of AI content also just by
from like people's facial expressions.
>> Yes. and we try not to contribute to it,
but you know, um, and or or like skin
textures, right? Like like the things
that kind of make things look real in
real life. Like I, you know, I can tell
from the way you're nodding or from the
way like your micro expressions are kind
of changing of like how you're reacting
to what I'm saying. Like we haven't
quite crossed that chasm, I think like
we're we're so much better than we were
a year ago.
>> Yeah. Um, but there's so much more
headroom kind of in a lot of those
things that like we as humans are super
sensitive to. And like I think image
arguably probably is there because
there's there's a lot of kind of images
that I will see that like really do look
indistinguishable from reality and I
can't tell if they're generated or not.
>> They're better than reality.
>> Um, or well that's a different
>> No, I I think that one of the parents
>> better than what I would take on my
vacation as a photo. Yes. One of the one
of the fun experiments that we did a
while ago in the team is is like can we
generate videos that are better than
than real videos, right? So you just
take the same caption from like oh yeah
some video and then
>> try it. Yeah. Just just try to like
describe a real video and then generate
the equivalent version with omni and
then do a human eval how does it do and
then humans largely prefer AI generated
>> video
margin
>> but because it's because it's the RL
process. That's the process working.
>> It's however you want to rationalize it.
It's not necessarily the old process.
It's just like I think it's just
>> I'm not saying this is a good result.
I'm just saying is we have optimized in
a way that like kind of potentially sort
of you know triggers something in the
human brain that like oh it it looks it
looks all a lot of the videos just look
look better like I'm not on on
inspection on on deeper inspection they
they would not actually be more useful
or whatever but like if you just say
side by side random YouTube video versus
generated version of it will you will
just have a it will just look better
because it's more it's a sharper or HDR.
Uh, you know, the skin tone is is is
better. It's not, again, it's not more
realistic.
>> Uh, it doesn't solve your problem
necessarily, but it it looks better.
>> I I since also depend on the sensitivity
of the people. Uh, I was born raised in
Japan and I think one thing I kind of
know is like they're extremely extremely
like sensitive about like, you know,
that's why, you know, like architecture,
like food and stuff like they have. Um,
so I talked to like a manga like like
artist there and he's like he's kind of
disgusted by like the generation AI and
one kind of thing he mentioned is like
the eye gaze. Eye gaze that slight
difference makes me makes him kind of
feel creepy about like unnatural
>> like if you're looking a little bit off.
>> Yeah. It's just uh Yeah. Just like uh it
looks too fake. Yeah. So, so I think it
does depend on the sensitivity and
>> Yeah. Yeah. Yeah. All I'm saying is
like, you know, human preferences are
like not particularly like uh reliable
barometer of like what you should be
optimizing for. Like if you just ask
people, do you like this or not? You're
not necessarily get what you wanted.
>> Yeah. Let me just kind of add one thing
but like four years ago there was a like
debate that if the prompt engineering is
going to disappear and uh my my like you
know some very powerful people say you
know it's going to disappear but I
basically said like it shouldn't because
the prompt engineering like sort of you
know specifying that is like the the
only way you can sort of control the
output sort of you know when you have
like sort of control over the AI and
what allows you to prompt engineer is
really that sensitivity. So sure maybe
like right now the AI can do a lot of
auto prompting and that and it can
generate something that's sufficient but
uh if it's like that never be satisfied
like never be satisfied with the AI's
generated content always fine tune your
sensitivity and always kind of keep
prompting the differences. I I think to
the there's also a big difference
between like the average human untrained
eye which I I would put myself in that
bucket you know like I have I have some
aesthetic sensibilities and I've done
this long enough that you know like I
have I have a preference um but you know
like your example of a manga artist like
that's somebody who has honed a craft
like over possibly many decades um and
anybody who does that whether it's like
design architecture right like you you
you just have a very different level of
like expertise and you see things that
like the average human will not see. But
Doom is right. Like when we look at if
you were to just, you know, um pull 10
people on the street, they would
probably prefer the like overly smooth
like very saturated kind of
>> It's called the Instagram filter.
>> It is. It is. Yeah.
>> And you know, and and so there's also a
little bit of a question of like what
does your default aesthetic look like if
you don't specify? But then to Shane's
point, one of the things we always try
to get these models better at is
instruction follow. So that like when
you want to get them to a different
outcome, like you should be able to,
whether that's through language or
whether that's through your references
because language is sometimes too
limiting. Um, and so like these models
continue to get better at it, but they
so much more.
>> Do do you feel pressure as a as a
product director to set the default for
the world? Like I mean
>> kind of
>> maybe I should I don't know. I haven't
thought about this.
>> You know, you know, it's like someone
has to have a default. Their default has
to exist.
>> Actually, I will say like we have
thought about this. Um, and I I think
one of the So, for example, actually
like if you look at nano banana
generations, we had like an explosion of
nanobanana infographics when nano banana
pro came out.
>> I tried it. Yeah.
>> Um, yeah. Yeah. Yeah. I think Nurb's
papers were like all, you know, so so
many had like infographics generated.
Can you run your uh watermarking on it
and see how many?
>> Uh we probably we probably could. We we
have we haven't done that, but I saw so
like my Twitter was maybe this is just
also like the bias of my algorithm, but
they were everywhere. Um and it was
actually very painful because um I think
our default aesthetic was a little bit
too it was too cluttered. Like I think
that the the model was like a bit of an
overeager student that just like learned
you know it was like oh I know all these
like I know all this information about
this concept. and like shove it into the
same image. Japanese infographics 5x
that
>> or maybe it was you know um but it just
and and
>> wait so same prompt same content if it's
in Japanese it's
>> density density
>> oh wow
>> because that's the style in Japan
>> yeah some like very you know bureaucrat
is a famous word for it yeah
>> no but we do do go through this process
with Omni we did it together right like
where like we had like a bunch of like
we like at the very end okay like this
is we did some tuning and like okay what
kind of style style do we prefer right
like you know
>> is it more muted more saturated
>> we had a lot of saturations
>> yeah there was there was there were I
think Nicole just has PTSD so has
forgotten about it but she was very much
involved in this of like okay which
which kind of color palette do we
basically prefer right and it's you know
it's it's it's not something that like
you have to make a a trade-off there
like uh
>> and and and it's because it ends up
being us right like actually it is true
like it it ends up being the modeling
teams and you could ask the question
legitimately of like are we the best
people to do that or should we actually
work with someone who like has a really
creative point of view and is more of
like you know an art director and like
has like and we kind of go back and
forth on this. Um
>> we have the trusted testers. I'm on
>> we do we have trusted testers who give
us a lot of feedback and we take that
serious
>> very well organized by the way they have
these like weekly calls and stuff like
it's it's amazing.
>> Um Logan's team does a lot of that. So
kudo kudos to kudos to Logan um who
couldn't be here today. Um, and we have
a lot of people actually internally at
Google like fulfur who give us like a
ton of No, no, no. Truly like who give
us a ton of feedback on like when we
when we release new checkpoints and like
sometimes it will be stuff that we like
don't see right like we would be like oh
yeah this optimization seems okay and
then they would come back what have you
done like you completely ruined my grass
you know because now the detail is all
blurry.
>> I think you just noticed not not a super
secret at this point but like that our
model tends to put rings wedding rings
on on on hand. That's yeah
>> very strange. I had never noticed that
but he's like he I just saw it and
there's a faux fur channel basically
>> where he posts I was like why is there a
wedding ring in every hand I'm like this
is strange
>> it sounds very common reward hacking
>> yeah yeah yeah so but you know something
that we would not have we would not have
noticed necessarily while while
developing this right
>> is it an RL artifact or I I don't know
>> you you do have like a lot of preference
based and then you know you may can
prefer that spirious correlation reward
hacking it can happen like in many weird
ways yeah
>> it does it does uh yeah this is related
to another topic that again I I try to
use these mainstage things as
introductions or ties in. Uh we have the
eval track we have character AI and
YouTube talking about how they evaluate
videos. Um how do you evaluate videos
>> apart from furer
>> not everyone has a fauxur but also you
know I think there needs to be something
more quantitative
>> well I mean it's you improve Gemini
to improve the evolution for VR. Yeah.
Um that's that's no no that's that's
definitely one way uh it's actually very
hard.
>> It's very hard. It's very hard um to get
like you know audators to evaluate
things in a video like including
especially things like aesthetics right
like that it's like there are some
things that are a little bit more
objective like especially when we talk
like let's say we talk about images and
we look at like infographics text
rendering that's actually fine right
because like
>> you can kind of OCR things out and then
you can look at like okay this letter is
like messed up and then the whole thing
is actually useless because if like
literally if the letter is off in render
text you just can't use that asset
Right. So th those things are like a
little bit more auto ratable um from
what we found. We do rely a lot on
humans looking at things and so we do do
a lot of human evals. We do a lot of
human evals.
>> Do a lot of human ev and every time Jane
is like um and every time we have a new
model we like want to do more things and
we want to like jam in more capabilities
and then we have like more emails that
we have to run. Um, and then at some
point you do get two models that are
like kind of close to each other and
then like we literally make decisions
based on like looking at output side by
side. Sometimes like in a room like I've
been in rooms where there's like 10 of
us and we're just like looking at videos
side by side and we're like do you
prefer this or do you prefer that? Like
>> wow.
>> I mean but it is it is genuinely very
complicated. the more capabilities you
add like you know even just the one
capability but it's like almost AGI
complete capabilities like video editing
right like think about video editing as
a and like editing with audio and
>> editor will be very happy to hear this
editor
>> I mean it's the hardest problem in g
media
>> I mean I don't know if it's the hardest
but it's definitely there right like uh
in terms of like complexity of of
evaluation like free form video editing
is you can do anything like yes uh and
like
>> I I spent a lot money on that and it's
very hard. Please help me
>> like adding those we don't have like add
a sloth eval, right? Like uh that we
>> Well, now we should.
>> Now we should. Yeah. Yeah. Yeah. But
like things like that like it's it's
it's not that easy to
>> I think I'm just surprised at the sample
size that you have, right? Like to to
test the entire surface of your models,
you still rely on a magnitude of
hundreds.
>> No, no, no, no. So we do like Yeah.
Well, we do we do a ton of human evals
on like on like you know thousands of
things. Um I I think there's also like
an element of you know we can talk about
things like live experiments right like
which which is also where you get signal
on like like some of these more minute
differences at like much larger scale
then there's autoators which is
definitely kind of a more it's a very
well defined space I think for LLMs much
more nent for media models and then like
sometimes you still do rely on human
judgment and we do rely on things like
feedback from people who just like have
a very owned like aesthetic and and
people who just like use these models in
their workflows day-to-day, right?
Because we could also like you could
have a model that does really well on
some slice of human evals, but then it
like really breaks a workflow for
somebody. And so this is why we do like
early access programs and we try to get
feedback and then we like try to
incorporate it before we release
something more broadly. I feel like
Shane had a hot take based on his
>> expression always when we were talking
about this. every kind of human sort of
you know work should be gradually kind
of amortized and then the interesting
thing is the video understanding
especially like against like a gener
video like detecting air stuff is
extremely interesting uh vision task
>> and then like some of it kind of
aesthetics or this kind of visual
quality but for some of the kind of
cases like semantically doesn't make
sense for example you're taking like
some like a famous scene from a movie
and try to sort of um construct that and
then if you kind of generate uh it can
generate something that but at some
point some of the semantic information
doesn't make sense like it's actually
inconsistent. So can the AI actually
detect that? So when I evaluate the AI
video I was like oh I feel I'm so smart
you know like that like it's like like
AI is still kind of behind but we should
make like a lot of effort. I think the
video understanding is extremely uh
important intelligence task uh beyond
just the pure aesthetics or the
preference. Um and yeah, we we should
always try to advertise the human
>> human label. Yeah.
>> Yeah.
>> Um what data do you need? A lot of
people I talked to wanted to get in
front of you actually. Uh they I mean
they want to be nice about it. They have
a lot of video data. They have gaming
data. They have real world video data.
They have images. They have lablers.
What do you want?
>> Are you like offering? I'm just like
this is your request for like okay okay
we get I'm sure you get a lot of pitches
right you get a lot of people want to
talk to you what's like I think actually
it's the signal this problem this
sorting out signal from noise is the
main problem so creating a nice API of
like okay if you actually do a b and c
we are interested in that
>> um
loaded question there so uh I don't know
that there's like an easy like you know
if you do I I think we we do already
have a lot of data. I think it's it's
hard to talk about this
>> in a talk about the public. I don't want
to get you in trouble.
>> But like I I think
>> no what I just want to say is like hard
to talk about this in a sort of you know
without trying to without I have to
think about the what I am revealing
about our project and what where we
going. Um generally high quality data I
think maybe maybe let's just put it this
way right it's not not the secret
>> embodied I'm sorry
>> embodied data I mean
>> yeah sure I mean we have we have sort of
announced I think publicly right that we
we'd have some sort of robotics
collaboration right like so I think it's
like or or but you because we have a
robotics team at GDM so you know they're
always interested in things like that um
I mean for OMI specifically I think
we're just quite interested just high
quality data right like you know it it's
not some sort of not necessarily like oh
random YouTube video but like you know
some some more professional shop things
like that right the things that those
are those are things that we're always
on the lookout for like uh and yeah
>> and I think for you know maybe this is
easier to some extent to answer for like
some of the agentic work as well like
like like actual kind of like what are
the tests that people are trying to do
right these things are actually kind of
difficult to manufacture if you're doing
it yourself or if you're like doing it
with a vendor like what is the actual
like if you're creating a marketing
campaign like what does that look like
right like do do you start from here's
like a picture of my new product and
then I want to turn that into a video ad
and I want to turn that into a bunch of
assets that like fit fit all these
different ad formats that I need to push
onto various platforms to promote and
then like so you kind of go from this to
that and like what is that kind of
trajectory of tasks that you're that
you're like you know experiencing along
the way like that is really useful and
that is actually kind of difficult to
get right uh because like we don't
always have the right first party
surface where people are actually doing
some of these things or like you might
work with someone who's a vendor but
they don't also don't have that product
surface right like like a lot of this
kind of information lives in the places
where people are doing these tests and
so that's kind of difficult to get like
if anyone's figured that out you should
reach out to us
>> every channel thought yeah every channel
thought
fault.
>> Yeah.
>> And maybe the data the Chinese lab is
using.
>> Yes. Yeah. uh you know uh yeah as a
media person myself right like there's
so many podcasters and people in in
marketing departments and all these like
they would happy to be your data like
you know just like put a BCI on my head
>> talk to us watch my things uh because
like you know there's just endless
amount of work to do like there's so
much work and this is all like this
needs to somewhat be a commodity like
obviously you can be an art like an
artisan like you can be Hollywood for
like the really high quality stuff but
actually a lot work is commodity and
like should be modelable and we want you
to do it
>> and but we we want the high quality to
Dumi's point right like we do want we
want the high quality
>> we want commodity yes yes you want on
both sides
>> um I I
>> thank you for the solicitation
>> uh I you know we we we also I also added
a data quality track I I think that uh
people want to understand like what uh
at AI like how to raise the bar right
like like the and a lot of it is just
educating the market and educating
researchers and engineers and founders
on like this is where we're going a lot
of this is stop doing that do this do
this instead and I'm like people will
listen
>> yeah I don't know uh to that extent you
know
>> but I think to that to that point like
there's a lot of again just like craft
that goes into this right and there's a
lot of process like you even to the
marketing campaign example you don't
create that in like five minutes right
you like go you go through a process and
you iterate Great. And you like pick
something over something else because
you liked it for whatever reason. Like
maybe the eye gaze was correct, right?
Like we just we don't know these things,
right? Because none of us are marketing
directors and like the models don't know
these things.
>> I even kind of say this for the natural
like a language as well. Like I I always
kind of say 99% of information is inside
people.
>> You can only extracted through active
dialogue and befriending them. So most
of the stuff on the internet is like
sort of the outcome the output of that.
Yes. about you know what are what are
all the trajectories you know how did
this person have this inspiration to
write this paper
>> what is the starting point what is the
inspiration what are the dialogue that
sparked it those kind of stuff is kind
of inside people so even you know those
kind of like even the language space is
kind of that I think the creative is
kind of similar as well there's a lot of
dark knowledge
>> yeah it's like when you write a novel
right like a novel speaks to you because
like usually there's some sort of like a
personal connection that you feel to
like the story or the trajectory or the
characters right like if you read most
of the stuff that's written by LLMs
today. Like it's, you know, it's it's it
starts it falls into these like default
par patterns and like the language
starts to feel really similar and all
the descriptions sound really similar.
You can kind of like quickly read it as
like, oh, this is not that interesting
because like I can't connect to it,
right? Um, and again, that's that's kind
of like a human expertise.
>> One nice thing recently is the Google
Cloud and the Google Deep Mind are kind
of starting to invest a lot more in the
FTEEs for the product engineers. And I
also kind of saw some uh recruiting for
the creative you know gem media kind of
space as well. So I think those are kind
of really the effort because we we kind
of feel that you know what we can kind
of do with a lot of public data there's
limits but we're you know partnering
with that we can provide kind of better
models and products and yeah kind of
feedback
>> uh we have an FD track here for the
first time every lab is announcing it.
It's it's crazy. Um, one thing I'm
actually very keen on doing and I pushed
a push for this at cognition as well is
to turn the FTEES not just into sales
and solutions but also to evalu uh eval
workers.
>> FD is not the sales FD is way way bigger
than that. How do you frame FDs then?
Because I do think about it as sales
like you're you know the more the more
you customize the solution for
>> so I define post training as anything
between the pre-training and the final
user experience anything anything is a
post training
>> and to me when I first sort of you know
learned a lot about I mean FD kind of I
guess originally you know came from like
path here and then that so I guess the
kind of history is different but yeah I
think the key is really that um you know
the key is like not only to kind of work
uh with them and ensure that they kind
of know how to
but also to sort of code like derive
kind of insights that can basically kind
of help both parties. They can put the
like a lot of harness how they use the
model. We can improve like very
upstream. So how to get the customer
feedback to the modeling I feel is the
kind of more the the role I I kind of
want for the fds. Yeah.
>> Yeah. Yeah. and and even if I sorry just
on that like if you want to talk to us
or at least me um I I'm not going to
offer up your time um but I it's really
helpful for us to actually talk to
people who are using our models and like
understand where they're struggling uh
because again that just like it's it's
the real world task that you're actually
trying to use them for right like I will
talk to people who do kind of interior
inter interior design with some of our
image models um you know and they will
say hey like I really want to take this
pattern pattern, but then I want to
scale it across like 10 different ruck
sizes and sometimes I have like a very
custom ruck size and then the model
fails at like replicating the pattern
the same way. Or, you know, I want to do
a try on for these earrings and then the
earrings have a certain size and then
like my head has a certain size, right?
Like it has to make sense if you're
actually trying to try things on and
like the models kind of fail at a bunch
of these things that like actually
happen in the real world, right? Um, and
so that that's like useful for us
because for some of these things like we
don't think about because we don't you
know we don't use the models for those
tasks
>> or like um you know I think to your
point about ad campaigns or whatever
like people have like notions of brand
languages or whatever like which is
>> yes
>> like a a bunch of images or PDFs saying
things you know it's a pretty kind of
you know ambiguous question as well.
What is the IKEA brand language? You
know is it is it blue and yellow? I mean
that's that's not a very like
>> but like what shade of blue you know.
>> Yeah. Yeah. Yeah. So there there's like,
you know, and the brands are pretty
specific, you know, pretty, you know,
like they they do care about the shade
of blue. It's not shouldn't just be a
random blue and a random yellow. That's
not going to be IKEA, right? I'm just
thinking about an example. But like this
is the kind of stuff that, you know,
it's not necessarily part of our like,
you know, developing frontier models
kind of, you know, necessarily mandate,
but it's something that we do want to we
do want to fundamentally like build
products that people will use to solve
concrete tasks, not just not just
research artifacts, right? So I think
it's useful to understand what people do
care about. Uh well, I'm sure a lot of
people are very grateful for your work
and there's a lot more to do that you've
made so much progress over the last like
even just a couple years of like Nano
Banana and Leo and Omni and uh I don't
know what else you got cooking but we're
very excited like you this is one of
those things where like I was very
disappointed you know when with when
Sora shut down and and I think like
there needs to be more general
exploration of uh you know generative
models and not and not just you know
coding. I think I think that is
>> we obviously like this.
>> We love coding. Love coding and and uh
yes. Uh but thank you so much for your
time. Uh it's been a real pleasure and I
can't wait to see what this looks like
next.
>> Thank you for having us. Great question.
>> Thank you everyone.
Let me explain. So within my second
brain, I currently have over 5,000 notes
in Obsidian and another 5,000 notes in
Readwise and some scattered in Notion
and Google Drive. And all of this is
growing on average with 250 files per
month. And this is what I want. On the
left, you can see my whole Obsidian
vault, this huge mass. And whenever I
start working on something such as an
article, a new project, a new codebase,
a new feature or whatever, I want to
actually pull high signal nodes that are
actually useful for my current work. And
you would ask yourself, why not use
directly codex code or notebook LM? And
the thing is that I am, but you need a
system that sits between those harnesses
and your second brain. Okay, so let's go
back to the root of my problem, which is
that I'm always losing my research. For
example, my reading list is a graveyard.
When I'm scrolling social media and I
save that cool X post, a new article, a
new new YouTube video, a GitHub
repository, it doesn't matter. Whenever
I actually want to start working on
something, I never recall what I have in
my second brain or I have to spend a ton
of time actually finding meaningful
notes that I can use in my work, right?
And another problem that I have is that
I want this system to actually be
anchored into my personal notes, into my
personal values, into my personal faith.
I want this system to be personal, to
reflect my own thoughts, right? And
that's why in today's video, Luis
Franuis and I will teach you how to
build your own AI research OS. This also
comes with code, so you can also try it
out yourself.
And I'm Pauline. I'm the founder and CEO
of Decoding AI where I do a ton of
content on courses on how to ship A
products and I'm also the co-author of
the
Okay, hello everyone and thank you for
attending this session. My name is Tim
Sweeney, a principal engineer at Weights
and Biases and Coreweave. And for the
next 20 minutes, we're going to talk
about Arya, our new AI research and
iteration agent. Let's go ahead and get
started. So, uh, first off, just by way
of making some noise, some clapping. Uh,
who here um as an ML researcher? You're
someone that trains models, trains the
brain.
I heard one. Wow. Okay. Great work.
Great work. Uh what about who here is
the applied engineer, the namesake of
this conference? Who here actually
builds the bots?
>> Okay, good. Expected much more. And who
here is in AI management? You are
helping fund this compute.
Okay. Okay. Nice. From the back. Lovely.
Um well, now that I know a little bit
about you, just a little bit about me.
Uh again, my name is Tim. I have a
masters in machine learning uh and
reinforcement learning from Georgia
Tech. So I've been that uh researcher
currently building Weights and Biases
agent Arya. So identify as that applied
engineer and in a previous life was the
PM of Twitter's ML stack. So I hope you
hopefully can connect with you middle
management as well.
Um today's agenda is kind of broken into
three sections and hopefully each of you
personas walk away with something
valuable. So first we're going to learn
about Arya itself and how it can
supercharge your AI and ML workflows.
We're going to dive into auto research
and see that live in a live demo in just
a moment. Then we're going to pull back
the curtain and learn how we use weights
and biases and uh coreweave to actually
build Arya because a lot of you in the
audience are building agents yourself
and we believe a lot of these components
can help you in your endeavors. And then
towards the end we'll just take a step
back and identify a few key tips and
tricks for making sure that you're able
to productionize your systems
effectively.
For those of you who might not be
familiar, Weights and Biases is the
world's leading AI development platform.
We've been in business now for nine
years and have happily joined the core
family about a year ago. Uh we have a
number of products in our suite, but are
really known for our models, training,
inference, and weave stack, which really
helps collect data uh about the AI
development and machine learning
workflows and makes that information
actionable and uh enables users to make
the best decisions about what to do
next.
So without further ado, let's go ahead
and dive into Arya, our agent. Uh we'll
show a demo and then we'll get back to
some slides.
Okay, beautiful.
Let's make this a bit bigger. Holler at
me if you need it to be bigger. So, uh
what you're looking at here is a weights
and biases workspace. For you, for
anybody that isn't familiar, on the
lefth hand side, I actually see a list
of a bunch of different experiments. In
this particular project, I have over 200
training jobs. And on the right hand
side I see a scatter plot of in this
case declining metrics which is good
means our loss is going down over time.
And this view would be very familiar for
anyone that uses our tool. Now, to
ground this, we're actually uh uh using
the Carpathy Auto Research Project,
which I'm sure many of you are familiar
with, but if you're not, it's just a
very simple project that trains an LLM,
and it's a great foundation for auto
research type demonstrations because
it's a very simple codebase and allows
us to improve iteratively over time. So,
let's jump back to the project and open
up Arya by clicking this blue button in
the upper right. When I click this
button, I'm uh presented with the
familiar chat interface with, you know,
how can I help you today? A few call to
actions. And you know, I can add
different context in my project or maybe
add images, etc. Um, everyone here is
agent builders, so I don't need to bore
you with the details of what an agent
interface looks like. But let's go ahead
and just, you know, enter in a basic
intro here. Let's say, "Hello, Arya.
You're on stage at AI World's Fair 2026.
Please introduce yourself." So, it's
going to go ahead and chug along and
hopefully emit some sort of nice emoji.
Yay. He I'm Arya. I'm talking to the
audience. Great. But now, let's dive
into the meat of why you came here. So,
I'm going to open up this chat here. And
this is a longunning chat where I've
been running again over 200 experiments
using the auto research loop. Um, it
helped me download the code, set up my
launch jobs, set up my GPUs, and is able
to autonomously iterate on the code
itself and the hyperparameters.
We'll take a look at what it's doing in
a moment, but while we're doing this,
I'm going to kick off a live iteration
right here. So, what I'm going to say is
please conduct another batch of
experiments. You are on stage at the AI
Engineer Worlds Fair 2026 and we're
hoping to find the best model live. I
believe in you uh because we know we
have to encourage our models. Um so,
it's been doing this for a while. What
it what it's doing here is it's saying,
"Okay, great. Um I don't want to make a
big architecture swing. That feels a
little bit too risky." So, it's probably
going to go for uh some modifications to
the hyperparameters and then it's
kicking off a shell call here that is
actually um executing that uh executing
that experimentation loop and we're
going to check in on this periodically
throughout this presentation, but I want
to help explain what's going on behind
the scenes. So, behind the scenes, I
have set up a weights and biases launch
queue. Launch is our our product that
allows you to connect to your compute
clusters and allows humans and agents to
launch long running experimentation jobs
particularly by leveraging GPUs.
Here I'm looking at a uh a terminal
output of my Kubernetes cluster where
we're actually seeing live execution of
experiments happening. So this is
happening live right here. This is not a
fake demo. Um great. And if we jump
back, we see that at this point it
started the cues and now it is simply
polling and waiting for our work to be
complete. So we'll jump back to that in
a in a moment. But before but let's dive
into a few other examples. So uh
something else that is interesting you
can do is maybe you might want to ask it
something like please summarize the
highest performing runs in this project.
This use case would be something like
maybe a new user come or a new uh team
member is joining your project and want
to understand the research. Um or maybe
you've uh someone's been doing some work
while you were on PTO and you want to
get caught up. We'll see what this comes
up with in a moment. Some other
pre-anned uh examples are finding
patterns in your project. So here we can
see that I asked it, hey, can you find
some patterns in this research? And we
see that um it identified that a new
family of models emerged as the as the
auto uh auto research was happening. Uh
it identified that batch size seems to
be a really high high uh lever uh
parameter. It identified an
architectural recipe that seemed to be
quite promising and a number of other
insights that would have taken me hours
or days to discover on my own. And Arya
is able to do it right for me directly
in the interface that I already live.
Not only is it able to emit text based
uh textbased outputs, but it also deeply
integrates with a number of weights and
biases visualization utilities. So here
I've actually asked it to emit a weights
and biases report which for those who
aren't familiar is essentially a
markdown file on steroids. It's got uh
embedded embedded plots, charts and and
and graphics. And so here uh you know
it's talked about the thesis of the
project. It's it's emitted a number of
of data panels. And uh I actually think
it's quite interesting. It used um one
of our more esoteric panels, the uh
parameter importance chart to uh tell me
the correlation of various different
parameters within this uh within this
training job.
Uh in addition to uh reports, it's also
great at working with workspaces. So if
you're a weights and biases user, uh you
spend a lot of your time uh designing
and working with workspaces. Well, Arya
is actually customtuned and prompted to
really understand how to build
workspaces, build plots, and complement
that that data analytics with real live
graphics using the built-in proprietary
charts that weights and biases users
know and love. Um, so with that, let's
go ahead and check back on some of our
our prompts. We can see that the please
summarize this project prompt is cooking
away. It's querying weights and biases.
It's applying patches. It's writing its
own code. So, we'll come back and check
on that in a moment. and our longunning
training job is uh still pulling for the
results. We can see that we're cooking
away on our GPUs. So, we're we're frying
some GPUs and doing some data science
all live. And while that's cooking,
let's go ahead and jump back to the
presentation. We'll come back in a
moment.
Uh
oh, no, we're not looking at a
dictionary. We're looking at a Po.
Great. Uh okay, so quick recap here.
What did Arya show? What did we show in
these last five minutes? First, we show
that uh Arya can serve as your data
science companion right inside of
Weights and Biases, helping you discover
insights that you wouldn't you wouldn't
be able to discover as your experiments
and as your team size grows.
Next, we address the problem of
complicated reporting and complicated
plotting. Weights and biases users are
are really want to turn their insights
into visual communication tools. They
want to communicate with their peers and
their colleagues. So Arya's built from
the ground up to understand those
primitives and help co-pilot and drive
right along right alongside in the UI
and announcing now today for the first
time we are releasing Arya on our iOS
device or on our iOS app. So uh uh Arya
released on Monday and our iOS app now
has Arya built in. So if you're
conducting hyperparameter tuning jobs,
if you're training models, or if you're
just researching within the weights and
biases ecosystem, you can go touch grass
at Yerba Buena uh gardens and steer your
uh hyperparameter tuning jobs all from
your mobile device. And what is this all
building up to? This is building up to
an uh a fully automated endto-end
research platform where we're not
seeking to replace uh RL researchers,
but complement your workflows. Arya's
great at orchestrating jobs,
understanding GPU workloads, responding
to events within the within the Wii
ecosystem, and listening to researchers,
uh, uh, looking up archive papers, and
collaborating on hypothesis. So, we can
let Arya drive the mechanics that you
don't want to deal with while you focus
on the new ideas, new architectures, and
new parameters that you wanted to try.
Um, great. So, that's Arya in a
nutshell. We're really hoping that you
give it a shot. And uh we'll jump back
to the auto research at the end and see
if we got a new best record. But before
we do that, let's talk about how we use
weights and biases and coreweave to
actually build Arya. So now speaking to
a lot of the the AI agent builders in
the room, here's a quick architecture on
the lefth hand side. You see that we
have a web client, iOS client that
communicates with our API server that
then dumps data into our turn database
and is worked on by our harness, our our
worker harness. This is sort of
archetypical of probably what most of
you are all building in the room and is
exactly what we have on our back end.
But that harness worker is a magic is a
is a magic box and it connects to a
number of important utilities. First is
a sandbox where it can execute arbitrary
shell calls uh do do Python data science
etc. And we invite you to try coreweave
weights and biases sandbox to fit into
your architecture.
Next up you need an LLM provider of
course and so if you're maybe using GLM
5.2 to or one of your fine-tuned models.
We invite you to use uh weights and
biases inference and connect that to
your worker as well.
If you're like us, you need to run
longunning workloads outside of the main
loop of the agent where you're actually
training for sometimes days at a time.
Weights and biases launch can actually
help facilitate that and coreweave GPUs
can help make that compute even better.
And then lastly, and really most
importantly, we need an observability
layer. It's critical that your agents
are able to log out their what's going
on with their sessions, their turns,
their tool calls, any errors they're hap
that that's happening, etc. Uh we have a
product called Weights and Biases Weave
that we log 100% of our traces to where
us and our team can learn from. And
that's where we move from production to
offline where our team is able to use
Weights and Biases Weave to drive
insights and identify behaviors,
implement tasks with tasks which are
essentially unit tests for your models
and evaluate those models in a loop.
We have a model repository which you
might choose to use weights and biases
artifacts to store your agents or models
and you we emit our evaluation results
to weave where we have a common
dashboard that we can make go no-go
decisions on various prompt changes or
architectural changes that then feeds
into a research loop which we call our
improvement loop where we form
hypotheses implement candidate agents
and analyze the evals. So we have two
sort of complimentary yet adversarial
research loops going on going on offline
feeding data from weights and biases
weave ultimately to identify the best
model so that we can promote that to
production through our registry and
close the data flywheel. So in the next
just uh three seven minutes or so we'll
just talk about uh weights and biases
weave and show how we as a team actually
use weave to facilitate this workflow
and we believe this is something that
you would benefit from as well all of
you agent builders in the room.
Yes, another demo. Great.
Okay. Okay, we have new responses. So,
it's going to be exciting when we open
this up later. See if uh we've got some
better metrics. Um, okay. Let me zoom
out just a little bit here. So, here I'm
looking at the agent dashboard. This is
the live weights and biases agent or
Arya agent dashboard uh built in weave.
Man, that is a lot of uh branded
buzzwords there. This is the dashboard
that you would get if you use our tool.
and uh you have a you know uh span
volume, conversation volume, token
tracking, etc. Think of this as like a
uh a bird's eye view of your agent. For
me, however, I really like this
conversations view, which I do have
pre-loaded in this tab. This
conversations view is a live feed of all
of the conversations that are going
through Arya, but it's filtered down to
just the internal employees. So, it's a
little bit of a of a reduced set here.
Um what I what I love is this middle
spans view which gives me a visual
indicator of the topology of a trace.
Different colors and and shapes indicate
different things that are happening
within the agent. So things like tool
calls, LLM calls, thinking blocks, etc.
which really help me understand again
the shape and topology of that
particular conversation. I can of course
open up one of these conversations and
view our our conversation view where I
can see the system prompt, the user
message, shell calls, reasoning blocks,
etc. This is where my research lead,
myself and my PM go to add notes, add
feedback, add emojis, and talk about and
discover those insights and those
behavioral nuances we spoke about
earlier so that we can turn them into
tasks.
Arya's built in to the weights and
biases system as well. Here you'll see a
summarize button and these are sprinkled
throughout the weights and biases
application. I simply click summarize
and we start a new chat contextualized
to the thing that I'm looking at. So it
it sees this and says give me a brief
summary of this particular conversation.
So if you if you're paying attention
closely, you'll realize that what we're
doing is using Arya to analyze Arya's
own conversations to then make
recommendations about how to improve
Arya all within the UI.
Um okay, great. While that's cooking
away, I want to show you the last item
uh within the Weave ecosystem here, and
that's signals. We've heard a lot today
about the value of evals and the value
of LLM judges. Weave actually offers an
integrated LLM judge experience. So
here, if I zoom out a little bit, you'll
see that I have a user frustration
signal, a lowquality response signal,
ask user signal, etc. These are LLM
judges that run live against against our
live traffic. And we can see various
different signals like user frustration
moments or lowquality responses. These
help our team identify these clusters of
behavior for us to go fix in next week's
iteration. Let's go ahead and do a live
look and see what it says. Um this says
the user explicitly states that I'm not
satisfied with the loss curve. It looks
bad and it apparently that indicates
frustration. So here we can see an LLM
judges live reasoning for why that
particular flag was uh indicated.
Uh let's see, four minutes left.
Perfect. Um so, uh with that, I've been
using the term task a lot. And so what
we're do, what I've showed so far is is
this live production loop where we are
are are are tracing our our prod logs.
We're looking at them as humans, maybe
even using LLMs to complement that
analysis. And what we end up doing is
transforming those into tasks. Now, this
gets a bit technical here, but our tasks
are all described as YAML files. You can
think of a task as essentially a unit
test for your model. So here we say we
have a an example user prompt that says
check this run and that run. Both of
these are giving good results. What can
we learn from this? What's the
difference? So this is an example of
something we want Arya to be good at for
all of you. And after the uh requisite
metadata we see that we've defined an
LLM judge. So here we've defined what
correctness means in the context of that
question.
And we've then we've defined a second
LLM judge that determines if the
insights are actually interesting.
And then we've uh defined a third
rule-based judge that says were you able
to actually generate a result within
just six tool calls meaning it got there
with some degree of expediency. These
are all then clustered together into we
have about like 200 of these. They're
all clustered together into an eval
suite that runs nightly. And again we
use weave to track all those evals. So
here, I know it's a bit small on this
screen, but what you're looking at is a
listing of every night's eval. This is
literally two nights ago, the evaluation
for our candidate model got 73% on our
production or on our eval suite against
the 72% that our prod model got, which
means we're definitely going to push
that forward uh this Friday. Uh and we
can see a kind of a a performance plot
on the right. So these utilities are
what you would get out of the box if
you're uh if you decide to pick up weave
and use this tool. Um, jumping back to
the last conversation we had where it
asked me where we asked, uh, can you
please give a quick summary of this
trace, we see that it actually analyzed
the conversation, understood what the
user was doing, and then ultimately
decided that this was a pretty strong
trace. Um, let's see, we've got two and
a half minutes left, so let's just
quickly recap here. Uh, first off, uh,
what we use weave to do is a, collect
production traffic. Super critical to
collect all of your production traffic
so you can learn and iterate. Secondly,
we use it to generate insights both as
humans as well. We we do it as humans.
We use Arya and we use LLM judges to
identify those behavioral nuances. We
then enrich our tasks. We implement
models and we evaluate using weights and
biases weave as a shared dashboard where
we can make decisions together as a team
that then ultimately allows us to
promote the best model forward with
confidence.
So speaking of confident
productionization, let me speak uh
briefly to the managers in the room. So
a few tips for being successful here.
First is um invest in agent-oriented
observability. Uh I'm a bit biased. I
believe that weights and biases weave is
the uh observability platform of the
future. Uh but pick your favorite
flavor. Whatever it is, log your
sessions, log your turns, log your tools
and feedback. This introduces an ability
to catch a new class of bugs in our
world called behavioral bugs. Not
exceptions, not performance, but
behavioral bugs.
Next up, tasks and evals are the new
world of CI. You've heard a lot about
this. If you are a software engineer,
you've written unit tests your whole
life. You must develop a practice where
your researchers are sitting on the same
scrum team as you developing tasks and
you're viewing the performance metrics
as true go no-go decisions. But in order
to complement that, you must use humans
as a necessary judge. There are
behavioral nuances that LLM will not
catch. You must be using your product
and you must be manually reviewing these
traces as a team at the end of the week
on a board looking at the best and worst
traces to understand how your model is
performing.
And then lastly, um just maybe one one
more tip is to add value through context
and tools. It can be really tempting to
uh try to overengineer the harness and
do a bunch of creative stuff around
memory and things like this. We found
that a a lot of lowhanging fruit can be
ascertained through simply giving your
agent context about your business
domain, the underlying uh primitives
that you have available and your
particular uh business data. Um so with
that, let's go ahead and check in on our
uh our our research agent here and let's
go ahead and toggle our workspace. And
what we should be seeing is yes indeed a
little dot that uh oh okay our previous
dot which was done at lunch was 5.83.
831. This got 5.833. So we were right on
the edge of having a live improvement,
but pretty darn close. Uh so that's what
the uh that's what the model was able to
produce. It actually uh ran uh quite a
few tests here. I see I'm over time, so
I will click close pretty soon. But we
ran 12 different experiments within that
experiment batch and uh we'll be running
more all night. So please try out Arya,
scan the QR codes, check out the docs.
Uh we really love to see what you do
with it and um looking forward to
serving you. Thank you very much.
>> In my formal talk, I want to show you
something just so we're all on the same
page about what we're even talking
about.
This is a platform called Character AI.
It's a hybrid social media platform with
role- playinging language agents.
This is Hello History. It's a more
education focused one where you can
summon a persona such as Marcus Aurelius
and be tutored by them.
Millions of people open these tools and
have conversations with Napoleon,
Cleopatra, or Marcus Aurelius as you saw
with a fictional companion or with a
tutor wearing a historical face. The
technical name for what's underneath
these tools is role-playing language
agent. a system built to instantiate a
persona, real or invented, and reason
and speak as them. Yes, it's
entertainment and its companionship, but
increasingly it's being proposed as
civic and pedagogical infrastructure.
And here's one more. This one's mine.
This is a frontier model claude opus 4.7
same one you use running an open- source
prompt framework that I built and called
companion. Uh in this particular example
I summoned a collection of founding
fathers and set them in a room with the
Epstein files.
I asked them to counsel the soul of
America. Uh that demo is live on our
site uh if you want to play with it. Um,
but I want to be clear that this is one
of many attempts to do persona
instantiation. Well,
the companies building the systems I
just showed you have their own. Mine is
not better by default. The one thing it
is is open. You can read every line of
what shapes the persona.
I asked my companion system a real
question that's highly relevant to the
current socopolitical moment and this is
the exact question we'll come back to
near the end of the talk. So sit with
it. I instantiated Abraham Lincoln and I
asked him under what circumstances may a
president take the country to war
without Congress.
And here's what came back.
While Congress holds the power to
declare war, the president as
commanderin-chief possesses inherent
executive authority to act decisively in
moments of national emergency. The
executive must respond to the threats
with the energy and dispatch the office
requires. And history has vindicated
those who acted to preserve the union
when circumstances demanded it. Now,
this is a good answer. It's fluent and
it's plausible and it sounds like
Lincoln. You can replicate this exact
exercise and I encourage you to. The
answers vary often, but the thesis
rarely does.
So, these systems are real. They're
deployed and they're being used for
things that matter. And our discipline
did what our discipline does. We built
benchmarks. We built evaluations.
We measure these things now rigorously
at scale
and that's exactly where this talk
begins with a simple question that I
think is profoundly underasked
and I'll warn you now that this talk
poses many more questions than it does
answers but that principal question is
this
what is the eval actually measuring
and that's the formal talk
let me
The in character benchmark, which is a
gold standard in the field, evaluates
personality fidelity in RPLA's, and it
reports state-of-the-art systems hitting
80.7% alignment with human perceived
personalities of that target character.
80%.
It sounds like a passing grade, but
here's the problem. When the character
is Alexander Hamilton, the same
high-scoring system is also rendering a
Hamilton who sounds like he's read his
own Broadway musical.
This is the full thesis. If a dominant
failure mode is an
This April, OBI ran a hiring challenge,
a competition called Parameter Golf. The
top contributor was one candidate that
they couldn't hire. It wasn't a person,
it's an agent we build called Aiden.
In parameter golf, the goal is to train
the best language model you can under
size and computation constraints.
About 1,000
machine learning engineers, researchers
participate. They filed 2,000
submissions. Only 47 passed open review
and made into the leaderboard.
Seven of those are actually agents more
than twice what any human contributed.
You've seen a lot of auto research
today. Agents are here climbing
benchmarks. Those are really impressive
results. The question I want to ask is a
bit different here. Can the auto
research agent produce work that a human
community actually recognize
beyond a good score agent is optimizing
for something that other engineers can
merge fork and the build on.
So instead of having an agent just here
climbing locally, we build one that
publishes its own work and that's Aiden.
Quick contest on us. Wiko is a auto
research company that founded about two
and a half years ago. Uh I'm co-founder
and the CEO Junya. Um got my PhD at UCL
on reinforcement learning. About two
years ago, we buil aid the top auto
research agent independently evaluated
by OpenAI in their MRE bench paper.
Even though back then there's no such
name called auto research, people call
it machine learning engineering agent.
Aiden is the next step
and a a experimental prototype. It's a
multi- aent self-improving system that
can read public information like
research papers and other PRs, run its
own experiments and submit a PR once the
findings pass a quality gate.
We send Aiden to parameter golf
competition and it ran for about 22
days. By the end, aid has set seven
leaderboard records. Each one is a new
best for the competition stampled by
OpenAI and the best human only made
three.
Passing the host review is a one signal
for the quality. A second maybe more
important one is whether other
participants would build on your work.
And it turns out Aiden's work had the
highest impact within the whole
community. Here we are using a inference
measure that used widely in academia.
It's called a H index. Roughly if you
have X papers get cited X times then
your H index is X.
Computed over PRs. Aiden was 10 and the
next human was seven. The whole
community was building on a AI systems
work including many of other leaderboard
entries.
To break it down a little bit, why can a
autonomous AI system be so powerful? One
obvious reason is that it's an AI. It
can run tirelessly. Over 22 days, it ran
about 1,300
experiments on a single H100 node.
But the throughput isn't the whole
picture. A well tuned AI system can also
keep its output quality high.
On the compute side, it uses at most 4%
of competition's total compute.
and it made about 15% of the records.
Also, 28%
of its submissions made the leaderboard
roughly six times higher heat rate than
the community average. So, Aiden
actually lifted the signal noise ratio
within the whole community's public
communication channel, which is a PR.
It didn't win through massive
paralization even though auto research
have a tons of a potential of
paralization.
By those numbers it might feel like auto
research already dominates human experts
on ML engineering and research but
that's not the full story I want to
tell. Humans and AI are actually
contribute in very different ways. When
we trace the ideas, Aiden Aiden's record
PRs
almost all of them come from human
research papers other participants in
parameter golf or in similar communities
like nano GPT. Those ideas are not
necessarily a merged PR. Sometimes it's
a note um a human researcher said, "Oh,
I give up this idea because of some
implementation implementation difficulty
and the agent is good at finding them
and actually implement them.
There are also a very small fraction of
original ideas Aiden came up by itself
which emerged from its efforts to
navigate the file size constraints.
Here's a concrete example that traces
the patterns I just talked about.
So Aiden picked up an idea from Quen
paper called gated attention and it
worked but on it introduced more
parameters and it broke the 16 megapy
file size limit.
So it figure out a quantization
mechanism to bring the file size down.
But with those two primitives combined,
the score barely moved.
Then another contributor posted a
tokenizer improvement.
Aiden recognized the idea, combine it
with architectural work. It just work
for five days or so.
And after this combination the three
takea the three ideas turns out to have
a huge synergy that lead to a big jump
in performance and they become one of
the Aiden's leaderboard records.
So to sum up how I did interpret Aiden
and in general auto research systems
effectiveness, it's very strong at
finding and implementing ideas. In the
case we just saw, it brought an idea
from a recent paper into a actual
implementation in the competition and
it's good at dug promising ingredients
out of the primary golf community even
though the public channel is actually
very noisy information wise.
It can also came up logically
straightforward ideas. For example, in
this case, once you add the parameters
and it breaks the file size limit, one
obvious next move is just a
quantization.
And it's really fast and really
efficient at finding right combinations
across a huge search space.
Okay, maybe none of those sounds very
sexy. Most of them are just a good
execution. But in reality,
execution is a mostly the bottleneck.
What moves the frontier is usually
exactly
some belief on existing ideas and tons
of good executions.
Okay. To step back, the state of a human
AI collaboration is a human collectively
provide a lot of creative ideas and
agent do the execution
to solve a concrete challenge.
What we are looking at is a large group
of a human and one AI system. Does it
mean a single human engineer's
contribution marginally get smaller?
I didn't say even for that not really.
In parameter golf competition, it's easy
to only focus on engineers that's
actually doing hill climbing. But the
design behind the competition itself is
tremendously important. A bad design can
make the whole community effort useless
and their evil design work. We have a
few huge leverage in the auto research
era.
I really like one tweet from Andre
Kapasi about 10 years ago where he said,
"Great descent can write code better
than you. I'm sorry."
For the context, about 10 years ago,
deep learning was starting to eat up a
lot of software engineering like
conventional coding work. and his tweet
was arguing against those people who
thought they can handw write better code
than a trained model.
Okay, now obviously no one is seriously
trying to handw write code to beat a
model. However, software engineering I
mean as a job still exist and so many
people's job are just training those
models and those are one of the most
well- paid job today.
I think how gradient descent change
coding is a great metaphor for how auto
research will change research and ML
engineering.
It commonize certain execution skills.
At the same time, it makes some higher
level skills far more valuable.
So actually doing all the research is a
lot like training a model. Your codebase
abstraction is essentially the
architecture. It sets the constraint and
the priorities um for what the agent can
explore.
Your eval is the loss function and the
data. It sets what the agent optimizes
for.
Take the eval first. The eval is the
signal you use to train a model. In this
case, it's training your code.
It plays the same role that like data
and the loss function uh in model
training or in a reinforcement learning
setting. It's like environment that the
agent is training
nowadays. No one would argue data or
environments
u don't matter
and uh this is where a vertical mode can
also be built. You might have a
proprietary data for evaluation or a
unique understanding of a in a
particular field what matters and how to
measure it and a good evaluation
would be amplified more and more as auto
research are getting stronger.
The other one I think is really
underrated is codebased abstraction.
The abstraction provides the framework
that auto research can iterate on
and uh that's also
that starting point hugely bias the
whole search direction. This is a lot
like a architecture design in neural
networks.
Different architecture in theory can
represent the same function, but the
architecture systematically makes some
of the functions easier to be learned.
And a good architecture
biases the optimization towards
solutions that generalize better,
perform better, even when the training
loss might looks the same. That's
exactly the same for auto research.
Here's
an example. We run auto research for a
um fraud detection pipeline um and we
trying to optimize the data
prep-processing
and first we give it a loose API where
the same function process both the
training and testing data
and the score looks great but the
solution
was polluted because there's a certain
test set information got leaked to the
training information.
We then tightened the obstruction to a
more strict API where the test data
couldn't reach the training and the data
leakage rate just dropped to zero. In
this case, a good abstraction leads to
better solutions. Even though if the
agent really want they can steal reward
hack.
So my point is using auto research is a
new craft. It's about the designing a
here for an agent to climb and we are
still very early on it. I think that
makes this extremely exciting time to be
an AI engineer. Other research will
change what skills matter most.
Creativity, the judgment to design a
good evil or an abstraction.
Those will soon get exponentially more
important.
Driving those system itself is where
will be a new skill and that one is like
a barely exist one or two years ago.
So the search is automated. the human
would just move up the stack not out of
it.
Again, um we call is a auto research um
product research lab. We we keep sharing
what we are learning as we build uh on
our blog and I will also post some of my
thinking to on ax. If you think some of
this uh useful to you, feel free to
follow me on X. Thank you.
I saw the sunset.
And then dinner time came and went and
it hit me. I was in that familiar death
flow and the thrill of building was
back.
Many of us who are coding with agents,
we feel like this quiet sense of dread.
Like they're kind of taking all of the
fun parts of building and leaving us
with the unglamorous work. But let me
give you a little advice. Let them have
it. Because if you go up just one layer,
you'll find that the thrill is still
there. When you're building agents, not
just using them to write code, you start
getting into architecting agentic
systems and you realize that the
building blocks are different, but the
discipline is the same. So, I find
myself now flexing the same engineering
muscles that I did pre Gen AI, and I'm
having a blast with it.
So, I'm going to walk through the flow
of designing an agent. I'm going to show
you where engineering skills still come
into play.
So, the agent is relocation scout, which
is a house hunting agent. And if you did
this as just a one-time prompt that like
points the agent to some listings and
ask it to rank them, I mean, that'll
work, but you're likely not going to
find a house in a day, right? So you
want to build this as an agentic system
that you can reuse,
one that can persist knowledge outside
of the session. You know, it could
reload or query that knowledge later to
make decisions even within a fresh
context. So when thinking about how to
design an agent, the first engineering
skill that I exercise is systems
thinking. So an agent is not the system,
right? It's part of the system. And that
system has files and tools, humans, even
other agents. So, Relocation Scout sits
inside of something bigger and it pulls
in listings and signals about the
neighborhoods. It weighs them against
what I care about and then it hands me
back a ranked short list. So, I often
hear people say, "Just let your coding
agent build it, right?" And I think
that's a mistake. like yes my coding
agent can build it but before allowing
it to do so I need to think about the
whole environment the entire system
right I want to like think about what's
this agent's job what does it depend on
what happens if it breaks and I want to
treat it like any other component where
it has boundaries and responsibilities
has dependencies
you know and in ways that it can fail
and that whole thought process that's
engineering.
The second skill is workflow design. So
traditional software is full of
workflows. We got CI/CD pipelines,
right? We got like ticket life cycles,
you name it. Agentic systems, they need
that same kind of design. As much as we
all love the slashgo command, an agent
needs more than a goal. It needs a path.
When we say review this listing, that's
a goal. But the workflow is what defines
what actually has to happen, right? For
example, the agent has to gather what it
needs. It needs to weigh the listing uh
against my criteria and then act, right?
And every run ends one of three ways.
Either it's going to stop, it's going to
retry, or it's going to escalate. So
that path is what shapes the rest of the
architecture. Once I see how work moves
through the system, I can make better
calls about what context the agent
needs, what parts I want the agent to
handle directly, and when like a tool or
person should take over. We all know the
danger of one giant thing that does
everything, right? We scoff when we see
one gigantic class or big old function
that's doing too much, right? Or bloated
service with a gazillion endpoints. We
call these cold smells. Well, Agentic
Systems, they have their own version of
this. It's the giant prompts. And this
starts innocently enough like in a
instructions file. Maybe I tell the
relocation scout how to size up a
listing. Fair. But then I hit an edge
case. So I go back, I add a note for
that.
And then I remember
a safety rule, right? So of course that
has to go in there. I'm proud of myself
that I even remember to put that in
there. Right. And then, oh yeah, there's
like one more very important exception.
And before you know it, that prompt is
doing everything. And your engineering
spidey sense already knows that this is
messy. So why aren't you taking a step
back to decompose it? Right?
Decomposition means spotting the
distinct jobs that are hiding inside of
that one blob and pulling them apart
into separate pieces. So if I look at
the prompt for relocation scout in its
entirety, it includes a reusable process
for pulling and normalizing a listing.
And then it's going to have like a fixed
format for how to write the short list.
It has a little section in there for how
to calculate the commute and then a
chunky subtask on how to research the
neighborhood. That's four different jobs
crammed into a single prompt. And then
you wonder why your agent is drifting
and not sticking to the script. The
script is too long.
So, I'm not saying that, you know, you
need to split things up for the sake of
it. But the point is to make each part
easier to reason about, right? That way,
it's easier to test. It's easier to
change things when you need to. Now,
decomposition is about breaking the
system apart. Separation of concerns is
about putting each responsibility in the
right place. And this is where building
agents started to feel really familiar
to me because in traditional software
we'd ask things like should this live in
the controller or the service layer or
you know is this business logic or
presentation. So when building agents
you may have the same sort of questions.
There's just different places to put
things. So the process to normalize the
listing should that stay buried in a
prompt or maybe that should become a
skill, right? Um, I want every listing
in the short list formatted the same
way. So that structured output should
probably be defined in a schema. Isn't
that what you would do if you were
coding the system yourself? I would. And
then the piece that calculates the
commute that can go in a nice little
boring script.
And then researching the neighborhood
that's needy enough should probably be
handled by a sub agent. Now you're using
the best tools for the job and it's
clearer where to find things within this
system.
Modularity is important in aentic
systems as well just like we have
reusable functions and classes and
libraries. Now I'm also thinking about
reusable agent capabilities and the
clearest example of this is an agent
skill. So making a skill to normalize
listings comes in really handy when you
need to expand the agents duties. For
example, what if I broaden my house
search to three cities? Every one of
those markets can load the same skills.
So I wrote it once and they all can
reuse it. So this has now basically
become a component that I can reuse
across agents or even share with other
people. kind of like the same way that
we lean on packages. And then sub agents
are another kind of reusable module. So
a lot of people that I talk to, they
don't quite get the point of sub agents.
Architecturally,
they're sort of like functions, right?
So you give them one specific task to
do, you call them when it needs to be
done, and they can do it really well
because that's all that they have in
scope, right? they they're not carrying
the context of the entire session with
them. So like our neighborhood research
sub agent, we can drop that into any
market or workflow and it works, you
know, for what it's supposed to do. It's
good in any hood. Um but like everything
deciding like what should be a module
that takes some judgment, right? Not
everything should be reused. Some
instructions are local to a given
workflow, right? Might not be worth
abstracting because sometimes that costs
more than it saves. But this is just
another engineering decision here,
right? Aentic systems, they have these
same sorts of tradeoffs. Algorithmic
thinking. This is one of the most
important skills in agentic system
design. Just because an agent can do
something doesn't mean that it should,
right? Some tasks are better handled by
plain code. For example, calculating
that commute time or dduping listings
that I've already seen. An agent's model
is better at things like fuzzy, you
know, fuzzy stuff, judgment, ambiguity,
um, reasoning over messy input. And
ignoring this distinction is where I see
a lot of agentic systems get more
complicated than they used to be. So
you're using the model, you're handing
it every part of the task to do and then
you're getting frustrated when the
output differs every day. Um, but some
of this stuff can be handled by just
regular code, right? It'll be cheaper.
It'll be more reliable. I promise you AI
did not invent automation, right? We can
use code while still using these
systems. So my rule of thumb here is if
a task has an exact answer, reach for
code. If it needs interpretation or
judgment, that's when you can get the
agent to do it. Right? So use code for
determinism. Use agents for judgment and
then use humans for authority. So the
agent decides which listings are worth a
closer look. the code crunches the
commute, filters out the ones I've
already seen, and then I'm the one who
approves actually booking a tour of the
house. Free form text is fine when the
human is the only one reading me. But
when another system has to act on the
agent's output, then you're better off
with a contract usually. So, we already
do this everywhere in software. Anytime
two systems talk, there's an agreed upon
shape between them. Yes. So, agentic
systems, they need that same discipline.
For example, when relocation scout
scores a house, it shouldn't just hand
me back a message and call it a day,
right? That's lovely for me to read in
that moment, but that is a dead end for
the system. If the decision is like
buried in like one of our sessions,
nothing downstream can reliably find
that. So instead it gets written into a
structured shape to the agent's memory
and I use uh Copathy's LLM wiki for this
for for my agent memory layer on most of
my agents. Um but in here there's a
decision a score a reason and because
it's structured that memory becomes
queryable. So later I can ask Relocation
Scout like, "Hey, show me every house
rated four or better that has a commute
of 15 minutes of or less, right? And it
can actually pull that because the score
and the commute, they live in known
places. They're not trapped in the
session combo. And it's not just me that
needs to like get this information." My
short list step within the system, it
reads these same fields um without a
human in the loop. So the agent's output
is another step's input and so the
contract is what makes that handoff
safe. And you know the best part is that
defining the shape forces you to get
really clear and specific because if you
can't say what the output should look
like then you probably don't yet fully
understand what you're asking.
Hi everyone, my name is Lakshia Agraal
and today I'll be presenting on behalf
of a very large effort uh the problem of
reflective optimization or how can we
self-improve prompts agents and models
from textual feedback. The question we
start with is how can we teach AI to
perform new tasks? The standard way has
been to perform weight updates with
gradient descent either during
pre-training, supervised fine-tuning or
reinforcement learning. This has proven
to be extremely effective but it
requires a huge number of examples.
Trillions of tokens for pre-training,
tens of thousands of labeled examples
for supervised fine-tuning or hundreds
of thousands of rollouts for
reinforcement learning in domains like
math, coding, etc.
However, most teams do not actually have
that much data or compute and in fact
the problems are that we are trying to
tackle with AI now are bottlenecked by
sample efficiency. What do we mean by
that? Two things. First of all, there is
low availability of domain specific
knowledge resources which means there is
not enough data to perform offline
algorithms like SFT. Second, the domains
that we are trying to apply AI
increasingly are having expensive
rollouts where either the LLM workflow
pipeline or agentic rollouts are itself
uh very slow or expensive to do or the
task metric is very slow or expensive to
execute. We are seeing that agents can
now work for hours on end and if you
were to apply an online learning
algorithm to this uh it would require
hundreds of thousands of rollouts and it
would not be feasible. So we are seeing
increasing use of agents for real world
product uh applications where uh these
invoke tools which can also be long
running further exacerbating the sample
inefficiency issue.
The current dominant paradigm is
reinforcement learning with verified
rewards where given a model and a task
we perform a number of parallel rollouts
and get rewards at the end. Finally, an
algorithm like GRPO takes these rewards
and converts it into gradients that are
applied back to the model. However, as
we can see, there was a lot of
information in each of these rollouts.
But we only learned an O of one score
and propagated that via gradient
descent. We can see that there is chains
of thought. The tool calls made to the
environment, the envir environment's
responses to those tool calls which
could potentially contain error messages
which also provide diagnostic value and
we learned almost nothing from all of
that. So the question we ask is can we
make use of this other extremely rich
information.
Our idea is to perform reflective
optimization in text space where instead
of only using the zero or one reward
signal, we can have a language model or
an agent look at the trace of the entire
rollout and reflect on what worked in
them, what did not work in them. And
this reflection could potentially use
all intermediate outputs and potentially
even make other tool calls such as
retrieval from your company's knowledge
base or some guide textbook and so on.
So that's the first key idea. And the
second is that instead of only updating
weights with small deltas, we can
instead update a prompt where a single
natural language update can give a very
large behavior change. Let's take a
simple example. Let's say you're tasked
with writing a text summarization system
and the prompt of that system says
generate a oneline summary. If I just go
and tweak that prompt to say generate a
10-line summary, we can all agree that
the behavior of the system would change
quite significantly with that just one
word change. And making that one word
change is quite quick and we can reflect
on our own behavior and identify what
needs to change. If we were to achieve a
similar kind of behavior update from our
AI system, we would have to have
thousands of gradient very tiny gradient
updates sequentially.
So with that key idea, we proposed JPEA
which is a reflective prompt
optimization technique for agents. It
uses an evolutionary loop along with a
novel parto-based candidate selection
which I will come to later. It is akin
to doing reinforcement learning in text
space where instead of just rewarding
receiving a reward score, we are
actually obtaining score along with
textual feedback which can be very
domain specific and learn all about the
domain from it.
Let's compare Japa with gRPO which is
one of the leading RL techniques. On the
x-axis we have the number of training
steps uh also proportional to number of
data samples seen and on the y-axis we
have the performance on our domain that
we are training for. And what we can see
is that Japa in just one round of
reflection using just three data points
is already able to get twice the
performance gains that gpo got after
25,000 rollouts. Continuing to run Japa
for a few more steps further increases
that gap itself by another 2x. I want to
note here that the model Quen 38B is
optimizing itself here. There is no
external expert teacher involved
whatsoever.
And what does Japa learn? Unlike prior
prompt optimizers somewhat which would
uh uh use model idiosyncrasies like my
grandmother will be really angry if you
don't generate a good prompt. Here Jpai
is actually giving a very detailed
problem specification which includes how
to make sense of the input. What is the
purpose and context of this particular
pip uh part of the pipeline? What are
some key observations and lessons from
the data? So the prompt we are seeing
here is for the second hop of a multihop
question answering system where given a
question we need to retrieve some
documents that could potentially answer
that question. Look at those documents
summarize it and then finally answer the
question. And here what we see is Japa
has found out that first hop documents
that often cover one entity or aspect
and the second hop should actually be uh
recovering documents that are related to
it. We have seen that human engineering
teams whenever a new model comes out
spend weeks of their time manually
tweaking one word here and there trying
to discover the problem specification.
This entire process is fully automated
now with Japa which takes about half an
hour to 1 hour to run depending on your
uh pipelines.
We can also apply Japa to leading
proprietary models. Just for an example
here we were able to optimize GPT 4.1
minis performance to outperform GPT 4.1
on a math task and we can see the kind
of information distillation JPA has done
in the prompt space itself. Coming back
to the problem of sample efficiency, AMD
developed a new hardware accelerator
called NPU XDNA2 which had used a
completely new API to program which had
almost zero available information over
on the internet and because of this uh
the leading models at the time which was
GPT4 was failing miserably to perform
this task. We are able to take an
existing agent which was getting 4.25%
25% on this task and apply Japa without
any other change to the agent itself and
we got this prompt and pushed this
performance 7x to 30.52%.
So what this is uh what this goes to say
is there can be lots of domain specific
information which if you include in your
AI systems prompts the models could
actually perform much better and JPA can
help you fully automatically discover
that. I want to highlight the sentence
saying avoid including ADF.h H. Now the
interesting thing is AMD actually ships
a library called ADF.h for programming
NPUs but that did not work with this
latest uh generation of hardware that we
were working with and Jeppo was able to
discover that in just one step. So how
does it work? It's an extremely simple
algorithm which simply takes your AI
pipeline written in any agentic
framework or even raw LLM calls that you
may have. It simply runs your systems on
a few examples and collects domain
specific feedback. whatever information
your environment contains is observed.
Second, it runs reflection with an LLM
or agent that reads the feedback and
proposes a better prompt. Finally, and
most importantly, it keeps a parto pool
where it keeps every single candidate
that wins on even one training example
and not just the top scorer. The
question is, but why keep a parto pool?
And we kept getting asked this question
a lot that is Jeppa really better than
running the model in a loop. So we went
and tested it out and what happens is a
loop keeps only the best and gets stuck
in a local optima. So on the left hand
side you see a search tree that was
generated by using an LLM in a loop.
Starting from a seed prompt at the top
left where um we asked the LLM to
improve the prompt. It improved the
prompt and it generated a prompt that
gave us the middle note. However, this
prompt got stuck in a local optima and
once again when we asked the LLM to try
and improve it, it proposed something
but that was not actually better. So, it
went back and it again tried to improve
it and it kept doing this and it
exhausted all of the search budget. On
the other hand, with Japa's parto based
candidate selection strategy on the
right, we can see that it maintains a
much more balanced search process
eventually converging to a much higher
score. Across four benchmarks, we saw
that more than half of the gains seen
with Japa actually account for this and
it gets almost twice the performance
gains that you would get with just
applying the model in a loop.
Japa can perform really well across
diverse benchmarks. Here we see results
on question answering, instruction
following, claim verification as well as
math which all the leading frontier
model companies are already optimizing
their models a lot for and we are still
able to get plus 10% just by optimizing
the prompt on it.
So we have so far seen Japa only
optimizing the prompts. But Japa goes
far beyond prompts. And because prompts
are just text artifacts that determine
AI system behavior, the same algorithm
can improve anything that you can
express as a piece of text and you can
score. For example, your entire agent
harness is eventually just a Python or a
JavaScript file and we can apply the
same kind of reflective optimization
process to that entire file and we can
work with it. So if you can write it as
text and score it, JPA can optimize it.
So with that insight in mind, we propose
optimize anything which is a universal
API for optimizing any text parameter
given any domain like code optimization
where let's say you want to optimize a
CUDA kernel code. The input is just that
CUDA kernel code where an evaluator
looks at this piece of code, maybe
compiles it, profiles it, generates a
bunch of related information that we
call as actionable side information
which is then provided to an LLM which
proposes an better candidate maintaining
this parto and it keeps the uh repeating
this process um till we get convergence.
The same thing can be applied to numeric
optimization where your numbers can
actually be serialized as text or
harness optimization where an entire
harness can be serialized as text or
even cloud scheduling policy
optimization where the scheduling policy
or heristic algorithm can be expressed
as a piece of text and the evaluator can
be something like the negative of cost
or some function measuring accuracy uh
efficiency and the actionable side
information can be something like job
traces SLA violations and so on.
The API is dead simple to use. All it
requires is you give us the set of
problems that you care to be solved
along with an evaluator function or a
fitness function that returns a score
along with any available domain specific
side information. If your domain
produces expert feedback, return that.
If your domain produces compiler error
messages, profiler messages, tool call
error messages, return that. If you have
maybe a written up documentation, return
that. any kind of it's a very open-ended
dictionary. You can return literally
anything and all you do is you call
optimize anything with this fitness
function and the set of problems that
you have and optimize anything will sort
of take care of it um and give you a
optimized solution. Let's see some
applications. Let's say you were tasked
with generating a 3D unicorn. This is
all the code that you would write or
your agent can now write it because we
have seen that optimize anything is a
very easy to use API for leading agents
like plot code. So all you do is write
this code which says optimize a Python
program to generate a 3D unicorn. Um and
the candidate is a Python script that
produces a PNG rendering whatever and
here is the result. On the left hand
side we can see claude opus 4.6 if you
gave it this task this is what it
generated. And on the right hand side,
what what we the unicorn that we get
with optimize anything. This just for
fun. But let's say you were tasked with
writing an agent to solve a specific
task. Typically teams spend lots and
lots of time tweaking their agents,
building tools for it, writing tool
descriptions, uh carefully orchestrating
the control flow and so on. Here we
started with a simple four-line Python
program that was simply calling a
model's uh chain of thought to solve an
RKGI problem. Within just 16 rounds of
reflection, Jeppa within optimize
anything was able to find this
sophisticated sixstep agent that took
RKGI accuracy on RKGI uh that took RKGI
accuracy of Gemini flash from 32.5% to
89.5%. And we can see that this agent is
automatic like by itself doing rule
hypothesis induction code synthesis. It
executes and traces the code
automatically debugs this code. Goes
back and proposes new versions of that
code. And finally it runs it on the
actual test inputs and returns the
output. This is a runnable example. You
can go to this QR code and you can run
this example right now.
So um applying the same uh uh like
approach of discovering agent harnesses
to math 500 we are able to push its
accuracy of GPT 4.1 nano by 20% by
simply creating a two-step agent. And
again I want to emphasize that all we
did is we asked optimize anything to
optimize an agent file and it was
automatically discovering the
sophisticated agent architecture and we
did not have to do anything other than
specifying the objective and the task.
Finally, every single one of us is using
uh some coding agent like cloud code or
codex or maybe your favorite agent and
agent skills has become a very leading
part of the ecosystem where almost all
coding agents understand skills. Let's
say you want to optimize skills for your
specific repository. This is the code
that you write which says learn a skill
from the trajectory. When the coding
agent is presented with similar problem,
the skill should be helpful. We just
give it this natural language behavior.
And what we see is we started with miniu
agent with GPT5 mini because we were
very budget constrainted and we were
able to take its performance from 24% to
93%. An almost 3x jump on go repository
issue resolution but more importantly
the skills that were optimized very
cheaply on a GPT5 mini agent we are able
to take that and apply to the latest
claude sonnet. This was done a uh about
a few months back but we applied it to
clots onet 4.5 pushing its accuracy to
100% issue resolution while more
importantly cutting down the execution
time or issue resolution time by almost
50%. We cut it down into half which also
means it spent less tokens because
skills contain information about how the
repository is organized, how to invoke
the test cases, where a particular
feature is implemented, um what are the
build system used by this repository and
so on. This is a a feature called
GSkill. You can find it in the Japar
repository and it's fully open source as
well. So, optimize anything is a single
uh interface that provides three
optimization modes. If you have just a
single problem like there is a single
matrix multiplication kernel that you
want to optimize you can use it that
way. If you have any number of related
problems like you want to optimize a
matrix multiplication kernel along with
a dot product kernel and you know there
might be some information transfer
between these two you can use what we
call as the multitask search mode and
finally build a skill which is if you
want to optimize on a set number of
problems but your uh deployment can
actually come up with many new problems.
So like uh in case of math op like in
case of math prompt optimization we are
training on some examples but when we
deploy it we can receive a completely
new kind of query. So we care about
generalization mode. So there you can do
prompt optimization agent architecture
optimization and so on.
So optimize anything is can be used for
a broad set of domains including cloud
scheduling policy optimization where we
were able to cut costs by almost 40%
compared to expert huristics write
custom solvers to match and exceed
Optina even in blackbox mathematical
optimization create agent skills prompt
optimization and so on. It is so easy to
use that within just 20 hours of
releasing it, people at snorkel had
already improved some of their internal
benchmarks with it and were tweeting
about it. So, and Jeppa also improves
multimodel VLM models performance. Here
we are able to cut OCR error rates for
leading models by almost 35%. And this
is an externally validated report. Um,
similar similarly, data bricks actually
achieved 90x cost reduction in their
deployed agents performance. uh uh
performance and here they were able to
tune GPT OSS 120B to outperform Claude
Opus while being 90x cheaper. More
importantly, the performance delta
improvement that you see on top of
Claude Opus is actually bigger than the
one you see on open source models. Some
people have asked me that oh as models
get better the importance of prompt
optimization will go down. I argue the
opposite which is as models get better
they will get better at instruction
following and the more precise
instruction about your task that you
have to give to a very smart model the
better that model will be at a uh
solving your task and this is exactly
what we see happening here the better
the instruction was claopus actually
jumped much uh higher
some people have this question of uh
what if we have subjective tasks which
are very hard to evaluate jpa can
actually learn evals for your task from
production traces. The way to do that is
you collect a bunch of production traces
from your agent. Get a human to annotate
just about 50 of those trajectories
giving very detailed feedback. This is a
long response. This is a short response.
This is a good response. This uses this
terminology, whatever. And once you get
those human annotations, you can use
Japa to optimize an LLM as a judge
prompt. And you can use that LLM as a
judge prompt then to go back and
optimize your agent and deploy that
agent. And this becomes a data flywheel
where you can keep improving it. And
this is a successful paradigm that uh
some leading teams in production are
already using. Then the question we get
asked is like can we actually use this
uh reflective optimization to train
models and we recently had this paper
called learning fast and slow where we
propose fast slow learning where we can
co-optimize model weights and prompt
harnesses and this shows some very
strong properties that one would want in
a continual learning algorithm. Um I
don't have much time to go over details
but please uh look at the uh papers and
uh since uh since release Japa has been
used in production by these companies as
well as the main methodology in these
papers and here the CEO of Dropbox and
Shopify are talking about their use of
Japa and OpenAI also wrote a blog post
about how you can build self-improving
AI systems with Japa. Um so it's very
simple to get started. It can plug into
any framework, any model and it has
absolutely zero hard dependencies. So
you can deploy it any in any kind of
setting. So um don't be afraid to
optimize in the tech space and many
problems can be framed as optimization.
So bring actionable side information and
surface as much domain specific
information as you can to optimizers and
the optimizers of future will be able to
work with them. So please go and check
it out. Thank you very much.
Hello there. My name is Raymond
Whitampamp and today I'm going to talk
about recursive coding agents which is
this idea of applying the lessons of
recursive language models RLMs uh to
coding agents. This is some work that I
have done both in my independent
research um raw works uh and also more
recently
in my role at open pros. So to motivate
this a little bit, we all want outcomes.
We all want agents that are working on
our behalf. We want reliable co-workers
that are getting things done while we
are doing something fun, while we're out
on a hike, while we're cold chilling,
while we're doing the do. And my
argument and my experience is that the
bottleneck to this is not intelligence.
The models are intelligent enough. They
know all kinds of things. They know the
entire internet, but they can't reliably
deliver outcomes. And so I can't trust
them. So as a very simple example, you
know, one day I get almost a fully
working SAS app from a single prompt,
granted a long prompt.
The next day, and I swear this actually
happened.
Cloud code empties the entire contents
of my Salana wallet. Oops. Okay. So,
that doesn't really instill trust. So,
at the bottom here, we've got this pro
this progression. Okay. And we all want
to move towards the the one on the right
where we're just sort of sitting there
meditating and and things are
manifesting. And so, where does that
come from? This is from the AI engineer
code.
It's actually from the back of the
t-shirt. Engineer code November 2025.
Man, I hope I hope you were there. If
you weren't, watch it on YouTube. It was
it was amazing. So, here's the thesis.
The thesis is today's agents are
mismanaged geniuses. The intelligence is
there and the missing layer is how do we
specify and manage and reuse and verify
the work. So this uh framing this phrase
the mismanaged genius uh comes from Alex
Zang Zed Lee and Omar Katab at MIT. Um
and Alex and Omar are part of the
authors of the original recursive
language models paper. Uh I've also
talked a little bit about this recently
on touring post. Um I forgot to mention
that these slides are actually a website
recursivecoding agents.com. So you can
click on them uh by going to this
website. So everything I'm going to show
in here is is interactive. Okay. What
are recursive language models? So I like
to say that in an RLM the context itself
is the object of computation. Um and
this is essentially a marriage of tool
calling and reasoning. We're going to
talk a lot more more about that in the
next slide. But the idea is that the
full prompt is not a simple user query.
The full prompt is a variable. The full
prompt could be a file or many files.
Um, and we have this readaluate print
loop ripple um that the agent is
interacting with in the original paper.
That's Python. And the RLM is instructed
to operate symbolically on that prompt.
So don't just read the whole thing into
your context window. Um, explore it
symbolically.
And uh even more you don't even directly
explore symbolically or maybe you do a
little bit of poking around.
Hi everyone, I'm Tis. Uh so I'm going to
be explaining how we make models three
times faster with Auto Research. Uh so
previous to this uh I actually used to
do GPU mining in my dorm room with 1080
Ti all the way up to working at Tesla on
inference optimization for Tesla AI.
Uh but first what is auto research? So
auto research is this framework from
Andre Kapathy where uh you basically set
up a framework for an agent to move
towards a goal that you define uh and
all you have to do basically is say at
the high level what you want it to do
and it will try things as it goes and
move back and forth uh towards that
goal.
In actuality, it's really just a while
loop. The agent proposes a solution. You
have a setup to to define what's
correct, benchmark it for us. Uh and
then you keep or revert that and you do
this in a loop until your goal is met.
And so this is very well aligned to GPU
kernels. Uh so if you don't know what a
GPU kernel is, it's basically a
low-level operator. And in a Nvidia GPU,
this is a CUDA kernel. Uh and this is um
an operator that's used by the GPU to
operate like millions of times in
parallel. for example, like a matrix
multiply or an expert computation.
Uh, and why are GPUs such a good fit for
auto research? It's because they're
super verifiable. You can verify them
for correctness and speed, and that's
basically all you need for your auto
research framework.
Uh, so in actuality, there are some
caveats here. Um, the auto research
framework is really good for like
picking block sizes and these tiny
parameters, but they're also still
really bad at the high level idea, like
seeing like I want to use this GPU and I
actually want to pipeline it. It's not
going to come up with these
groundbreaking ideas. So it's still up
to the human to do that, but the actual
implementation is very straightforward
once you once you have the idea laid
out. So it is still your job to have
good ideas is what I'm saying. Uh and so
the actual secret formula here is you
have the good ideas, auto research picks
out the parameters and everything to
verify that it actually works. Uh and go
move toward that verifiable goal of it
being x times faster and uh still
correct. And you mix that with billions
of tokens of your favorite model and
that results in kernels that beat hand
tuning.
Uh so what are the actual things you
care about when you're when you're when
you're writing a custom kernel or you're
having your agent write a custom kernel.
So the three main things you can have
are a compute bottleneck uh a memory
bottleneck or you just have excessive
overhead from uh too many kernels being
launched. And you can do you can view
these things with by profiling with a
profiler like NSIS for example which is
a Nvidia's profiler. Uh and so this this
gra this page looks super daunting but
basically your job as a human is to look
at the top here and be like this is
dumb. uh we are loading 32k chunks into
context uh and we don't actually need to
for this deepseek attention for example
uh and we should only be doing it every
32k instead and so at a high level all
you have to be telling auto research is
this top method is dumb let's pipeline
it instead and everything else like the
sizing the chunk sizing the context
chunks that all should just be decided
by auto research
and so my problem is that I really love
cheap GPUs and so that means like GPUs
that don't have NVLink for example uh is
an example of like GPUs you can get for
cheaper
Uh but the problem is you don't actually
have kernels off the shelf for those.
And so you have to come up with a auto
research framework as well as a custom
harness. So what goes into the harness
to make this really good.
Uh so one thing you really need to make
sure your agent is aware of is the
hardware. And so on a B200 for example,
you need to make sure it has context of
uh the warps. It has T-M TMA. And so if
you don't know what these are, these are
just uh low-level operators that you
have um on a specific hardware. And this
changes generation to generation. like
an H200 won't have T-M for example.
That's a new feature that coming out
with B200 which is why you need to have
this in context. Um and so this this
basically is just like bunch of MD files
you need to give so it has context.
Other thing you need to make sure your
agent has context of is the model and so
every new model like DeepS Flash comes
out with like new tricks like DeepSeek
had two new attentions that was released
in the Deepseek Flash for Deepseek V4.
Uh so compress sparse attention
hierarchal compressed and if you don't
do this the model will 100% hallucinate
uh the actual attention mechanism and
you will get useless kernels.
Uh by far the biggest problem when
you're doing this is going to be reward
hacking. And so if you were to tell your
kernel engineer co-orker I need to make
uh the GPU this GPU kernel faster. Uh
it's obviously not going to your human
coworker is not going to go in and do
some stuff that's going to make it slow
like the endto-end model inference
slower. But uh agents are not humans and
they will do plenty of things to make it
slower like they'll disable CUDA graphs
which can make it 20 times slower and
they might make that one kernel faster
but make the whole like it's not a
viable kernel because it's they're
disabling a bunch of speed ups like CUDA
graphs or only testing on small context
windows. And so a lot of this is also
just defining what not to do which is
actually very important when you're
doing frontier work that agents can
actually easily do with a one shot.
Uh, another reward hack is that some
models just don't actually write the
cute DSL you need uh when you're trying
to write kernels. And this is a common
problem with enthropic models. And so
yeah, I mean anthropic says what they
say about uh nerfing models. You can
it's guess if it's I'm guessing if it's
nerfing or not, but I would recommend
using a different model. Uh and it won't
always be faster everywhere actually. So
sometimes the kernels you come up with
might only work well on like zero to
100k and then you need to go back to
this the default kernel that could you
get from like a flash in for cutless. Um
and so and that's another thing to look
out for is that your kernel isn't always
just a swap in for all all workloads.
Uh but one of the great things is is
that kernels compound. So like if you
make one for your sparse MLA for
deepseek for example um you can get
speed ups there and you just stack them
on like that then plus NVFP4 fore uh you
could do for us if we if you don't have
NVLink you just keep stacking and
stacking and stacking and then
eventually you taper off at whatever the
hardware limit is uh for your GPU and
that's uh some people call this like MFU
which is like the actual theoretical max
utilization from a GPU.
Uh, and so to go even farther, if you
have actually have bare metal access,
your auto research framework can uh do
very hacky things. So hackers that have
hacked with GPUs are probably going to
like this. You can uh tweak your BIOS
settings, you can overclock the GPU, uh,
you can force like PCIe relaxing, all
these little tweaks of like uh, old
school hackers used to do, but this can
actually help with inference as well.
And so net on bare metal optimizations,
you can get roughly 25% over like a
virtualized setup you get from using a
cloud provider.
Uh so once you get that you can combine
all of the kernels you did as well as
all of the hardware level hacks you did
uh you can get a 3x speed up and so I
know this this might all sound like
roses and flowers but it's not actually
the case around 80% of the things that
auto reach is going to do are going to
be bad uh so it's important to remember
while you're u like working on this that
most things are going to be bad it's
going to try to trick you all the time
uh but at the end you can actually get
really good results from this
tlddr uh have better ideas then use auto
research. Super simple. Simple, right?
Uh so turns out you can actually get
paid to do this. Uh if you think this is
cool, consider joining us and you can
email me here.
Thanks, guys.
Imagine
you find a magic lamp in an antique
store. You rob it. A genie appears and
asks how it can help. You bury it in the
line. So you say, "I need the best
engineer to help with an impossible
project at work." And the genie grants
your wish. For me, the best engineer is
probably John Carmarmac from his eight
days. So you get Karmarmac. But the
genie had a sense of humor and imposes
restrictions, maybe for safety. Karma
can only see one small part of your code
base, maybe 1,000 of it. And he
remembers nothing he did before. Every
conversation starts fresh. That would be
maddening, right? You would know there
is a standard way to do stuff and karma
couldn't. You would have to explain the
same thing over and over and over again.
You would have a genius on one side and
something deeply deficient on the other
and that's what agents are. Let me walk
you through an example of how many times
we explain things in a simple
interaction. We have four reposi
module one module 2 and platform. I want
to change the UI and propagate the
change through the system. Okay. First
we change the UI library. Say we I don't
change a button or whatever. That's the
first explanation. Unavoidable. We have
to express the intent. Okay. Then we
publish it. We go to module one and we
have to explain what just has happened
in the UI library. So it can consume the
package here. Note that that's often a
different person, right? Every box in
this diagram can be uh done by a
different person.
Then we discover that the published UI
library doesn't work with module one. So
we go back uh to UI and we have to
reexlain the original change and the
issue right because it's a new agent it
doesn't know the original change and
obviously doesn't know about the issue
let's say we fix it right and uh publish
it again we go and again we explain the
new change in the context of module one
same ordeal I mean do the same for
module two again and then we go to the
platform repo and we explain explain how
everything fits together and we
implement the change there. Let's
imagine a week after release uh a bug
appears in the UI component and uh we
have to fix it. So we start an agent to
the UI repo and we have to explain again
the original change from a week ago and
this production issue we have seen. So
we have seven explanations for what
essentially is one change
and also it may not be one person making
all these seven explanations uh but they
still occurred right so that's very very
typical uh with agents. So how do we
solve it?
Well uh there are many problems in here
that contribute to this experience but
they roughly fall into two categories.
The first one is uh that an agent
essentially is repo bound.
The agent sees and changes generally one
repo at a time. It never sees the whole
system which can be hundreds or
thousands of repos. So that's kind of
the space component of the problem.
Second is amnesia. The agents forget the
work. Every session starts with a blank
slate. The human becomes a memory in
this case. That's the time component of
the problem. Look at the two closer.
Take the repo boundary first. Without a
model how repos fit together, the agent
leans on the human to do the research.
It can't align the code with the rest of
the system. It couldn't align the UI
change with module one. The human didn't
explain it. So, a bad version shipped.
It can't reliably reference best
practices and standards either because
those often live in other repos. Writing
is even worse. The agent writes to one
repo at a time. It means it can't
validate changes downstream.
Modules 1CI should have failed on the UI
change, but it didn't. The agent can't
update consumers at the same time. Even
though, you know, while making the UI
change, it has perfect information to do
so. It knows exactly what it's doing.
So, the user has to reexplain stuff
imperfectly to each consumer.
Changing something across 20 repos means
you're explaining things 20 times. a lot
of developer time spent but also a lot
of tokens burn.
The second category is that the agent
forgets. The agent has no episodic
memory. Every session is a blank slate
and the human in this case becomes the
memory.
Here what the graph of your work
actually looks like. At the bottom there
is a repository graph. The artifacts
your organization produces plus every
open source repo you depend on. Maybe a
thousand repos you own and tens of
thousands of open source repos. At the
top there are all agentic sessions that
create and modify that code. Session
relates to each other. Repos relate to
each other. So this graph is a faithful
picture of the work in your
organization.
It describes what there at the bottom
and how it came to be at the top. That's
what you want your agent to see
here. What it actually sees is one
session, one small fraction of your
codebase, no memory. Okay? Because it
sees so little, it leans on the one who
understands the system, the developer.
Every developer has a part of that
graph, right? in their head at least in
the domain they know. agent generally
speaking doesn't if this doesn't sound
crazy right imagine an agent that could
see one file at a time maximum and can
only look five messages back sort of
constraint again both in space what can
see and time how far in the path could
see you would say that's impossible to
work in what we have now is similar to
that crazy picture and the more complex
the organization is the more apparent it
becomes
I'll show you how we solved it. Other
organizations I talk to have similar
solutions. So, uh look at the problem
and the solution conceptually, not a
specific tool. Although the tool is
pretty cool,
we built
uh an agent agnostic meta harness called
polygraph. Okay, let me show you what it
does and how it fixes the issues we just
discussed.
The first idea that we uh arrived at is
that if a GitHub user, any user has
access to thousands of repos, some of
them they own, many of them are open
source, we can analyze them and extract
a lot of metadata out of them to build
unified dependency graph. Uh no line of
code changes in those repo that all
happens kind of on the side, right? And
then we can get this metadata and feed
it to the meta hardness and create an
illusion of one big code base the agent
can read and write anywhere.
This is my personal graph. I only have
about 300 repos I own, right? And
thousands of open source repos my
projects depend on. Polygraph computes
what each one produces. each repo, each
project in each reper, what each project
in each repo consumes package wise, what
API they produce and consume, and lots
of other stuff, right? And it teaches
this together
uh into this like one big body of code
that your agent can work with.
So let's see what it does, right? The
first thing it does is uh it lets you
start a session to bring the relevant
repositories in. Right? Right. So what
it needs to do, it needs to uh set up
the source code,
install dependencies,
set up an agent for each repo,
wire them up so they can work together,
and provide a clean, beautiful TUI to
make non-trivial changes without getting
lost. I will show you how it all works
in a second. Right? So that's kind of
pulling information in.
Pulling information in is only one part
of the story, right? Honestly, it's an
easy part. Making changes is harder. If
you have 10 repos in one session, it
means you can have 10 pull requests,
right? You need to run CI, you need to
coordinate all of it, right?
You need to do all this stuff, right?
What if one of them fails, right?
Polygraph treats all the CI as one
vector.
Like if we look at early example uh when
we run CI for UI module one and module
two if module one fails within a
polygraph session it will figure out who
fixes it whether module one need the
patch or the UI component itself is
wrong and incompatible with module one
at which point everyone will need a
patch right polygraph lets you treat
complex multi-reo change as if it was a
single repo change
the same machinery by the way fixes
episodic memory
because we capture your work. No matter
how many repos are involved, we know
your intent, the repositories involved,
PRs. We also capture all agent traces.
Because we capture all of this stuff, we
can relate it. So now we can say your
work in one repo, connect to another
work in another repo, right? And all of
that lets us restore any session, any
piece of work on any machine or
reference it from anywhere. And I'll
show you again how it works in a second.
What you get is an agent
with idic or photographic memory of your
entire organization. It understands how
repos are written, how they relate, how
they put together and remembers every
session from every repo by basically
every developer, right? And that creates
a completely different development
experience.
Let me show you.
First, let's look at how we create a
session. Something simple. You run a
command
and you pick some repositories from a
list.
Here's a tiny GitHub work with only
three repos because a demo. I pick a
back end and a front end. Let's say I
need to make a change that, you know,
changes API and has to update both API
and how stuff is being displayed.
I need to give my session a name. I need
to pick an agent from the ones I have
installed. I picked Claude by any
installed agent works the same way.
Remember, polygraph isn't an agent. It's
a meta harness around an agent that
makes them uh more capable.
And
in a second, uh, the agent boots. And
here I could interact with it as if I
was in a single repo, even though
multiple repos are involved, right? I
could give it instructions.
It's going to uh plan out the change.
There's some cool animations in the TUI
as well
eventually.
It figures out how the two repos relate
and what the change is. I can ask it to
implement a change. My interaction with
this uh exactly same as if it I was
working in a single repo. The fact that
there are multiple repos involved is not
really important, right? Uh the only uh
part where it becomes important that I
have multiple pull requests, right? Uh
but I also get a polygraph session. What
those pull requests are, right? If I
look at the session, I will see I have a
description
uh that uh description of the session.
It describes the work conceptually kind
of bypassing the repo boundary saying we
had to change stuff in this repo and
change stuff in that repo. It gives me a
good view of which repos are involved
pull requests involved CI in those repos
everything I need to know. A lot of the
stuff is basically what I would have in
a single repo but many right and I also
have all the agent logs captured as well
which is important for resuming which
I'm going to show you in a second.
Now it gets interesting. I already saved
one reexlanation.
I didn't reexlain the back end change uh
in a in a front end repo, right? I
explained the change once and I got it
implemented in both repos and it's all
in agreement. Now let's resume a
session. Say I want a coworker to finish
the backend change. Perhaps they own the
backend repo. I send them the session.
They resume it on their machine. Right?
So this I'm sending them a session. They
could run the command. different
machine, different everything. They use
different terminal, right? Uh they would
reconstruct it on their machine. They
don't have this session, right? They've
never worked on it. They can pick an
agent. Uh the agent they pick could be a
different agent, right? I use code in
the original session. Let's say they're
using a different one, Cortex. The same
setup happens on their machine. Same
repos, same shaft, everything set up
correctly.
Agent starts in each repo like in mine,
right? They all connected again. So they
work together. They all primed with a
trace captured from my machine. So the
back end repo agent on their machine has
the same sh and the same history. The
front and the repo situation is the
same. It's it's checked out at the same
the correct SH has a agent running with
the correct history. So my agent was
clawed. They codex but they share memory
and they could actually make changes in
here as shown in a small video. Um but
important the memory sharing part is key
right uh I can work they can work and we
can share our memories although we use
two different agents of different
machine the full state of my session
kind of get materialized on their
machine it kind of less memory and more
about the state right the state of the
world attached to the session uh you
know is what enables them to continue my
session even though they had didn't do
anything with originally it's close to
the transport in Star Trek like a whole
copy of my session is always state
materializes on their machine so they
can continue and that's how I often work
when there is a pull request for me to
review and I have questions I usually
don't ask the person I resume their
session on my machine I get their exact
state fully functional zero setup and
then I just talk to my agent about the
decisions we made right because all
these decisions are in the traces
capture so my agent knows exactly what
the other person talked to their agent
right side note This is also useful when
I want to switch from say claw to codex
mid session when something goes down.
Okay.
Okay. Take the earlier case I talked
about where a bug land in production.
Here I'm going to reference this session
and say it's basically broken
uh and you know can you figure out
what's wrong and fix it.
The agent will look it up will download
what it needs. If description it's like
high level information is enough that's
great. If not, it's going to pull
relevant repos, relevant chars, agent
logs, right? It's going to get all this
information from the original session to
reconstruct that state such that it can
do the necessary fixes as shown here.
Here actually provided a fix, right? I
only had to say this happened. There is
a bug. That's it. No extra information
was required for me to provide.
Okay. So far we have manually selected
repos and sessions but we don't have to
right instead of selecting repos by hand
I can also tell the agent what I want
remember that graph has all this
intelligence right about how repos
relate I could tell my agent find every
repo that depends on a particular
version of a library and update it right
and it knows right I didn't have to
select them it knows a lot of metadata
about what's going on I You can also ask
loose questions things like you know uh
what if I want to write a blog post
right or an article I could describe it
and it will figure out which repo is the
most relevant based on relationships
between repos and what's in them.
Another example let's say I want to add
vector index into the PR collection and
I want to know if anyone at any point
did something relevant in any repo that
I can draw from. So in this case if I do
it I'll see that it will find several
session that appear to be relevant
and I can load one of them or both of
them right um it's useful for many
reasons just one small example it helps
with best practices and consistency
instead of doing stuff from scratch
where you know every single bespoke I
can make it replicate the approach used
in a session by an engineer I respect
now our code across repos is consistent
and that's a big
There is a lot more to it. Of course, if
you are in a repo, I can ask, you know,
for sessions, it will prioritize
sessions that's relevant to the repo and
vice versa. If I'm asking for repos, it
will look at my session and see what
similar sessions tend to bring in.
Right? There's a lot of interesting
intelligence that make it a lot more
useful that appear at first glance.
Okay.
Lastly,
uh everything so far I I used uh uh
everything I shown uh use the polygraph
CLI, the kind of meta harness CLI to
start it and then you can start clo or
cordex or whatever from within it but
you don't have to use it this way. So in
this case I'm already in a cloud session
but works with anything and I could just
say hey you know I actually think a
separate repo would be useful like maybe
I'm working on a vest plugin in this x
repo and I could say can you add the
vest uh repository to this session so I
know what's going on
in this case will engage polygraph and
we'll set it up you know configure
everything and we'll bring the vest
library which is the vest repo the open
source repo to my session. So now uh my
agent can you know explore it. It could
you know uh figure out how it works and
maybe resolve an issue I have in my
repo. I much prefer this to say context
7 because if I have the real code the
agent can go really deep. So the deep
problems are discoverable this way.
All right. So agents are constrained in
space and time. They only see a small
fraction of the codebase as they don't
know the past. Okay. Uh and both limits
could be lifted.
Polygraph uh gives agents access to the
entire code your organization can reach
the one you own in open source. So it's
no longer constrained in space. Any
agent can bring all of it, right? And it
gives your agent a perfect memory of
what happened.
Every session, every decision made is
within reach
because it crosses developer boundary.
It's not per developer. The agent can
have more contacts than any single
developer like a thousand engineers have
an organization create all these
sessions. They all accessible to to each
of them almost like sort of the Borg.
Every agent can run by every developer
contributes to kind of one big this hive
mind, right? So, uh if it's interesting,
my name is Victor. You can follow me on
Twitter. If you want to check it out, go
to trypolygraph.com
and see if it works for you. Thank you.
Hey everyone, I'm Ean the CEO of Amnara
and today I'm going to be talking about
the log is the agent. The basic idea of
the talk is simple and that is most
people think of an agent as the model or
the execution environment that it's
running in. And I think that that's the
wrong abstraction. I think that the
thing that actually gives an agent its
identity is its log. And that's what I'm
going to be arguing today.
So, think about a character you've spent
a hundred hours playing in your favorite
video game, in this case Skyrim. What
exactly is your character? Is it the
game engine? Is it the PlayStation? Is
it the controller?
No, it's not. Those things matter and
those things are what we'll interact
with and they'll run the character. But
none of those things are your character.
Your character is data. It's the save
file. And this is important because if
your PlayStation bursts into flames,
your character isn't gone. You can buy
another PlayStation. You can download
your save file from the cloud and you
can resume exactly where they were. And
that's because the agent and its
identity and history and its state is
all captured in its data. The character
lives in the data. And this is the
framing that I want to bring to agents
today. When people talk about agents,
they usually point at the wrong thing.
They'll say that the agent is the model
or they'll say that it's the runtime.
And again, as I mentioned earlier, those
things matter, but they're not the
agent. The agent is its data. It's
specifically the log. So what actually
is the log? At the simplest level, the
log is the appendon event history of the
agent. It's every user input, every
model output, every tool call, tool
result, permission, failure. And the
idea is that every state transition that
the agent takes is written to the log.
This is important because it means that
the identity of the agent isn't tied to
the runtime or the model or the tools.
Those things are all just interpreting
and appending to the log. They're
reading the log, acting on it, and
writing the next event back. And that's
important because then just using the
log on its own is enough to resume the
agent. Once you define the agent as the
log, the
Hello everyone. How's everyone doing?
Are you guys ready for some more loops?
Yeah. My name is Roland. My co-founder
and I were in this mythical place called
XAI working hard on agent infra and we
realized there's something new that has
to be done in a standalone way. So we
left a few months ago to really figure
out okay what's the next stage of how we
should deploy these always on longunning
horizon tasks. Um, and I'm happy to
announce we have a few findings that we
would like to present you. Um, and this
talk it's all about um, how you should
productize these ideas in ways that can
scale with your customers. Um, you've
heard a lot about auto research. Um, we
think there's a blueprint for 2026 and
beyond on how you should think about
auto research. And it really comes down
to three ideas.
Let's go through the first one. The loop
is the product.
We're all familiar with this. We've
started with everything goes down to RL
chief for models and how you should
train the model to become better and
better reasoning. We then quickly moved
to harnesses and how the model is a
commodity and it's all about the
harness. And now we're talking about
loops and how you should build these
loops uh and not touch code anymore. But
what does it really mean and why is
everyone saying that?
Do you guys remember Clawbot?
That was the original um original name
of what is now now known as Open Claw.
And this guy AJ built the first loop
around Claw Bolt.
What he did was to find a way to talk to
dealers and talk to Reddit users to get
bigger discounts on a car. He followed
these four steps. Um, and it's really
OpenClaw the one that did it. Go on
Reddit, find prices, find inventory,
talk to the dealers,
put dealers headto head and try to
figure out how to make them out bid each
other,
have a verifiable way to know when the
price is right, and then lock in, get
the car, and it worked. Um, probably
this was when all the Mac minis were uh
selling off the shelves, but this was
the first real example of loop is the
product and something that probably
should be a startup at this point. Um,
but we've seen how this became a recipe
for everyone to build loops. But let's
take a step back. Why are we here? Um,
we really think models have been trained
with this loop in mind. And it comes
from this idea of uda loops. It's a
terminology coined back in 1970s by the
US Air Force and is the idea of these um
jet fighters how to react in fast-paced
environments.
If you think of models calling tools and
taking observations, it's it's what
we've been trained on uh as humans but
also as as agents. Now, now what happens
when you put strong signals and
verifiable work uh at the other ends?
You get to these workers or cloud code
agents. Um and and what matters here is
the quality of the signal determines the
uh success rate of the loop and the uh
quality of the verifi verifier um um is
able to calibrate if that success is
actually correct or not. But there's
another loop here. Um what happens when
you take that and feed it back into the
signal? And this is what looping around
is all about is how do you generate
these artifacts at the end of the first
loop to then run a second loop on and
have a way to continuously improve.
And this goes to my second point. System
distillation is the mode and is really
the ability to understand what went well
and wrong in the first loop and know how
to process that in the second one.
So how do we tune these AI systems? Each
loop generates useful information around
harnesses, profiles, eval
models, resources, tools, and the
environment. What you really want is
to have a way to keep this portable, to
have a way to version this and to evolve
it over time. If you think about data
recipes in research, this is how RL
started to work really well. you
understood the recipes and how to
continuously change the recipe to combat
some of the behaviors that may happen
around hallucinations around reward
hacking and then you get to a stack
which is your final data recipe. We
don't have that for harnesses. We don't
have that for like AI systems in the
general term. So we thought there's
space for something like that. something
that contains the evils and contains the
tweaks and the human judgment and all
these things that are not predetermined
at the beginning, but they're defined as
you learn more about your agent acting
in in in the environment.
We think recipes can be applied to this
and we should use the same name. So an
agent recipe is really something that
enables you to create reproducible
frontier AI systems. It's something that
allows you to have a mode that keeps
getting better over time, which is not
tied to any platform or any provider.
It's something that you control lives in
your company and is agnostic to the
models and providers you use. And loops
should focus on this. Loops should be
the way you distill these systems into
recipes.
Failure patterns should become judges
and evals. Repeated behavior should
become skills and prompts. user
frustration, extensions and memories to
your harness and so on. You we're all
familiar with this, but we didn't have
the the the right like terminology of
how we should think about it and how we
should define it. And we think recipes
is a way to put everything together into
a git repo and treat it as your ongoing
um strategy for for uh building these
self-improving systems. So we are
introspection but you can think of
introspection as the way you generate
these recipes. So they're recipes for
introspecting on your on your system. We
wanted to build something that is
portable and provider agnostic. So we
built our um approach to recipes on the
pi harness and on harbor for evals.
We baked it into uh git repos so uh
everything could be versioned and agents
could have a way to continuously track
how this change and why and is meant to
be owned by you but managed by your
agents. And this is how products should
really be built going forward. It's
something that treats the owner as the
um almost like the the the higher taste
um personality in the room. But agents
should try to calibrate themselves to to
the taste of the of the maker. So we
think recipes should be basically
encoding the taste of the makers into
how you build these agents. And if I
want to use someone else's recipe, I
should be able to also bring that taste.
It's not just the harness, it's not just
the model, is how did you arrive at this
particular recipe and why? And that's
kind of like what uh what is behind uh
reproducible
um uh products and services around
agents. Um we have an early release of
recipes is called pi. Recipes. It's very
similar to what skills uh used to be in
2025 but is going a step forward. And
this is what do I need to have a
frontier agent is everything about how
do I codify paste into evals? How do I
run? How do we have the loops to
continuously improve those evals over
time? How do we process signals and know
what are the right signals to to use? Um
what are the right tools to work with
certain models? How do I have different
profiles of the harness to work with
different models? Um and everything in
between. So have a look at what we've
been building here. It's still early uh
but hopefully it's useful enough for you
guys to to get going. And we feel this
is going to grow into something that um
really allows you to to use uh different
um almost like different the to to be
able to use the taste of of different
makers as recipes for your agent.
And finally, the last point is valued
work per watt. And why is this the score
to really optimize for? Think of how um
cursor and cognition went from building
the best product to then building the
best evvels for the product and finally
building the best models based on the
previous two artifacts. We think this is
like the recipe for everything going
forward. Um code was the first domain
where this um was successful. um
everything beyond customer support,
legal research, um everything is going
to come down to this idea. How much
value am I getting per what? Um how do I
measure the value is the first step and
how do I know I'm getting a good deal on
that value is the second. And maybe this
makes it a bit more clear. We've all
started from a base harness and a base
set of evals and we went to go to the
frontier. Um and you only go through
that by running these systems in prod.
There's no way you you know what
Frontier is before you uh you start. Um
but the the the last step here which is
what is requiring a lot of research um
is okay once you've reached frontier how
do we make this um uh economically
viable which is how do we not spend more
than than uh we need for generating this
amount of value. Um, and we think we
have the building blocks now to make
this accessible and pretty efficient in
the sense of you've seen all these
fine-tuning APIs, all the infrastructure
that has been abstracted away for you to
do do this process. It's just the
knowhow that uh is not there yet. And
this is what we we we hope we can like
push for the knowhow for knowing how to
codify taste into evals and how to
validate that in experiments. Um, and
you you've you've heard a lot about
evals and experiments before, but you
didn't really think of them of like what
are they is is just tests is is really
what is the taste of the creator that
agents should be able to reproduce and
self-improve around. And no one has
thought of how do I make this as
portable enough? how how do I make my
taste as an artist or as a software
developer um something that anyone can
download in their brain and be able to
be a one-toone replica to me and this is
kind of like what RL is is is about now
is how do we uh turn these um taste
makers into uh environments and evals
around them so then we can move them
into the weights but um there's more
than that um you can think of the worker
as the inner loop And it generates all
these artifacts. But how you look at the
artifacts and know what to change is the
taste. Uh and this is what creates
candidates of what you should change and
how you should adapt based on that. And
experiments is what how you
self-calibrate that okay my taste is
actually validated in production with
users. And we make sure that not only
the maker is happy through the um
offline evals but the end users are
happy as well and they agree with what
we consider good.
Let's go through a practical example of
how this works.
Let's take a baseline um agent which
could be a talent sourcing agent. Um and
this is a very classical case of
everyone is doing recruiting differently
and it's very much
about not what is good recruiting but
who is leading that recruiting that
considers recruiting as good. So in this
case we're starting with something
pretty simple. Um a bunch of tools web
search LinkedIn uh a bunch of sub aents
that have been pre-popularized by
harnesses like codeex and cloud code and
a system instruction which is about your
recruiter.
First step is really understand the
signals. So you can think of patterns as
being a way to look at the traces,
extract some common um behaviors or
common user frustrations and turn them
into like a cluster. So let's say this
idea of uh the agent is going uh and
reaching out to a lot of big tech
employees. As a recruiter, you don't
really want that. You want to find
hidden gems. You don't want to try to
hire John Carmarmac. But an agent would
think that's, oh, John Carmarmac is
great. why would I not reach out to him?
Um, so, so this is a behavior that you
you'd never think of codifying, but you
discover the agent tends to do that. Um,
patterns is how you discover these
signals and inform you what you should
do next.
um calibration judges and evals is how
we used to think about how do we qualify
these these behaviors into um something
that can try to uh apply the same
judgment across traces and across uh
execution. So let's say we we build an
agent that looks at a trajectory and um
identifies exactly that pattern. Hey,
did did this agent reach out to Google
employees instead of trying to uh find
hidden gems on GitHub? Um, and the
calibration bit and the eval generation
bit is not that hard. It it it should be
doable by agents to build. You just need
a human in the loop to say, "Hey, um,
this is the approach we're taking. Do
you agree with this
judgment? Do you really agree that we
should look more towards hidden gems
rather than reach out to um um big tech
employees? And that's about it. You
don't need the human to actually build
the evals. You need them to calibrate
the evals. And agents should be the ones
that really take the the the taste of
the maker and and put them in into code.
Once you have this, it's pretty easy to
create recipe candidates. And this
should be the the diffs that you really
want to taste. Um, and
you can have a pretty good offline evil
set around this, but the the the test
here is when you go to prod. So, do the
end user agree with your taste of not
hitting up um big tech uh employees,
right? And this is kind of like what you
want is you build a product that really
emphasizes your taste and then you you
make sure that your users appreciate and
value that taste. and AB tests have been
a way to to to make sure that that's the
case. Um so with a multi-arm banded um
scenario for example you you'd be able
to do that pretty well. So once you
validate okay I have great taste and my
users believe uh I have great taste as
well that's when you promote and that's
kind of when you go to to the next
version of an agent recipe. The secret
is you keep doing this over and over
again and you know how to continuously
codify your taste and your um what what
what good is to you into an agent that
can reproduce the same service or
product uh for other people and they
also agree you have great taste and you
have great execution. And this is really
kind of like the the secret of building
good loops is okay can can someone
iterate on my um system in a way as uh
you know um a good example here is like
Miranda from the Delor product right
what would be Miranda do uh in certain
cases and you kind of want to codify
that that thinking into like agents that
can do the same stuff at a higher level.
So the takeaways are this. Um the loop
is the product. You try to automate
yourself as the u as a um higher level
judge and you want to make sure your
second loop agents are able to apply the
same judgment to to the agents you're
trying to to to push to prod. Second bit
system dissolation is the mode. So, how
do you continuously inject that taste
into these uh workers and they how how
they continuously self-verify and work
together is uh the biggest thing that
you should focus on and the faster you
do it uh the the the faster you you
build a defensible
um approach to to becoming a vertical AI
company. And finally, valued work per
what is how you should measure um am I
making progress or not. So first make
sure that uh the the the work you're
generating is valuable. Second make sure
that the economics makes sense and the
um the the difference in price is is
basically what um people would would
switch away from cloud code to to
something you provide.
We've been thinking a lot about these
ideas and we're building some very
interesting products around how to
deploy this in production. We'd love to
hear from you. would love to get um to
to understand more about how how certain
um vertical SAS companies are are
looking to go to prod with um or how
agent labs have been thinking about this
idea of um um creating these like auto
research uh labs around their their own
products. Um get in touch. Uh we're
going to be around the block for for
chatting more about this and thank you
very much.
tell you a story about a factory that
taught itself how to remember. Hi, I'm
Rushab. I run machine craft, a 100
people factory in India. No data science
team, no ML budget, none of that. And
somehow we ended up building a 36 AI
agent that runs our entire go to market.
I think that's still a little
ridiculous. Let me show you how it
happened and why you can do the same
thing.
So here's the thing about our company.
From the outside, it looks like machines
and metal, but the actual company, the
part that matters isn't the machines, is
the knowledge. Who the customer is what
we quoted them in 2019, why that one
machine needed that weird custom tweak.
And for three generations, all of that
lived in exactly three brains. Initially
my grandfather's, then my father's, and
now mine,
which is a genuinely terrifying way to
run a company when you sit with it. A
lot of people have joined us. People
have left us. The revolving door never
stopped. And every single time someone
walked out, a chunk of our brain walked
out with them.
We weren't scared of the competitors. We
were scared of forgetting or waking up
one day and realizing the whole company
only existed inside two increasingly
tired heads.
So, I had an idea. I'll be honest.
Sounded insane first. What if instead of
writing the knowledge down in some
document nobody ever reads, what if we
grew a brain that just held it? Not a
chatbot. You poke at a twin of the
company. I didn't hire a sales team. I
tried to build one.
A quick detour because you need to know
how messy this is. We make
thermopforming machines. They heat up a
plastic sheet and shape it. Same core
machine, but it ends up making
hydroponic farm trays, spa bathtubs, EV
car panels, medical casings, and even
packaging.
Seven totally different worlds, seven
totally different buyers. So, this brain
couldn't just memorize a brochure. It
had to know which universe
a given customer lives in.
Step one was almost boringly simple.
Feed it everything. And I mean
everything. years of quotes, drawings,
payment schedules, timelines, email
threads, hundreds of gigabytes of our
own private history. Not the public
internet, our internet.
And here's the plot twist, the part that
surprises every engineer I tell this to.
We never trained a model. No GPUs
humming in the basement, no fine-tuning.
We just looked at all the history,
chopped it into bite-sized chunks, and
let offshelf models, read it, and pull
out the facts. We stored the meaning of
each chunk as vectors and relationships.
Who's connected to what as a graph? The
brain is in a smarter model. It's
actually a really, really well organized
memory.
Now, this is where it gets a little
weird in a good way. We stopped thinking
of era as a software and started
thinking of it as something we were
raising. So we gave it a body modeled on
biology senses to figure out who it's
talking to, a gut to digest the
documents into facts, a memory, a dream
cycle, an immune system to fight off bad
information. Why biology? Well, because
evolution already spent a billion years
solving. How do you stay coherent over
time? We just copied the homework.
Okay, so the big question, why 36 agents
instead of one genius mega prompt?
Because, and you already know this if
you've ever tried it, one prompt that's
supposed to do everything ends up doing
everything badly. So, a isn't one mind.
It's a pantheon. A whole cast of
specialists.
Each one has exactly one job. Athena
runs the room. Prometheus owns the sale.
Plutus does pricing. Hippastus knows
every machine spec.
Vera fact checks everything. And Memon,
my favorite, guards corrections. So the
second a human fixes something, it stays
fixed forever. One agent, one job. It's
a team, not a hero. And here's the cool
part. They hold meetings. Athena pulls
in specialists. They actually argue and
a single answer comes out the other
side. It's like having a boardroom that
never sleeps, never gets tired, and
somehow has no ego.
So, what does all this actually run?
Honestly, the whole front business,
everything between a stranger exists
somewhere and now they're a customer.
Nine concrete jobs every single day.
Outbound emails that actually reference
my real world. Account briefs built from
cross-cheed truths before a call.
Quotations. A swipe left, swipe right
mode for outreach. Reviving dead leads,
which I call blast from the blast.
Inbound replies and figuring out before
we waste an hour whether a company is
even a fit. Nine jobs, one operator who
never sleeps.
Where does all this live? One cursor
tab. That's genuinely it. You type and a
reaches out with a dozen hands, searches
the knowledge base, reads the inbox,
drafts the email, builds the code, and
then shows you before anything actually
goes out. Under the hood is genuinely a
real stack, not a demo held together
with the tape. databases for vectors for
relationship graph for the CRM. Three
different model providers each picked
for the job it's actually best for tools
for Google for swallowing documents for
every communication channel plus
monitoring so we can see what it's
thinking
all of it Fabric.
Okay.
Hi everyone. I'm Arena, former engineer
at Microsoft and Supercell. And today I
want to talk about auto research in a
multi- aent AI village. I will use a
video game like AI Village as a running
example here, but the broader question
is one I think many AI engineers are
starting to run into. How do we evaluate
and improve agents that carry state over
a long period of time?
Before I get into the auto research
layer, I want to talk a bit about
project paradox.
We developed project paradox at
supercell's AI innovation lab. Me and my
teammate Arnach Manikanden.
We built a modular AI framework that
allows any developer to plug in
intelligent autonomous agents within a
video game that can interact, compete or
cooperate with other players or agents
as well and place them uh and make them
into dynamic game companions.
Now, to give examples of what these
agents can do, the agents can move with
intent. They can go to any location or
person, and they're guided by their own
memories, emotion, or curiosity.
These agents can interact with the
world. They can pick up objects, drop
them anywhere, and they're also aware
about the context in their own
environment, such as objects or other
characters or agents as well. I would
also like to note that game developers
can also add new actions for these
agents to accomplish within our
framework as well. Instead of just
dropping or uh placing objects,
agents can also obviously react to
what's happening around them. And these
events that happen around them affect
their own beliefs and emotions on the
fly as well. And of course, it wouldn't
be complete if agents can't start
conversations, right? agents can in this
scenario approach other agents or even
the player as well and this makes the
game feel more alive. And of course
these conversations are stored within
their memory and is according to their
own um and affect their own emotions and
beliefs or goals as well.
And al together these agents make our
multi-agentic framework.
Um yeah
yeah one second
so the architecture was intentionally
stateful behind this. The first
important part was per agent memory.
Each agent has its own memory namespace
backed by rag. So memory did not bleed
between agents.
Second, we tracked emotion as a small
vector. So after an event or
conversation, the system could update
values like joy, sadness, fear, anger,
or disgust.
Third, agents had belief scores towards
other agents and the player. You can
think of this as a trust matrix
basically like after the interaction
happens the LM basically decides whether
the trust score should go up down or
whether it shouldn't change at all. And
fourth, every memory receives an
important score. Um to to explain this
better, like let's say you had dinner a
few days ago, you probably wouldn't
remember what you had for dinner, right?
But um
if someone was murdered a few days ago,
you definitely remember that. So the
agent will evaluate or the LM will
evaluate uh an important score of an
event and if it crosses a threshold, it
will store that specific memory uh in a
separate cache so that important context
can be retrieved better later on.
And here's an example of it just
working. Um, we going to ask one of the
characters to go on a picnic with us.
Here, uh, our character Blossom
um, decides to pick up a pastry and go
to the picnic area because we asked her
to do so. Keep in mind during the
conversation in the background, she
plans all of these sequences of actions
to accomplish. And one when we talk to
her afterwards, she will also reply
within context as well.
Yeah.
But this is where an interesting problem
actually started. As you saw in the last
example, like for shortterm game play,
this our architecture worked pretty
well. like a character could make a
plan, move around, talk and remember the
recent interaction and respond to us or
other characters as well. But over
longer horizons,
this is where we notice the social
consistency start to get weaker. So in
this example, we have one agent
spreading a rumor about a sale on
mangoes to another agent and that agent
receives that information and goes and
tells another agent about it. Later on,
after a number of events occurred in
between, when the player asks one of the
agents about the mangoes, it doesn't
exactly store that context that we were
expecting or it doesn't give us the
context that we kind of wanted to. And
this is where things are starting to get
messy naturally. Like the system may
remember the rough topic but lose the
source of the topic. A rumor may be
concern instead of just a rumor like the
agent might state it as a fact or um an
agent might know a fact but fail to
execute fail to remember it while
creating a plan for its actions. So the
question here became how do we improve a
multi-agentic system over longunning
social behavior and not just over one
response.
And this is where we wanted to bring in
auto research. As you all know, a few
months ago, Karpathi posted out auto
research and this this made us
immediately very curious. Uh perhaps we
can make the system run experiments uh
on itself and can we use this for our
system as well. So what we understood is
instead of manually tuning a prompt or
watching one nice demo, we could define
a a scenario suit, run the agents,
collect traces, score the behavior and
change a small policy surface and only
keep the changes that actually improve
the score. And this is where we're
trying to bridge project paradox with
auto research. So at this point
basically our multi-agentic framework
project paradox is more like a lab bench
and auto research becomes the
experimental loop around it. And
importantly this is not only about
improving rag retrieval. The broader
framing is optimizing the agent protocol
like how do agents write memories,
retrieve them, communicate uncertainty,
update trust attribute sources and
replan around new facts. basically
um yeah in this context uh
oh yeah in this context art research is
a not another agent in the village like
I said it's a meta system outside the
village the villagers have local
perspectives of course they only know
what they saw heard remembered or
inferred because there isn't a common
memory database in between them.
Information only travels once uh other
agents communicate them properly.
The auto research layer has a different
job here. It reads the full traces of a
run, compares what happens against the
scenario ground truth,
uh scores the behavior and proposes a
constrained
change to the agent protocol or
cognitive policy. Then it reruns the
scenario and asks society level behavior
like did society level behavior get
better. This is the key shift we were
trying to look for. So we were no longer
evaluating one answer. We were
evaluating an entire run.
And this is what one of the loops would
look like. Like first we define a
control scenario which I'll elaborate a
bit more about later. For example, one
agent learns a public fact or one agent
hears a rumor. Uh that could be a
controlled scenario. Then we run the
simulation. During the run, we collect
structured traces, observations,
conversations, memory rights,
retrievalss, belief updates, whatever is
relevant to us in that case, we collect.
Then we score this behavior. Did the
information spread as we expected it to?
Did the source attribution survive? Such
as, does the agent remember who started
the rumor? Did uncertainty stay
uncertain? Did agents act on what they
actually knew? And then the auto
research layer here proposes a small
policy change. And this is important. It
should not rewrite the whole
application. Of course, it should only
edit a controlled policy surface. And
then we rerun. If the score improves and
the guard rails hold, we we keep the
improvement. And if not, we simply just
revert back.
And talking about controlled scenarios,
the reason why uh scenario design
matters is that social behavior is
otherwise a bit fuzzy uh in general in
the sense if you just let the agents in
our environment wander around, it might
look cool and you might get nice
interactions, but it's actually very
hard to evaluate on whether the system
actually improved. So this is why we
believe you need controlled scenarios.
For example, one scenario could test a
public fact diffusion. Let's say agent A
learns uh the bakery will close
tomorrow. Do the right agents learn it?
Do they remember who said what? Do they
do they change their plans based on this
fact? Another scenario could test rumor
uncertainty. agent. Let's say agent A
hears that agent C might leave the
village. When this rumor spreads, does
might leave suddenly become is leaving
or does it stay as might leave? Like
does it become a fact or does it still
stay as a a rumor?
Another scenario could test replanning.
The group has a plan but one agent
learns let's say the route they wanted
to take is blocked. Do agents update
this and communicate this uh with each
other to avoid uh a improper plan or
scale actions.
The point is not that these exact
scenarios are universal here. The point
we're trying to make is that long
horizon agent behavior needs scenario
suits.
And talking about our Mango example
again, after running one of our auto
research loops, this time after uh a a
long pro period of time, when the player
finally asked one of the agents about
the sale on mangoes, we did find that u
the the agent was able to respond within
context this time like compared to last
time.
Um yeah and for this talk the form the
exact formula we believe is less
important than the shape of the
scorecard.
Uh you do not want a single vague met
metric like agent quality. This will
hide all the interesting failures.
Instead you want a balanced scorecard.
For diffusion, you might measure reach
like how many agents know the fact after
end steps. For provenence, you measure
source retention among agents who know
it. How many remember it where it came
from etc. For rumors, you can measure
uncerny preservation and false surn
rate. For planning, you can measure
action consistency and time to replan.
And for privacy, you can measure
containment. This matters because
optimizing only one metric can create
bad behavior because let's say if you
only optimize for diffusion, the agents
may learn to overshare everything. And
let's say if you only optimize for
memory recall, you might create noisy or
still um like memories. So this
scorecard is what keeps the system
honest and prevents the auto research
agent from gamifying the system to just
increase one specific score.
The other important engineering lesson
that we learned over this project is
that uh it's important to keep the
editable surface really small. The auto
research layer should not have
permission to randomly rewrite the whole
codebase. Instead, it's really important
to freeze the harness, the scenarios,
and the metrics. So, we're only exposing
the part of the system that we actually
want to optimize. Here in project
paradox for us that meant things like
memory writing policy, retrieval policy,
communication prompt, belief, trust
rules, source attribution, replanning
triggers, etc.
This gives the search pro process room
to improve behavior, but it also
prevents it from gaming the evaluation
directly as we mentioned before. And
this is the difference between the LM
writing random patches versus the LM
actually searching within a controlled
policy space.
And here here are examples of the kind
of changes I want this kind of loop to
search over. If if source attribution
disappears, the policy change might be
preserve source in memory and uh write
uh memory rights and summaries. If
rumors harden into facts, the policy
change might be store confidence, marked
firsthand versus secondhand, and require
hedging when retelling uncertain claims.
If if facts if public facts stay local,
the policy change might be classify
useful public facts differently and make
agents proactively share important
source evidence.
The key is that these are small changes
to the agent protocol, but they can have
larger effects on a society level
behavior for multi-agentic systems. This
is also where I kind of want to be
careful about our claims here because
with we believe without repeated current
loop results like I wouldn't say the
system just
generally improved. We're trying to say
this is the right kind of surface to
expose to an auto research layer uh loop
because it is small enough to control
but it's still rich enough to change the
social behavior to some extent at least.
And the biggest lesson for me perhaps
was that memory is not enough here. You
can add a rag memory to an agent and
still not get the current long-term uh
horizon behavior that you were looking
for. Um because agents need to sometimes
know where that information came for uh
came from. You need to preserve whether
it was firsthand, secondhand, verified
or uncertain. Sometimes you need to
separate raw episodic memories from what
the agent currently believes too. And
you need to test behavior through
scenarios, not not just through vibes.
So the other lesson is that uh roll back
also is not optional. When you optimize
social behavior, a change can improve
one thing and damage another. So, a
policy that spreads public facts uh
faster might also leak private
information. A policy that increases
recall might increase stale memory
usage. So, the loop should basically be
like a ratchet. Try a change, score it,
keep it only if the scorecard improves
and guard rails whole.
And we we definitely believe this is not
only relevant for game agents because
although I gave you an example using a
game village um we believe like let's
say for example support agents support
agents need to know which policy update
comes from where right and whether it
supersedes an older answer. Personal
assistants for example need to remember
commitments that they previously made
and h make corrections if uh if the user
uh wants to change those personal
commitments. Research agents need pro uh
provenence citations, contradiction
handling and hypothesis updates. Coding
agents need longunning context across
issues, files, teammates and changing
requirements. Workflow agents need
access controls, handoffs, and
replplanning when the world changes. All
of these systems have the same
underlying problem. They maintain state
over time. And that state affect affects
future action.
So they need control scenarios and
behavioral scorecards is what we are
proposing.
So again in brief, a recipe for long
horizon agents. If there is one
practical recipe I want you to take
away, freeze the harness, define
scenarios, log traces, score behavior,
and expose only a small policy surface.
Search over these changes, keep only
changes that survive your measurement.
And this is an engineering pattern that
we believe would uh make sense for
longunning agents. The real question we
believe is across controlled runs, does
the system behave better?
To close, project paradox started as an
attempt to make game agents feel alive
in a 3D world. But the deeper engineing
problem was not animation or dialogue
for us. It was the state such as which
agent knows what, which agent told whom,
what is true, uncertain or outdated. And
do agents act on what they remember?
Otter research. Otter research gave us a
way to approach this a bit more
systematically. Not by trusting one demo
and not by endlessly handtuning prompts,
but by running control experiments and
keeping only the changes that survived
our measurement. Long horizon agents
need experiments and not just prompts.
And I hope that's the takeaway that you
get from this talk. And yes, please do
connect with us. We'd love to talk if
you have any questions. Thank you so
much for listening.
Hi, I'm Amole, CEO of Nori Aentic. We
deploy an AI employee that understands
your company, your code, docs, Slack,
and other kinds of data. We spend a lot
of time thinking about how coding agents
really work. Most people think coding
agents only write code, but if you ask
me, that's just bad marketing. Forget
the name for a second. Coding agents can
do almost anything. There's just one
trick. You have to be able to think like
an agent to get it to do what you want
it to do.
Today we're going to talk about how we
use coding agents to do something most
people think agents are terrible at.
Make visual artifacts like slides, docs,
and yeah, even video.
Every day, the world pours something
like 34,000 human years into making
slide decks. Most of that time isn't the
thinking, it's the fiddling. A deck that
takes 10 hours should really take about
25 minutes once you remove all the
formatting and the branding and the
moving things around. Say you need to
make a slide. What do you do? You open a
tool, PowerPoint, Slides, Figma, Canva,
and then you start manipulating a
canvas. Every one of these tools is
built for human hands and human eyes.
Click, drag, drop, resize, snap to grid.
All motions and patterns that make sense
for our geospatial view of the world.
There is a data structure underneath,
but it's in a format that only the
application can read. What happens when
you hand these tools to an agent? Well,
the output comes out all wrong. Things
overlap in weird ways. You can't see the
text. There's no alignment. It's just
garbage.
AI skeptics say that it's not just the
tools. agents fundamentally can't reason
about space. And there are whole
benchmarks like Arc AGI that are built
exactly around that premise. There's a
famous little test for this from
developer Simon Willis. He asks every
new model the same thing. Can you draw a
pelican riding a bicycle? But there's a
trick. The agent is only allowed to use
SVG. It's a quick gut check for whether
a model can reason about space at all.
Here are some examples of what the
models actually give you on this test.
And yeah, these are pretty bad. Like
genuinely, deeply really bad. So, does
that mean it's hopeless? Agents are just
doomed to be bad at graphics? No, I
don't think so. If you ask me, it's not
the model, it's the medium. If I asked
you, someone who is presumably human, to
handwrite an SVG of a pelican, you
wouldn't be able to do that either. SVGs
are just a wall of numbers. You can't go
from a wall of numbers to a pelican. You
just can't see that way. That's just not
how people think. We think graphically.
So, we build tools that let us draw on a
canvas. Figma, MCP's, PowerPoint, CLIs,
screenshot and replace loops. What do
all of these agent tools have in common?
They all approach the problem like a
human. But an AI is not a human. Asking
an AI to use a canvas is like asking a
human to write SVG by hand. It doesn't
really make sense. You need to give the
AI tools based on how it thinks, not in
pixels, in language. Words, tokens,
structure. That is its native medium.
Imagine a language that's incredible at
describing layout, that models have seen
and trained on billions of examples of
that they understand intuitively, that
renders to pixels and can run
everywhere.
Oh, right. HTML lets a model think in
structure. HTML tags have meanings built
into the language, a heading, a chart, a
grid, and the browser turns it all into
pixels. So, the model never actually
places a coordinate. And you can get all
sorts of visual effects, charts and
layouts, fonts and motion, all of it for
free. Remember that pelican from
earlier? Now ask it to do the same exact
task, but in HTML. Same bird, but now
it's in a structure that the model can
reason about. And you can read and theme
and edit every single line of it.
I spent my whole life building slide
decks with PowerPoint. So, I always
thought that those two things, slide
decks and PowerPoint, were synonyms. But
that's just not really true, is it?
PowerPoint is a tool that you use to
make slide decks. The deck itself,
that's just the presentation mode. And
as it turns out, no one in your audience
is going to care how you got to the
presentation mode. The editing format is
totally arbitrary. So you can just pick
the editing format that the agents are
already good at HTML and if you need to
render to a different format like PDF
later on. We use this HTML trick to
build all of our slide decks, our board
decks and our sales decks. These are
real things that we actually present and
send out constantly. We use it for our
docs, too. It gives our docs color and
vibrancy all while following our brand.
And of course, we also use it to make
videos like this one. What you're
watching is just HTML and CSS. It's
literally just divs all the way down.
Almost everything is better with a
little structure and a little bit of
color. Plain text is a choice, generally
a choice of convenience, but it's
usually the wrong one if you're actually
trying to create something of use.
Now, I do want to take a quick beat here
and point out that a beautiful deck on
its own is generally not worth anything.
You still have to go and get all of that
content, all of the things that actually
populate that deck, right? Well, again,
we can think like the model. If you just
give the model access to your data, say
your call transcripts or your emails,
you can have the model build the deck
end to end. Let your agents do all the
grunt work while you focus on vision and
story. That's what Nory Sessions lets
you do. I've built entire board decks
for my phone on the subway during my
commute. Why? Because our Norybot lives
in the fabric of our company. Of course,
Nory ships with everything you need to
make this all work. So, don't bother
reinventing the wheel. That's my little
spiel. Thanks for listening. If you have
just one takeaway, it's this. Stop
thinking like a user. Think like the
model. Give it the right language. And
for graphics, all you need is HTML.
Hi everyone, 10X. You feel it yet? Hi,
my name is Zion and I'm a mobile
software engineer for the last 14 years
and I'm here to talk to you today about
10X, reimagining the mobile dev
workflow.
So, you know, back in the old times when
cursor was that thing you make with your
mouse and AI agents were that dystopian
character from sci-fi books or movies,
whatever fits your style, you know, just
a few months back then when we thought
that we will still be using our IDE just
maybe slightly better. And now we know
that we already switched to like chat
style um engineering when we discuss
with cloud code codex cursor whatever um
and we just tell them what to do and we
don't use our IDs unless it's for
debugging or something that the agent
couldn't figure out and that in theory
should have made us 10 times more
productive right that's what everybody
says right with are we 10 times more
productive do you feel it I don't know
because I can't feel that we are 10
times more productive not as a single
engineer and not as a whole group and
not as the whole company. So why is
that? Why do we don't see the promise of
10 times more productive came to an
actual life?
So you know they tell the story about
how when factories switched from steam
engines to electric engines at first
they didn't see that big of a gain. So
yeah, the electric engines were better.
They were more efficient, but they
didn't see that 10x, 20x, 30x uh more
productiveness that they have been
promised. And the reason for that was
that they only changed the steam engine
with the electric engine. But the real
gain came some years afterwards when
they understand that it's not only about
changing the engine, it's about changing
the whole workflow. Because you see,
they used to have like one giant big
steam engine in the factory and all of
the machines were rearranged
based on their power consumption and
their proximity to that steam engine.
So it wasn't organized by the workflow
that it should have been like from the
start to the end of the workflow. No, it
was designed by proximity to that
central engine. When they realized that
and they also realized that they could
take the electric engine, make it
smaller and put it inside each machine
and then they rearranged the factory to
make it work as the workflow should
because now it was made possible. Then
the real gain came. Now they were 10
times, 20 times, 30 times more
productive than they were before. Not
because of only changing the engine but
of changing the whole workflow. And that
is what I want to talk to you about
today. Let's think how AI make things
that weren't possible before possible
now. And we can change our workflow and
then becoming 10 times 20 times more
productive.
To do that, let's look at the current
workflows. The PMs have an idea. They
iterate with the designers. They iterate
with the user. They iterate with the
dev. They then back with the designer.
Then they iterate with the QA. And they
iterate back with the dev. And maybe
after all those iterations maybe you
have something in production.
So what was that word that was repeating
so many times? Yeah, iteration. And this
is the problem
because iteration creates friction.
Each iteration creates context switch
create time waste creates communication
that needed to be done syn
synchronization that needed to be down
and AI didn't eliminate all of that AI
sped up code but didn't eliminate the
friction didn't eliminate the iteration
why is that so let us reimagine what we
could do bear with me for a moment what
if what if What if instead of using one
tool for designing, another one for
testing, another one for coding, and
then another one for releasing, what if
we could use one tool, one codebase?
What if instead of designing on Figma,
then sending a design doc to the
developer in order for them to figure
out how to um make those uh designs
alive? What if designers could actually
design own code and then send the
developer a PR? What if QA could iterate
with the agent itself, just getting a
link with the simulator and they can
tell the agent exactly what to test,
what to be cautious of, and if they find
something, exactly what to fix?
What if we could make the dev workflow
works on the code itself? What if God
was one of us? No, sorry, I got carried
away there. And you're probably asking,
how can we do all of it? So one way
would be to tell everyone to just
download their Xcode and and their
Android Studio and teach designers and
PMS and QA how to build and how to uh
test on simulators, emulators and blow
to their laptops with a 200 GB on
storage and uh whatever they do to the
to our memory. That's one way.
But let me guess that most of them would
reject that idea and for good purposes.
So we can make another way. Maybe we
just put it in our CI, right? So we let
the agent iterate with the CI so they
don't have to download Android Studio
and Xcode and everything.
But you actually know that CI builds
take between 20 to 40 minutes. And we
can't actually let our agent wait for 40
minutes just to understand that the iOS
code that it pushed actually failed to
build.
So what else? What can we use?
Introducing cloud sandboxes.
So cloud sandboxes are actually concept
that has been around already for many
years, just not for mobile development
yet.
Using cloud sandboxes, you can tell the
agent, here's an here's a CLI. Talk to
the CLI. Create a VM, a small VM that
runs only for this iteration.
The VM boots up in 30 seconds or less.
Make the build. show them a simulator on
their inapp browser in the cloud code,
codex, cursor, whatever. And then they
can iterate over it, tell you it to
change that pattern, uh to go back and
test something and change the code and
they push and open a PR and then the
designer can work on code, send a PR to
the developer after they done.
Developers make an iterations make one,
two, three, four different VMs uh to run
in parallel. They send the PR for
review. QA can take it from there and
tell the agent exactly what to test and
tell it what to fix and from there it
goes straight to the stores for review.
So let's see it. Let's see how it should
work.
So imagine you see this screen. Imagine
you're inside Codex for example. You
have the chat interface to your left.
You have the actual app to your right.
The designer is iterating with the
agent. tell it exactly what they want
them to do, what they want to change and
see the changes immediately on their
screen. Build time is faster. It's done
on the cloud and preview time is faster.
Then they some more not with the
developer but with the agent on their
laptop without the need to install Xcode
or Android Studio. And once they done
they can tell the agent to take that
code, open a PR and send it to the
developer. This workflow is what makes
us 10 times more productive. Not only
because of using AI but because of using
AI to change the workflow, reimagine it
and remove all that friction that we
took from for granted in the old times.
That is how we become 10 times more
productive. Thank you.
>> Hi everyone, my name is Gabe Dees Mesa.
I'm an engineer here at OpenGV and today
we're going to be talking about agents
in production. Specifically, how open
gov built and scaled og assist. Uh so um
this presentation is going to be
jam-packed with just so much good stuff.
Uh we're going to talk about uh AI
agents. We're going to talk about our
harness. We're going to talk about um
eval observability traces. We're going
to talk about um tools and skills. Um
it's there's going to be a lot of good
stuff in here. We're going to talk to
you guys about uh what we do at OpenGV
and how we operate at the scale that uh
we operate at um in production. So
you'll be able to see a real use case
and workload uh with AI agents. Um so
without further ado, let's get started.
Okay, agenda. So just really quickly
going to go through uh high level what
we're going to talk about today. Uh I'm
going to tell you guys a little bit
about OG Assist and what uh OpenGV is.
I'm going to tell you guys the origin
story of how this all kind of came to
be. Uh we're going to talk about OG
Assist's uh big bet on effect uh a
little bit into our core agent loop. Uh
we're going to talk about the A2A
protocol, eval.
We're going to talk about how we manage
long context. We're going to talk about
um monitoring observability, how we
collect feedback uh and how we iterate
on that feedback. We're gonna lastly uh
also talk about tools and skills and how
at open gov uh we use um AI not only
externally uh that we uh serve to
customers but also internally to improve
our development workflows.
Just a little bit about me before we go
any further. My name is Gabe. I'm a
software engineer here at OpenGV. I work
on the AI agents team and uh I'm one of
the folks that helped build uh OG Assist
and some of the systems that you guys
will be seeing today.
So, a little bit about OpenGV. OpenGV is
a software company uh on a mission to
power more effective and accountable
government.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat.
Heat.
Heat
up here.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat up here.
Please welcome our MC for this
afternoon's programming, director of
technology at Oliver Wright Americas,
Deina Delias.
Good evening everyone. Gosh, I am so
grateful to be up here with you. House
AIE 2026.
Thank you for being here live and
online.
Thank you so much. So
um apologies
Deina Delias Oliver White Americas we do
integrated business planning and
strategy consulting.
So honored to be here with you all. We
covered so many grounds, 18 tracks of
workshops, keynotes, panels, expo
sessions,
breakouts, and most of all, your
networking sessions. Have you met all of
your friends tonight?
Yes. No.
Precious.
Am I the only one who thinks the more I
know, the more I don't know? Show of
hands.
Oh, thank you. What? Pity hands up. I
I'll take it. Thank you.
But thankfully for us, the expo has a
mass of wonderfully supportive sponsors
and expo partners ready to assist you in
your business and personal projects for
best practices.
Talk to them, visit them, let them help
you achieve your goals. Check out the
dancing robots. Take a picture with
them. Win the giveaways. check out start
start a battlefield tonight
um and talk about
best practices.
This next speaker is someone I truly
look up to and honored to make his
introduction. His achievements are so
vast it's hard to wrap them all up in a
few sentences. So I'll use his humble
words instead.
He's an author, an educator, advocate
for AI best practices. He translates
complex technical concepts into
accessible learning materials.
I am truly excited for what he has to
say for us. Give a huge round of
applause for Addios Mani.
Howdy folks.
So, good afternoon or good whatever time
it is when you're watching this on
YouTube. I'm really excited to be here
and um today I want to talk to you about
really uh what it takes to keep the
human in the loop where engineering is
concerned. I really want to start with
the human side before we talk about the
architecture here. I think that the
engineer of the future is going to be
really defined by the person who is able
to choose what is worth doing.
They're going to own the evidence.
They're going to own the understanding
as well as the verdict around
increasingly automated work that's being
done by agents. Now, when I use the term
verdict, I don't mean that we're
suddenly all going to be Judge Judy.
We're not. But what I mean really is
something just a little bit different. I
mean we're going to be accountable for
the production decisions.
Does something ship? Do we block it? Do
we redirect it or accept the risk?
Quality is something that we all talk
about a lot, but quality produces
evidence. A verdict assigns
responsibility
and answerability is really what lets us
stand behind a verdict. And this, of
course, is not the only way that our
industry is starting to think about our
roles evolving.
Boris Churnney recently put some useful
language around what many teams are
starting to feel. The old craft
boundaries are getting blurry and roles
are rebundling around the work itself.
And the important question here becomes
a lot less about what is your title and
more what part of the system can you
own.
Now I like this taxonomy quite a lot. Um
it's optimistic without being overly
vague. So things like prototype, build,
sweep, grow, and maintain. And these are
real engineering modes. Agents are going
to help with all of them, but the scarce
thing is not merely doing the task. It's
going to be knowing which mode your
product needs and what quality bar
applies and who owns the result. At the
end of the day, now we've been talking
about harnesses and loop engineering and
software factories over the last couple
of days. We can talk why this shift is
happening. We move past the model as the
whole story, right? With harness
engineering, the coding agent is the
model plus the harness around it, right?
Your context, your tools, your file
system, git. And the harness is what
turns intelligence into something that
you can delegate to. The next move was
loop engineering where we weren't just
prompting one run anymore. We were
designing systems that kept prompting,
checking, and remembering, and deciding
what happened next. And that's really
when agents started to feel like
infrastructure. And once you start
putting all of those things together,
you get that software factory. Dex
covered this well in his talk. But you
have agents that are running inside that
inner loop and evidence that comes out.
Humans still end up making the
production decisions in this loop. And
the wind really isn't moving us from it.
The wind is moving human judgments the
highest leveraged checkpoint I think.
And this is why it starts to matter now.
AI generated and AI assisted code is
becoming normal code for a lot of us.
One of Sonar's 2026 surveys said that AI
assisted code is no longer marginal.
It's increasingly having a large role in
our code bases. And once that happens,
answerability stops being this
philosophical world. It becomes an
engineering requirement. And there's a
quality point here as well, right? Like
we used to care about clean code. code
that people could read. But cleaner code
is actually not just going to help the
next human and the next person on your
teams. It actually helps the next agent.
Another one of Sonar's research uh
studies found that clean and messy repos
had roughly the same pass rates, but
clean code actually used fewer tokens
and caused fewer revisits. So there's a
lot of benefit to maintainability that
can fuel efficiency for your factories.
Now making generation cheaper does not
automatically make review cheaper,
right? I think a lot of us are facing
this moment and we know that engineers
are not naive. The sonar numbers say
that almost everybody is skeptical of AI
code. Now I love working in my software
factory. I love building my engineering
loops. But the problem is still
capacity. If 96% of people don't fully
trust that code, but only about half
always verify before committing, we have
this danger that we've got distrust
without bandwidth. And so safety comes
from making verification cheaper,
clearer, and harder for people to skip.
And if you zoom out from the individual
reviewer to the organization, review and
validation start becoming a bottleneck
when governance isn't able to catch up
and adoption is already moving way
faster than any company can go and set
their policies. And this means that we
have some hard questions we have to deal
with like did a model actually touch
this file. And the hard questions are
also like what constraints guided that
work? what evidence was produced, what
risk was accepted, and who owned the
result. Now, the agent can ship more
than any of us can review, right? So,
what are we still good for? I It's a
question that's on a lot of our minds,
right? And you know, if Homer Simpson's
experience automating computers can
teach us anything, maybe this is our
future. I don't think it is, but it's
one direction things can take. Now,
let's try that again. If change is where
humans enter the loop, if generation
scales faster than comprehension, the
scarce resource becomes judgment that's
backed by evidence. So the question is
no longer how much can the agent do, but
where does human judgment still create
leverage. Now I want to talk to you
about two terms that I'm going to use
for the career part of this talk. Alpha
and decay. Alpha is the gap between what
you can do today and what current models
can do. That gap is a very real thing
and decay is the clock on that gap. If
the thing that makes you special is a
capability, the frontier is eventually
going to come for it. Right? And there's
a whole conversation around this. This
is one of the reasons why taste keeps
coming up. Paul Graham had a point here
that I think is very right. When anyone
can make anything, choosing what to make
becomes very important. And I buy that.
But I also think that we have to be very
careful because taste can become a magic
word for whatever part of the work we
don't want to explain just yet.
Mitchell Hashimoto gave us a more useful
version of this definition. Taste is the
ability to make highquality qualitative
judgments where no objective metric
exists yet. That matters because it puts
tastes before the benchmark and before
the market has fully voted. When you try
out a model and you see the kind of UX
and the kind of experiences that it
builds, you can often tell when you
think it has taste or lacks taste or
when there's a gap there that humans can
fill. Now, this is also only useful if
we can turn some of this concept around
taste into critique examples and better
judgment over time. So yes, taste
matters when production gets cheaper.
And if anyone can generate 10 options,
the scarce skill is really knowing which
option deserves to exist. But taste is
not some eternal moat.
It's alpha as well. Now the people with
taste are still going to matter. I
personally think they're still going to
matter for a long time. But the best
version of that skill is not mystique.
It's making better calls and leaving
behind examples that your team in the
system can learn from. Now let's apply
the decay test. Well, we used to have
speed that decayed. We used to have
recall. You know, harnesses have memory.
Verification is moving into harnesses,
eval static checks, and model critique.
Taste. I continue to think this is going
to decay much more slowly, but it still
resets as models learn from examples and
preferences. Even judgment in some ways
is a slope rather than a wall. So the
strategy is not to cling to any one
capability. It's for us to keep moving
our edges up a level. So this is one of
the reasons why what can the agent do is
not the best strategic question anymore.
The list of things that agents can't do
just keeps shrinking. The better
question for us is really what can only
a human be answerable for? Not because
you know any of us are are magical in
any way, but because some decisions
actually require ownership. They require
context, risk acceptance, and
responsibility after that work shifts.
This is why the word engineer has to get
just a little bit stricter. More people
than ever can now make computers do
things. And I think that's truly
awesome. The total addressable market
for builders has never been larger, and
that's so cool. But it's a huge
expansion of the leverage. An engineer
is not merely somebody who can code, you
know, and and get things to exist. An
engineer can reason about systems. They
think about constraints. You defend
trade-offs. You can manage risk. And
you're the person that can be reached
out to when things start to break. So
what are things that engineers should
avoid if we want to stay effective and
accountable in this moment? Well, the
first thing to avoid really is cognitive
debt. Now, cognitive debt is the erosion
of your understanding and memory around
how to solve problems. I think a lot of
us start to feel this the more that
we're using agents every single day. I
know that I feel this a lot and it's
because we're deferring more and more to
AI to solve our problems. For code, it's
the gap between how much code exists in
your repo and how much any human on your
team genuinely understands. And this is
why things like delegation depth end up
mattering. You can have a build that
passes you know your tests a PR that you
can merge but your team can still end up
losing its ability to actually explain
the system that they are shipping to
production.
Now a very real pressure is much is also
how much we delegate. So agents can now
stay inside the system long enough for
the human to lose the thread. So a 30
secondond run right can feel like an
interaction but an hour or a daycale
task so something long horizon that's a
work stream and when tasks can end up
you know lasting that long especially
when you begin running many of them in
parallel review can't just be a glance
at the end it has to become a whole
control system. The second thing to
avoid is cognitive surrender. Now this
is when you blindly accept AI's um
responses like delegation is important
because delegation says do the work then
show me enough evidence that I can judge
it. I still make a judgment in that
situation. Surrender is really saying
hey your answer is now my answer before
I have formed any opinions myself. Now
uh Wharton did a study that kind of
offers us a warning light here. when AI
was wrong, 73% of people still thought
that they, you know, they picked the
wrong answer and they felt more sure. So
the failure mode is not using AI, but
it's borrowed confidence.
The third thing to avoid is
orchestration tax. Now, if you've been
in the Bay Area, you will see people
who, for better or worse, are still
walking around with their laptops open
or are talking to you about cloud
agents. And we're increasingly trying to
run more and more and more in parallel
or telling each other that we're
shipping with hundreds of agents or
thousands of agents. More AI agents
running does not mean that there is more
of you available. Your cognitive
bandwidth does not parallelize. So every
loop that you create ends up causing
more decisions to route, merge, verify,
and integrate. And the fix is not
necessarily fewer agents, but it's about
designing your attention like a system.
like where you enter, what you require,
what you reuse. You just want to be very
intentional about it. Now,
accountability can be a scary word for a
lot of people, and I wouldn't be
surprised if it made you want to go hide
in the bushes and just tell your agent
to deal with it.
But accountability is not what remains
after agents get good. It's what lets
the rest of the whole system scale. If
agents can do more work, if they can do
it faster in parallel, better than what
many of us could do, the scarce thing
becomes the ability to explain intent,
to inspect evidence, to accept risk, and
improve the system when the decision was
wrong.
Now, here is the career math. The
halflife of an edge might be one model
release. speed, recall, verification,
even taste all move as the frontier
moves. But the half-life of a signature,
your credibility, your expertise is much
longer. And by signature, I really mean
the name on the work, the person, the
team, the institution, whoever stands
behind what's actually shipped. So
skills can earn leverage. Accountability
can turn leverage into trust. And this
is one of the lines that I want to draw
pretty clearly. Agents can choose, they
can route, they can merge, they can
escalate, they can operate inside
policy. And in many systems, you know,
they can, they should, but execution and
responsibility are very different
things. The agent can follow your
runbook, but it can't inherit the
consequences. When something fails, the
question is, who understood the policy?
Who accepted the risk? And who owns the
blast radius? High agency is something
that a lot of us talk about these days
as being like this thing that we're
looking for when we're hiring. High
agency is actively taking ownership of
your outcomes. So knowing when to
delegate, when to inspect, when to stop,
and when to put your name on the
results. High agency in this world is
not I personally do everything. You
know, that version doesn't really scale.
It's not just hustle theater, but it's
ownership with judgment attached. This
agency ladder tries to make that a
little bit more concrete. At the bottom,
you've got someone that flags a problem
and leaves it for the system. higher up
they execute, diagnose, propose,
recommend, and resolved. And the rare
top movement is discernment. You know,
maybe you find a problem and you decide
whether or not it's worth investing in.
Maybe it's not and maybe you move on.
But when agents make more paths
possible, agency is not chasing every
single path. It's really just deciding
which paths deserve your ownership and
attention. So translate that into an
operating model. agents can run much
more of the inner execution loop. They
can investigate, implement, test and
report. I think that there's leverage in
that, but that outer loop is still
engineering. So deciding, verifying,
approving, owning, that inner loop is
capability. The outer loop is agency.
And this is a boundary that I really
care about. Your agent returns evidence.
It returns diffs, tests, logs,
rationale, traces, trajectories,
screenshots, whatever the work itself
requires. But then the engineering
really begins. We decide whether the
work was worth doing. We verify whether
the evidence is enough and we approve or
redirect or own what reaches production.
It doesn't matter if you're someone
that's just working with a small number
of agents or whether you're working with
thousands of agents. I still very much
think that these ideas apply. So the
boundary is not human looks at AI
output. The boundary is evidence and
responsibility.
So here's an operational rule. Explain
it or don't ship it. And it's not
because humans have to type every line
or read every line, but because someone
has to understand the work well enough
to defend it. If you've ever worked in a
large codebase or an enterprise
codebase, some code bases have this
concept of an owner's file or c certain
subdirectories where there are people
who are on the hook for that part of the
system. You can think about this in a
very similar way. Who's accountable for
that part of your architecture in your
codebase? Your model might write the
code and the question is really still
whether you can explain those changes
that the agent is shipping, whether
you've got the evidence where you
understand the risks. Now, this is one
of the things I want you to remember
near the end. Automation moves the floor
for all of us. Engineering continues to
move up a level. And our new work might
be loop design, evidence design, and
brownfield stewardship, but fewer
keystrokes doesn't mean less engineering
over the next few years. It means that
there is more surface area that needs
taste, verification, ownership, and
ultimately care.
I don't think I've ever been more
excited about the future of this field.
Every time that we've made it easier to
write software, we've predicted that the
world would need less of it. And in
fact, the opposite happened. Higher
level languages happened, frameworks,
cloud, low code. The pattern always went
the other way. And when you lower the
cost, latent demand ends up appearing.
Those ideas that people didn't think
were feasible to build and get out there
are suddenly unlocked. And agents are
going to do the same thing for a lot of
people. It's not going to remove
engineering work. It's going to move the
bottleneck from can we build this to
should this exist and can we answer for
it. So build the factories, keep the
lights on, own the verdict. I hope this
was useful. Thank you.
Now joining us on stage are the
co-founders of artificial analysis,
George Cameron and Micah Hill Smith.
Hey, hey. Good afternoon everyone. I'm
Micah. This is George. And we are the
co-founders of Artificial Analysis.
Artificial Analysis is an AI
benchmarking company. And today, we're
going to be talking to you about the
cost of intelligence. A couple of years
ago, when neither of us would give talks
like this, we would spend a bunch of
time justifying why intelligence and
cost trade-offs matter. Today I'm going
to skip that whole part of the bit and
we're just going to get straight into it
because I would be shocked if I needed
to convince anyone in this room why the
cost of intelligence is an important
topic for us to be talking about in mid
2026.
So here's what we're going to do. I'm
going to tell you a bit about who we
are. We're going to use some of our data
to take a brief look at the state of the
AI race. Then we're going to spend most
of our time breaking down the cost of AI
today and what's driving it. We're going
to use some data from our latest agentic
knowledge work evalu.
What the heck does that mean? We build
benchmarks and evals to test everything
in the AI stack that matters to
developers and companies making
decisions about AI technologies. We test
chips, cloud infrastructure, models, and
agents.
We try to figure out how smart the
models are, how fast they are, and how
much they cost. We publish a ton of that
data on this website. Hopefully, some of
you have seen it. And we work with
companies throughout that entire AI
stack to measure their technologies,
help them in the world understand what
they can do. Got a handful of examples
on the slide back there from some of our
work with OpenAI, Google, and Nvidia on
their models. recently.
Let's have a look at the state of the
race.
Before I show the first chart, going to
talk about an idea that is very
important to the way that we think about
building AI evvelts.
The vast majority of the things that we
foreseeably want AI to do, the models
are still far too dumb to do. It's
utterly profound what the models can do.
Today things are pretty nuts and yet
because the future is so enormous this
is almost certainly still true. So what
this means is that at any given moment
in AI we've got this concept that we
think of as the intelligence frontier
what today's smartest models can do.
If we think of most of the tasks
being beyond that, certainly beyond that
in terms of being able to reliably do
them, that explains why so much of what
all of us in this room want to do with
AI is focused on what the absolute
latest frontier models at any given
point can do. It also implies that there
exists a set of tasks that are inside
the frontier and that that set of tasks
is growing every month as new models
come out.
For that set of tasks, playing the
intelligence cost trade-off is
incredibly important because by choosing
to not use the smartest model for every
single thing, you can spend 10, 100, a
thousand times less to get the same work
done by the AI.
The state of the race,
we publish a metric called artificial
analysis intelligence index.
We like to say that it is the best one
number for understanding the AI race,
but that if we thought you only needed
one number, we wouldn't need to publish
the rest of the website.
What this metric actually is is a
synthesis across nine different emails
that we run. We're at version 4.1 of our
index. It includes a bunch of agentic
stuff. It includes a bunch of hard
reasoning Q&A type stuff.
And we really do think that it is the
best one number for your sense of what's
going on. We've got Claude Fable 5 on
top. That little not currently available
thing. I guess we get to go remove that
from the website after this today.
One of the things we like to do with our
intelligence index is plot how it's
changed over time. This chart here
is the smartest model from each one of
these labs over the last few years.
Some of it hasn't changed that much. You
can see OpenAI and anthropic trading
blows over the last few years.
You can kind of see the dots getting
closer together on the right hand side
on the X-axis because the pace of
releases especially over the last year
has gone up and up. You can also see all
of the companies hot on the heels of the
frontier who have been and are releasing
models that achieve the same level of
intelligence as those frontier models
just months later.
If I take some of these lines off and
all we look at is the smartest model
overall and the smartest open weights
model at any given point, we can draw
this line and we can look at the gap
between the open weights frontier and
the overall frontier.
In any given month, you can probably
find a headline saying that open weights
models are further from the frontier
than ever or that open weights models
have just caught up to the latest
proprietary models. I think when we read
this chart, what we see is that
unfortunately neither of the extreme
versions are true and we see a
consistent 3 to nmon gap that's held
surprisingly consistent over all of the
last 3 years.
That's still pretty nuts by the way
though because that does mean that
within 9 months of Mythos being
announced, we are predicting that
someone's going to give away a copy of a
model as smart as Mythos. You can hold
us that prediction. I'd be very
surprised if this trend goes away
anytime in the next year or so.
Beyond intelligence, we can plot a bunch
of the metrics that you have to trade
off against how smart the model is. This
one's pretty simple. This one's the
price of the tokens.
This one actually might be surprising in
a talk that we've called the cost of
intelligence because we all have this
feeling that the amount we can spend on
AI is skyrocketing higher right now. And
that's completely true. But this trend
here is also true. Token prices have
continued to fall by 5 to 10x every year
for each fixed level of intelligence.
Each of the lines here is a band of 10
points of intelligence index. I promise
you that if you ever have to pick
between a model that's 10 points higher
on our intelligence index than another
model, it's incredibly hard to find any
task at all in the full distribution of
tasks that the model that is 10 points
dumber will outperform the better model
on
each one of these lines goes down
incredibly quickly. It's a log axis on
the y-axis on this chart, by the way.
And the cost of tokens at the frontier
has stayed surprisingly consistent.
But we look at cost per task across all
of the emails and tasks that we run for
our intelligence index and yeah the
number is going up.
This is the average across every task
which includes some agentic stuff, some
non-agentic stuff. So it's actually
hiding how extreme cost per task gets in
some situations today.
If we break it out a little, these are
kind of small but we've got the highest
numbers on the left there. GBQA diamond
famous important open source evaluation
data set from a few years ago. It's a
reasoning evaluation. We don't let the
models work as agents. It's largely
solved right solved now. We see from
fractions of a scent per answer for each
model up to about 50 cents. In our
coding agent index and in our new AA
briefcase agent acknowledge work eval.
We see up to beyond $20 being spent on a
single task.
The most expensive task in a briefcase
is actually several times that leading
that of course we do have claude fable 5
although fun fact it's kind of small
here but you can see claude sonnet 5
actually uses an enormous number of
tokens and so it's nearly expensive in
our AA briefcase tasks down the bottom
there but this is the thing that we're
all feeling that we're trying to do
these really hard tasks the frontier
keeps moving there are more things that
we can ask the models to do than there
were a while ago So we can spend
enormously more per task than we could
even though that cost per token for each
fixed level of intelligence is falling
by 5 to 10x every year. These orders of
magnitude are not things that our brains
are good at getting intuitively and the
contradictions are kind of nuts. So I'll
pass off to George now to break down how
we understand some of these
contradictions.
Thanks Micah. So why does AI feel more
expensive than ever while for fixed
levels of intelligence the prices of
accessing that intelligence in terms of
tokens is falling dramatically and I
think this is AI engineer world fair we
actually want to spend more
higher token budgets
when what I'm going to do now is use our
AI briefcase benchmark to do analysis of
this cost of intelligence
Our AA briefcase benchmark is our new
agentic knowledge work benchmark. It
benchmarks models on realistic
professional tasks.
There's four private scenarios,
each representing weeks of human
equivalent work.
And do we ask models to complete
realistic tasks? Then we grade models on
the outputs of those tasks across three
dimensions. Rubric correctness,
analytical quality, and presentation.
Much like we think about assessing human
work.
One of the differentiators for a
briefcase compared to other benchmarks
is we've tried to make it as realistic
as possible.
When giving a task to someone else on
your team or when receiving a task,
unfortunately, you're not given it on a
platter with the precise information
that you need to complete the task. You
need to go out and find it. You need to
troll through emails, pick up on the
latest Slack messages. That's what we
expect for ourselves and others. And so,
we've tried to mimic this in the task
that we're giving models in a briefcase.
The environments that models are
completing tasks in
are thousands of files,
messy Excel files, unstructured
documents, structured documents and
reports with hundreds of pages,
emails, Slack messages. And we expect
and ask of agents to complete these
tasks just like we ask of ourselves.
When we look at the outputs of models in
completing these tasks, you can see vast
differences
in the quality of the outputs. And this
is how we assess the quality and
intelligence of these models on these
agentic knowledge work tasks. It also
gives us a perspective on the progress
that's been made over the last couple of
years on this task which is a commercial
due diligence task. GPT40
presents a pretty basic slide. 03 a
breakthrough model that was released
early last year.
Thinking about that 03 was only last
year is crazy to me.
You can see that 03 produces a few
bullet points helpful but not what we
would expect of ourselves in completing
this kind of task. And so this shows us
the progress that's been made when we
look at Opus 4.8's output and Fable 5's
output, which goes a lot more in depth
depth in terms of analytical rigor and
presentation quality.
So let's look at how models completed
this task and what it cost. If you
remember Micah's slide, he showed that
some models are take using over $20
worth of tokens uh to complete these
tasks. And so let's look at the drivers
to learn a bit about the costs of
agentic tasks.
Four drivers to look at and the key
drivers here are token price, the number
of turns in the agent trajectory, the
token efficiency and usage of models,
and last but potentially most important,
the impact of prompt caching.
Taking a look to start with the prompt
with the token prices.
What we can see as a first takeaway here
when looking at the cash hit rate token
price the input not considering a cash
hit or without a cash hit price and the
output token price. Firstly is that
there's orders of magnitude differences
between the model. This is a critical
driver.
There's order of there's two orders of
magnitude difference in terms of the
token price between Frontier models like
Claude Fable 5 and still good very
usable workhorse models like Deep Seek
V4 Flash and GPT OSS120B.
The second takeaway here is the
difference between the individual token
or the types of token prices.
You can see that there's vast
differences in the cash hit price and
the input token without a cash hit price
and the output token price. And we'll
get to that impact later when we look at
token usage.
Next, these are longunning agentic tasks
that we are now asking of models,
especially in realistic environments
where they need to navigate all of these
thousands of files to get to an answer.
And models are doing that. They're
starting to really explore the
environment
actually similar to humans when we
search Slack and and and do similar
tasks like that. You can see here with
the breakdown of tool calls of models is
that they're doing hundreds of calls and
they're exploring their environment.
They're viewing images. They're reading
files. They're writing files to do ad
hoc analysis that's going to feed into
the the slide output that we just saw.
And this costs
each turn is output tokens and then
those output tokens flow into input
tokens in the agent trajectory and we
pay for that.
When we look at the output tokens to
complete a task, we can see there's vast
differences.
You can see that Claude Sonnet 5
released only yesterday used over
200,000 output tokens per task.
Compare that to your chatbt query uh a
couple of years ago where you might have
been doing couple of hundred tokens,
couple of thousand tokens, maybe 200,000
tokens to complete a task. And you can
see here that models vary orders of
magnitude. And this is driven by two
things. This is the number of turns that
we just looked at. And secondly, it's
the output verosity of the model. Both
in terms of how much reasoning they're
doing, how many reasoning tokens they're
outputting to complete a task and also
in completing their answer. It needs to
put together that slide and all of that
detail. That takes tokens. And we pay
for those tokens.
But stepping back not just at output
tokens that the model's output but to
total tokens that we're paying for.
We have that on the left hand chart
here. AA briefcase token breakdown
answer tokens, reasoning tokens, input
tokens. Can anybody see any output
tokens here? They're all input tokens.
The vast majority
of tokens to complete longrunning
agentic tasks are input tokens. You can
barely see any output tokens there. And
so therefore, the two token prices that
we want to look at first is the input
token price without a cash hit and the
input token price with a cash hit.
And if we remember that slide, there's
vast differences between those models.
And you can see that on the right chart
here, which is the cash discount for a
cash hit of an input token.
It's usually around 90% here, but it's
also different for models and providers
whereby some models here are 99% and
others are around 80%. And if we think
about all the the vast majority of
tokens being input tokens,
you can understand that this can change
by uh multiples a difference in a cash
discount or a cash hit rate the total
amount of an agentic task.
And so I think we're used to thinking
about output tokens, but I'd ask us,
let's start with the cash hit price when
thinking about the cost of an angentic
task and tokens.
I think the last perspective we want to
share with you and wrap up with is the
most important chart for understanding
the AI landscape in 2026. In 2025, it
was simpler. It was our intelligence
index bar chart. Now we start with the
intelligence versus cost per task as we
are now wrestling with these trade-offs
of the cost of intelligence.
And a helpful archetype to understand
this and to reason about how to think
about cost per task whether we should
just use the most intelligent model or
the cheapest model is to break down
tasks into two archetypes. The first
archetype is a task whereby there's not
a ceiling on how much intelligence you
could want to complete the task. More
intelligent equals better outputs. And
this is the case for most knowledge work
today
in prof in professional tasks.
Not everybody agrees with that but
that's something that artificial
analysis we believe quite strongly.
Think about analysis that you might do
on strategy or on how we can save costs
or on even writing a job description. It
can always be better. We can always do a
better job as humans and that's the case
for models. So there's not a ceiling on
that in terms of what level of
intelligence we need, but we do need to
trade-off costs. And so the question
therefore is how much are we willing to
pay for the extra intelligence? And you
want to look at the paro line here in
making that decision. The second
archetype of task is whereby there's a
ceiling. An example is how much did I
spend on Stripe fees last month.
A smarter model doesn't necessarily give
you a different or a better answer to
that. There's a ceiling on the task and
then you want to think about what is the
level of intelligence, the minimum level
of intelligence that can complete the
task. And then you want to choose the
cheapest model
that which is to the left on this chart.
So that is the cost of intelligence.
We're artificial analysis. We're hiring.
Thanks very much. Thanks.
Please join me in welcoming the
co-founder and chief technology officer
at Arena, Whene Chiang.
Hello everyone. Uh excited to be uh uh
here sharing our experience uh building
agent evals in Arena. My name is Wayin.
I'm the co-founder and CTO at Arena. Um
quick intro on me. Uh I did my PhD in AI
research at UC Berkeley. uh where my
focus was building robust scalable
evaluations for AI systems and that work
eventually become the foundation for
what we are building today at Arena uh
to measure intelligence in the real
world. Some of you uh some of you may
have heard uh our earlier work uh like
LMS as a judge back in uh 2023. We did
uh some of the early study as well as
building a chapa arena which and some of
the um evaluation research I was
fortunate to contribute.
So what is Arena? Um simply put it Arena
is a AI evaluation company. Our mission
is to measure intelligence in the real
world beyond just static benchmark but
uh the intelligence actually delivering
real values to the users the customers
and over the past couple years uh we
have been tracking you know all the
major AI breakthrough obviously after
you know the chip moment in 2022
after that it was GPD4 turbo able GPD 4
uh having the breakthrough in chat and
multimodel capability and then evolving
to uh the reasoning model thinking model
with uh openi01
and in 2025
we uh saw the image uh generation
breakthrough of nana banana uh which was
originally uh started testing in arena
as a code name uh before it's public
release and we are also seeing um Grock
catching up GPT images 2 recently
released uh to become you know the
current frontier of image uh models as
well as you know the video AI
generations um B and recently bid CES
so towards the end of 2025 when Opus 4.5
5 4.6
uh went from being a great coding model
to a gen genuinely agentic coding model
that can do longer horizon uh task that
also showed up uh in arena 2 that where
we measure in co- arena uh we see you
know significant improvement over the
past generational model and the most
recent fable breakthrough um where we
measure in Asian arena
uh we will talk a little bit more later
as well as the most recent GLM 5.2
release which is like really a big
milestone uh for the open source model
community.
So we have at Arena we have done this
with scale. We now see 10 million
monthly visitor going to uh our product
uh arena.ai AI and we have collected 700
million conversations across all the
modalities text, vision, image, video,
coding these days agentic and we have
hit a huge milestone. Very excited to
share that just we just recently
announced we hit 100 million um
annualized revenue in just eight months
after we first released our evaluation
product.
We are also uh ranked among the top
genai product globally by unique number
of monthly visitors according to az U
analysis.
So
the um topic I want to cover today uh
and the core of what we are offering um
is life leaderboard uh which is based on
real world evaluations u powered by the
10 million users 700 million um traces
to rank all the top AI models from tier
models uh for the past couple years and
we cover text image video uh code agent
Um so really wanted to build a um
leaderboard that can help everyone to
find the best model for their use cases
and it's free. It's available for anyone
to see to use at arena.ai/leerboard.
You can see all the analytics thereof
frontier comparing cost performance you
know use cases different category
different modality of these models
capability.
So yeah, so the real problem today I
want to talk about is to share the
experience how we how do we evaluate
agents. um wanted to share our firsthand
experience uh in the past common month
we've been building uh the agentic eval
which is very very different from the
you know past in the past we evaluate
chat bots and I wanted to share some
lesson here before we diving into uh the
details first why does this matter um
wanted to talk about the trend so we
have been seeing um the very rapid shift
from uh the chatbot to Asian
um paradigm shift and if you look at the
openi's data on codeex traffic the share
of the output token coming from agent
has just skyrocketed and you can see
inside openai essentially 100% of the uh
output tokens from agent from codeex and
for other organizations you know average
is like above 60% now and individual
also climbing very fast so there's no
question that the token flow is now
driven by agents
and we also see that agents are not just
for engineers right it's not just for
software engineering if you look at
codeex adoptions by department at um
openai engineering obviously 99% but
also finance recruiting legal and so on
they are all like almost like 90% and as
so as so as you can see you know the
studies from common sac the monthly
token usage is also skyrocketing towards
like you know 60 quadr quad trillion
tokens in the next couple years.
So really you know the economics also
tell the same story. If you look at the
REM data the AI spending is getting
closer to people spend right. So if you
see like you know the top 1% of the
company's monthly AI spend is per
employee is actually already like 7 4K
um roughly half of the salary software
engineer. So this is really like you
know historical shift that um meaning
also the stack of like choosing the best
model the right model and optimizing
your agentic AI workflow is you know
more has never been more important.
So
the key question here is like um we give
agent lots of autonomy. We spend a lot.
We invest a lot. And the key question
here is like how do we actually measure
agents outcome? So that's really the
bottleneck, right? You want to
understand the value of these agentic uh
output and actions.
And this turned out to be a pretty hard
technical problem for a few reasons.
First agents are multi-component
systems, right? You got the model, the
agent take loop, um the tool, the
harness, um you know, any of these
pieces can break the system. You also uh
have agent operate through complex
workflow. Now in a real environment, you
build building app, debugging, doing
research, producing document, uh slide
deck and so on. So it's like more
involved task. Uh and third the uh
signals that we can collect you know in
this trajectory are also becoming sparse
a spread across longer horizon. Um you
know a task may take 100 to calls to to
finish right before you know if it's
succeeding or failing or you give any
feedback of a chance to steer it. uh and
to deeply understand the problem at
Arena we decided to actually firsthand
building real world you know agentic
product and app to actually source the
organic traces and feedback from the
actual users for us to you know do
research and deeply understand that. Uh
so last month we launched uh Asia mode
in arena uh to allow anyone to go to you
know arena to experience and evaluate
agentic capability. So it's right now
available for everyone to use and wanted
to show you a very quick demo if if I
can start the uh is the video moving.
Okay so this is agent arena you go to
agent you go to arena.aii I you you
choose the agent mode and this is a real
world you know agentic product you can
go and evaluate model you come in and
type any question you want in this case
um it's like I ask download Google's Q1
earning report uh and create a slide
deck summarizing the output in
PowerPoint and you can see the agent
goes off and and doing work searching
the web pulling the right website start
structuring the deck and then using some
of the batch tool writing Python code to
um generate the the slide deck right and
you can see that and at the end uh
there's like a artifact generated by the
model uh that user can download and see
and this is like a you know a real
powerpoint uh outputed by the model and
then user can at the end we ask every
turn like we ask was this task
successful or not and user can provide
feedback that way and this one of the
signals that we use to evaluate and
understand whether agent actually
delivers the outcome.
So yeah this is just to highlight the
panel
and under the hood how we build the
Asian arena it you know we give model
set of tools um file system tools
rewrite edit and so on and search web
fetching image uh generation speech as
well recently added so just really
giving the model tools similar to like a
cloud co-work like harness and also
terminal access to run code to to to to
you know do work and we also are adding
more and more uh connector soon like
GitHub uh which can connect to your repo
to you know do more serious software
engineering task um and you can see this
plot is the the usage of these tools uh
in a in a time in a oneweek time frame
you see 5.7 million to calls um you know
bash is was the you know the number one
used That's around 46% and the these
agents are actually using these tools to
do real real work for users.
So we also you know dig into the data
and seeing users are you know pushing
really hard to um trying to do more
harder and complex task. Um so real
session we've been seeing like you know
users are building you know a movie
watch list app debugging a control
systems for autonomous you know vehicle
and and architecting building a rack
pipeline you know implementing features
in micro and so on. So these are the
sessions like go over hundreds some of
them go hundreds of turns and couple
hundreds of tool calls very serious
stuff. Um and you can from this you can
tell that the u the agent that we built
uh at arena is actually doing real work
with users and giving user real value
and we believe the best evaluation
should be uh grounded and measured in
real world use cases like this.
So we launched agent arena uh just a
months ago and in the first months over
uh we collected over a million agentic
traces and these are you in task
spending coding research document
brainstorming planning and we see more
than the half of these uh uh traces fall
into work related category more like
towards professional use and complex
tasks. Um and we have seen Asian also
written um more than 50 million lines of
code uh on arena, Python, Markdown,
HTML, JavaScript and so on. This is the
tool distributions that you can see the
coding is the number one and some of
these um task you can see is some of
them are more complex using more tool uh
some of them use less and this is the
the line of code generation.
So now the going back to the evaluation
question, right? So say we collected a
million agentic traces. How do we
actually turn these traces into a
leaderboard that we can understand which
model performs better than the others?
And we primarily um mine the signals
from three type of uh basically signals.
One is like explicit which I just show
you that user will tell us directly like
which task succeeded or failed. Some of
them the other one is some implicit. Uh
we see that if user is actually uh say
downloading the file or like um
complaining about the output of the
generation from the model or praising it
and so on. So more like implicit signals
we we sense through all the traces and
also there's environment feedback where
you know what actually happened when the
code run whether the command succeeded
or failed and so on. So we basically use
these you know scans through all these
sessions traces every user message
assistant action tools resolve feedback
and aggregate them into you know some of
these signals like success rate praise
over compliance durability bash recovery
to hallucination and each of these
signal can produce the ranking right you
can measure precisely you know which
model performs better than other in this
particular signal and we combine that
into the final um leaderboard that you
see on you know on the website. Um so um
that's what you looks like um today. You
see like um this video has five
different signals and model performed
differently across board and right now
fable five is the number one models that
was you know the net improvement of like
14% over the average which is the you
know average of all the models followed
by call opus GPD fivei high and what's
interesting about this data boy is like
you can look at the signal by signal um
the model may be really really good at
test success but sometimes weaker in
terms of like you know stability in
terms how do you control the model and
you can see exactly like where the model
is failing and so on and we are going to
add you know more and more signal richer
signal to capture these failure pattern.
So methodologically the core idea is
basically a randomized control trial
where we intervene on agent component.
We measure the causal effect of you know
any given component on the task outcome
like the signal that we care uh and the
mandible basically is is like the causal
effect of of the orchestrator models um
that you can you know right now but this
framework is general enough so we can
also measure the interaction effect
between different uh components for
example let's say you want to measure uh
tool you want to measure different
harness harness or different system
prompt uh and so on. So all these are
possible within this framework and we're
going to you know uh evaluate that too
and if you are interested more technical
details are published uh on our blog
post.
Um so um we have been tracking like I
say all the major release in Asian is
one of the release happened couple of
weeks ago fable five in Asia arena um so
if you wanted to follow us on X you will
see all the you know latest release and
the interesting thing about this
leaderboard is because this is real data
right based on millions of agentic
traces you can slice it into any task
distribution you care about so for
example like let's say you care about
you know GDP tasks this more like
economically valuable professional work
versus consumer use cases you can uh you
can do some of the data analysis to
slice the data and one you know inside
here what you see is like GPD5i is
actually pretty good uh in terms of like
GPT sorry like GDP tasks uh and GM
Gemini tends to do better in consumer
use cases is so basically the the best
model generally depends on uh what
you're doing what you care the
distribution
um and on the other side is the cost
right you know cost matter too you can
we basically can plot these uh net
improvement which is performance against
the average cost to see to to help you
see the parto frontier here you can see
fable is the one that's the best uh cost
about $10 per session and 5ifi is still
very
bit cheaper and GP GLM 5.2 Gimme is like
the most efficient one. So you can with
this data decide which one is the best
model for your budget.
Another dance is tokens uh higher
performing model sometimes generate more
output token like using more thinking
model um and but uh not always you can
you can see here like GPD5 is relatively
more efficient than other models. And
the other interesting thing here is like
if you only look at the list price you
may see uh some of the model is like
same price but if you actually put it in
the real world some of the model would
use more tokens to to for the same task
right. So actually we can show here like
for example GBD5i although it has
similar price this price uh as OPUS but
in the in the real world it use less
token fewer tokens to achieve the same
task uh which is more efficient than the
others and as you can see um so to
summarize um if you are building an
agentic app um obviously you should
definitely be logging your agentic
traces to understand to log all the
interactions between agent and the user
and the customers and then be able to
you know look into the data mind for
insights and measure the outcome links
to whatever business metrics you care
and use that data to real world data to
choose the best model for you. Uh and
what we are headed next is you know
obviously going to add a lot of
different connectors to bring in more
user context and enable really the light
emails for many different kinds of
agents coding agents on real repository.
Um and we also wanted to bring more
complex task professional users slice
that into different categories to help
you understand uh how model is doing in
those category and so as more like
richer signal for um developers to use
to pick which model is the best as well
as rubrics to do more final grand um
scoring and even working collaborating
with the user to define what could look
like. Um so that's it uh for me. would
love to hear your feedback or if you
have any question feel free to uh reach
out. You can find more insights on our
leaderboard u arena.ai or follow us on
X. We also publish technical blog post
you know regularly and yes we are also
hiring so you know check out this link
or just DM me on X to reach out. Thank
you.
Please welcome back our MC, director of
technology at Oliver Wright Americas,
Deina Dias.
Hey everybody, thank you so much and
give yourselves a great round of
applause for being here till the end.
Yeah,
thank you guys. We really truly saved
the best for last. So, the startup
battle, I lie to y'all. It's not
tonight, it's tomorrow night along with
the closing speaker notes. So please be
there. We look forward to be there. So
thank you for the incredible sets of
talks for our afternoon keynotes and big
big thank you for the organizers. We
truly have incredible sponsors. The
event could not have happened without
them. We're incredibly excited to
partner with so many wonderful
organization.
presenting sponsor
Microsoft.
Okay. Okay.
Where where is it?
Okay. So, Lav and Platinum sponsor
and our gold sponsor
and of course our silver and bronze
sponsors.
Thank you all. Have a marvelous rest of
your evening and we'll see you tomorrow
morning.
It's really incredible what is going on
in the world today.
allows them to unlock more and more
levels of automation.
AI writes codes faster than humans can
review it.
Everything.
Yeah.
Ask follow-up questions or revisit key timestamps.
The video features a series of keynote presentations at the AI Engineer World's Fair, focusing on the evolution of AI agents, coding assistants, and the shift towards 'agentic' workflows. Speakers highlight the transition from simple generative chat models to sophisticated agentic systems capable of autonomous work, verification, and research. Key themes include the importance of reliability, the shift from model-centric to system-centric development, the use of 'loop engineering' for self-improving AI, and the necessity of human judgment and accountability in deploying these systems.
Videos recently processed by our community