WF2026: Autoresearch & Keynotes ft. Anthropic, Google DeepMind, Amazon AGI, Sonar, Arena, Recursive

Transcript

9:53

Launch control. We have a go. Roger.

10:14

I need

10:24

I need it.

10:29

I need

10:31

the right

10:41

baby.

10:43

Baby

11:14

Ladies and gentlemen, welcome to the AI

11:16

Engineer World's Fair. Thank you for

11:19

joining us as we continue an exciting

11:21

week of innovation, technical insights,

11:24

and conversations shaping the future of

11:26

AI. Now, please join me in welcoming

11:30

your MC, developer advocate at IBM, Tejas

11:34

Kumar.

11:47

Good morning, AI engineer.

11:52

We are here. We made it. We are here. It

11:55

is day two. It is such an honor and a

11:57

privilege to see so many of you here

11:59

today. This conference has broken

12:01

records, right? Last year uh was was way

12:03

fewer. This year, 7,000 people.

12:05

Incredible. Huge round of applause. This

12:07

is this is it. This is it.

12:12

This is where it happens. Listen,

12:13

there's announcements. There's

12:15

takeaways. There's content across 18

12:17

tracks. 18 tracks. There's expo sessions.

12:20

There's breakouts. There's all kinds of

12:22

things, right? And and undeniably, I'll

12:25

say this, there is value. Yes. If you've

12:29

got value, make some noise this morning.

12:32

Absolutely. Absolutely. I have learned

12:35

so much uh from so many brilliant people

12:37

here and and I have no question that you

12:40

have as well. Uh we had an incredible

12:42

keynote yesterday. We had so many

12:43

keynotes yesterday, um, where swyx

12:45

started the conference talking about

12:46

loops. Um the the theme was was loops.

12:49

Why is that funny? Um

12:52

okay, it wasn't a joke but um but we had

12:56

more keynotes after that about the the

12:58

golden age of AI, right? Um one thing

13:00

that really stuck out to me and I'm sure

13:02

many of us was wiring the agent into the

13:05

intent upstream really unlocks more

13:07

work. When we start to say why things

13:09

are important, we're able to unlock more

13:11

work and quality work. We don't just

13:12

hand it the task, but we say do this and

13:14

and this is why and this is how you

13:16

verify, this is how you deploy. We we

13:18

get so much more done. Um Teresa talked

13:20

about reliability, how important it is.

13:22

She talked about the 30x productivity

13:24

gap between leaders and laggards. Uh,

13:27

showing us that really it's about

13:28

reliability more than anything else. Um

13:31

huge focus about evals at this

13:34

conference. Um, and finally, I I was

13:36

really struck by by Daksh yesterday who

13:39

talked about uh reviewing 1 million AI

13:41

generated PRs and and found some

13:43

incredible insights. If you didn't catch

13:44

that, I highly recommend the videos, the

13:46

live stream. So cool. Uh, one thing that

13:48

stood out: Claude Code generates,

13:50

what was it, three times, uh, more auth

13:53

bypass vulnerability code, unfortunately

13:54

for now, but it's just so cool all the

13:56

insights that come out of this. Um,

13:58

today we've got a lot of things. It is

14:01

jam-packed day and I'm very, very

14:02

excited about it. There's the newspaper

14:04

if you haven't yet read the news. We

14:06

have a newspaper now analog uh just to

14:08

balance you know the AI. Uh so there's a

14:10

there's a daily print newspaper

14:12

available for you. There's a live stream

14:14

audience. Hello live stream. Thank you

14:16

for joining. Um there is over 100 expo

14:20

partners. Anyone been to the expo? These

14:22

expo booths are incredible. I've seen so

14:23

many cool things. There's robots lying

14:25

around. So much stuff. There's also this

14:28

cool device that I got the B one of the

14:30

sponsors. uh it's a notetaker but for

14:32

in-person meetings. Anyway, check out

14:34

the expo. It's so incredible. Uh, we've

14:36

got 3.5 days of expo and four uh stages

14:39

as well, expo stages. So, look forward

14:41

to that. We want to offer a huge thank

14:44

you and a massive round of applause for

14:47

the incredible sponsors. Honestly, this

14:48

conference would not happen without the

14:50

support of our sponsors. So, please

14:52

everybody, your hands together for the

14:53

sponsors of the conference.

14:57

We've got Microsoft, the presenting

14:59

sponsor. Keep it going. We got

15:01

Microsoft.

15:03

We've got the lab and platinum sponsors.

15:07

We've got You've got to keep it going.

15:09

We've got the gold sponsors.

15:14

We've got silver and bronze. We've got

15:17

so many sponsors. And this conference

15:19

genuinely would not be possible without.

15:21

So, we're very very thankful. Um, now we

15:24

get to introduce we get to open the

15:25

stage. This is so cool. Today is going

15:27

to be such an incredible jam-packed

15:29

agenda and I hope all of you can make

15:32

all that you want. I mean, there are

15:33

quite a few tracks, but don't worry. Uh

15:35

there's a live stream, there's also

15:36

videos. We're going to start introducing

15:38

our first speaker. Oh, I'm excited about

15:39

this one. Who saw the announcement about

15:42

Fable yesterday?

15:44

>> Yeah, let's go. I This is so exciting.

15:47

So, so uh coincidentally,

15:51

the first talk has changed today. Uh

15:53

we're going to This conference moves at

15:55

the speed of AI. It's so cool. Um, our

15:57

first speaker, uh, Thariq, comes to us

16:00

from Anthropic. Give it up for Thariq.

16:02

Comes to us from Anthropic.

16:06

Oh, I'm excited about I was talking to

16:07

him backstage and I said, "What what's

16:09

this going to be about?" Um, this talk,

16:12

I think the first time it's ever been

16:14

given, if I'm not mistaken, is about is

16:16

going to teach us all how to work with

16:19

the new Mythos class of models, uh, of

16:22

which Fable is going to be soon

16:23

available. So, your biggest round of

16:26

applause for Thariq.

16:32

Please welcome to the stage member of

16:35

technical staff at Anthropic, Thariq

16:37

Shihipar.

16:53

Hey everyone, I'm Thariq. Uh, I work at

16:55

Anthropic on Claude Code. Uh, before we

16:58

get started, we have a tradition on

16:59

Claude Code where we take a selfie before

17:01

a talk. So, if you don't mind, if you

17:03

strike a pose with me, I'll, uh, take a

17:05

quick selfie at AI engineer.

17:10

Okay. Incredible. Well, uh, yeah, to

17:14

kick things off, like we said, Fable is

17:17

back. Um,

17:20

we're rolling it out later today. Uh,

17:24

keep stay tuned for exact timeline. Me

17:26

and Cat Wu and Simon Willison will be

17:28

doing a fireside chat at 12:30. We might

17:31

have some updates for you then. Um,

17:35

but Fable is a model I'm just so so

17:38

excited about. It's one of those

17:40

Anthropic models where you just, like,

17:41

you're just going to remember it. Like

17:43

Sonnet 3.5 new, Opus 4, Opus 4.5. It's a

17:47

model that I just have a lot of like

17:49

affection and excitement for. And the

17:52

best way to describe Fable to me is like

17:55

the the map is opening up, you know,

17:58

like you are playing like an RPG and

18:00

you've been on the tutorial and now you

18:03

get to the point where the like, you

18:04

know, the open world starts, right? And

18:07

there's so much that you can do and

18:09

explore. Uh but there's also it's also a

18:13

little bit intimidating and confusing,

18:14

right? Because there's so much you can

18:16

do. And so what I wanted to do in this

18:19

talk is give you guys a field guide to

18:22

fable, right? How do you work with this

18:25

new class of models?

18:28

So I've got four parts to it. I've been

18:31

working on this as a series of articles

18:33

and blog posts. Uh, but, you know, when we

18:36

announced Fable was coming out I was

18:38

like okay let me do uh all of this at

18:40

once at the talk uh you know uh

18:44

speedrun. So there are four parts:

18:46

unhobbling Claude, finding your unknowns,

18:49

dealing with the grief, and being

18:51

unreasonable.

18:53

So first, unhobbling Claude.

18:58

uh I think something we say really often

19:01

is that the models are grown not

19:04

designed right we don't wake up and be

19:06

like, we need 99% on SWE-bench, right? Like,

19:10

the models are you know something we we

19:12

grow carefully we give it data and

19:14

feedback and compute um but ultimately

19:17

it's you know something that we it's a

19:20

little bit organic and we sort of figure

19:23

out and learn with the model as we use

19:25

it. And so um that what that also means

19:29

is that what contains them is us, right?

19:31

The harness we put them in and the way

19:33

we prompt them is basically like a

19:36

function of our understanding of Claude,

19:39

right? And by unhobbling it, I mean how

19:42

can we understand Claude better to

19:45

unleash it? And we need to understand

19:48

Fable more. So I think one of my points

19:50

is that you know uh we're still so early

19:53

and I think there's a lot more

19:55

understanding in Fable uh to unlock

19:59

and uh I think I'll give you a quick

20:02

example about how models get smarter

20:04

because it's a little bit unintuitive

20:06

right like there I saw this viral tweet

20:08

a couple weeks ago being like you know

20:10

why can't LLMs say which Pokemon end in

20:13

"aw"? There are a thousand Pokemon, right?

20:16

And it turns out there are two whose

20:19

names end in "aw": Croconaw and Drednaw,

20:21

right? And it turns out if you ask like

20:23

a normal chat model, it can't answer it,

20:25

which is kind of confusing because like

20:26

you know it definitely knows all the

20:28

names of the Pokemon, right? But if you

20:32

uh, ask Claude Code, it can, right?

20:34

Because what it does is that it fetches

20:36

every Pokemon and writes a script to

20:38

filter for AW, right? And so this is

20:43

what I mean by, like, unhobbling Claude. We

20:46

call this capability overhang, right?

20:49

Claude gets smarter in spiky ways. So it

20:52

doesn't just remember every Pokemon and

20:54

reason through it, but if you give it

20:56

the code execution tool, it can find the

20:59

two Pokemons that end with AW, right?
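
[Editor's sketch] The filtering step described here can be shown in a few lines of Python. This is only an illustration of the kind of script an agent might write, not the script Claude Code actually produced. The function name `names_ending_with` and the five-name sample list are assumptions made for the example; a real run would first load the full list of names from some data source.

```python
def names_ending_with(names, suffix):
    """Return the names that end with `suffix`, ignoring case."""
    suffix = suffix.lower()
    return [name for name in names if name.lower().endswith(suffix)]


# Hand-picked sample for illustration only, not the full list of names.
sample = ["Pikachu", "Croconaw", "Bulbasaur", "Drednaw", "Snorlax"]

matches = names_ending_with(sample, "aw")
print(matches)
```

The point of the sketch is that the check is exact string matching over a list, which is trivial for code and awkward for a model recalling names from memory.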

21:01

And so this is I think part of the

21:03

challenge with Fable is figuring out

21:04

this capability overhang. What is now

21:07

possible? And I think this is like a

21:08

discovery that I'm excited to go on with

21:10

you. Uh to make this a little bit

21:12

clearer, I'm going to talk about a few

21:14

different examples of how models have

21:16

progressed in the past. Um one of the

21:19

big examples obviously is like chat. You

21:22

know, the chat models had to be

21:24

given context, right? Like maybe you

21:25

paste in your codebase and maybe naively

21:28

you might have thought like you know the

21:29

way we solve coding is by the context

21:31

just gets really large and I can just

21:33

paste in my entire codebase. You know

21:35

it'll be a 100 million context window.

21:37

But it turns out that instead if you

21:39

give it arms like you give it the bash

21:41

tool and ways to work with the

21:43

environment it can build and search its

21:45

own context and that's sort of like the

21:47

insight that led to Claude Code, right? And

21:49

so again spiky like a new like

21:52

innovation kind of right in how we think

21:55

about and work with the model and then

21:57

recently we've rolled out cloud tag uh

22:00

and what's sort of unlocked cloud tag is

22:02

its ability to work proactively and

22:04

multiplayer. Uh, Claude Code, you know, is

22:07

something that you have to prompt for it

22:09

to do work, right? And uh this ability

22:12

for Claude to wake itself up and do work

22:15

is something that we think is unlocking

22:17

the new wave of agents. But there's

22:19

there's more here. So for example, uh we

22:22

recently removed 80% of the system

22:24

prompt for Claude Code, right? And this

22:26

is one of the ways in which models, you

22:29

know, and what they need uh changes over

22:31

time. So originally like you know maybe

22:34

back in Sonnet 3.5 new, the best

22:37

practices for a system prompt was a

22:39

small system prompt few tools and lots

22:41

of examples right and then as the models

22:44

get smarter you can give them more

22:45

information and more instructions and

22:46

they start following them and so it's a

22:48

larger system prompt with lots of

22:50

examples and many tools right but most

22:53

recently we found this new class of

22:56

models want fewer want a smaller system

23:00

prompt the examples tend to constrain it

23:02

because it's actually more imaginative

23:04

than the examples we give it. And so uh

23:07

and we try to give it context and

23:09

not just constraints. We really try and

23:11

avoid being like do not do this. Um

23:14

which is really necessary for the

23:15

previous models. Um and so this is like

23:19

a way that the system prompt is changing

23:20

and and probably will continue to

23:22

change. Uh another feature I really like

23:25

is the ask user question tool. This is

23:27

something I worked on when I first got

23:28

to Claude Code, and it's, uh, when

23:31

Claude, you know, is planning or

23:34

wants to ask you a question, it can show

23:36

you a multiple choice dialogue. Uh for

23:38

Opus 4, it could barely call it. I had

23:41

to like really tweak the tool to make

23:43

sure that it was uh that it would work,

23:47

right? And then sometime around Opus 4.5, I was

23:49

like, well, what if I asked it to like,

23:51

you know, ask me 40 questions about the

23:53

spec, it can start interviewing me,

23:55

right? And so its ability to ask

23:56

questions jumped, right? And then most

23:59

recently with Opus 4.8 and Fable, I can

24:02

now build a whole HTML report with the

24:05

questions embedded inside of them. And

24:08

uh it's just like a whole new way of

24:10

interacting with uh with Claude, right?

24:12

And and so this progression of like how

24:15

Claude can get information from you is

24:17

also changed. Um speaking of which, uh

24:20

markdown and HTML is something I've also

24:22

talked a lot about. Um, you know,

24:24

initially Markdown was a good

24:26

output for the model um you know it

24:29

could show a little bit of rich

24:30

information and then you know with plan

24:33

mode it started to be for you like you

24:35

could understand what Claude was about to

24:37

do. Um, and now, you know, Claude can build

24:39

you these in-depth HTML reports right

24:42

and so again a way of this the models

24:44

getting smarter in a spiky way I really

24:49

like to emphasize that this is closer to

24:51

a biology than a physics, right? It's

24:54

still very empirical, very organic. Um,

24:57

we don't know all the rules, but there

24:59

is some sort of science behind it,

25:00

right? Like there is an intuition to

25:02

build as well. And so I really, you

25:04

know, encourage you to treat Fable like

25:06

that. Uh, one of my favorite papers uh

25:09

that we've written at Anthropic is

25:10

"On the Biology of a Large Language

25:12

Model." Um, all of our research papers

25:14

are meant to be read by, you know,

25:16

people with various degrees of technical

25:18

expertise, but this is one of my

25:19

favorites. So, uh, if you're looking to

25:21

learn a little bit more, suggest you

25:23

check it out.

25:24

But, so, uh, yeah, we talked about

25:27

unhobbling Claude, but it turns out when

25:30

you're working with Fable, you also need

25:31

to unhobble yourself, right? And so, one

25:35

of the things that I think a lot about

25:37

is that the map is not the territory,

25:40

right? When I'm working on a coding

25:41

problem, the plan and prompt and spec

25:44

that I have in my mind is the map,

25:46

right? But the territory is the actual

25:49

codebase, the real world, the

25:51

constraints that Claude needs to

25:53

navigate, right? And whenever Claude

25:55

runs into something in the territory

25:56

that's not in the map, I call that an

25:58

unknown, right? Claude has to figure out

26:01

what to do about it. It's a decision

26:03

point that I haven't specified. And

26:05

Fable is one of the first models where I

26:08

felt that like I really have to figure

26:10

out my unknowns because if not it's

26:13

going to traverse such a large area that

26:15

like it's going to run into a lot of

26:17

them. So how do you figure out your

26:19

unknowns? Um

26:22

with Fable, I'm bottlenecked by my

26:25

ability to match the map and the

26:27

territory to find my unknowns. So a few

26:32

um few ways to think about this. I like

26:34

to think of it in a a matrix. So like

26:37

for any problem, I have a bunch of known

26:39

knowns. This is usually like what I

26:41

write in my prompt. What do I want?

26:42

Right? Then I have known unknowns.

26:45

Things that, like, I know I don't

26:46

really know yet, but I just haven't

26:48

figured out yet. I can um uh yeah, then

26:52

I've got unknown knowns. Like, what's so

26:53

obvious that I just wouldn't write it

26:55

down, you know, but I I know it when I

26:57

see it, right? And then finally,

26:59

unknown unknowns. What haven't I

27:01

considered at all? What do I not know?

27:03

right? Like what is something that if I

27:05

knew could change how I prompt Claude?
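
[Editor's sketch] The two-by-two matrix described here can be written down as a small classifier. This is one hedged reading of the talk, not the speaker's own tooling. The `Item` class, its field names `aware` and `known`, and the example notes are all assumptions made for illustration: `aware` means you have noticed the question exists, and `known` means you actually have the answer.

```python
from dataclasses import dataclass


@dataclass
class Item:
    note: str
    aware: bool  # have I noticed that this question exists?
    known: bool  # do I actually have the answer?


def quadrant(item):
    """Name the quadrant of the known/unknown matrix an item falls in."""
    first = "known" if item.aware else "unknown"
    second = "known" if item.known else "unknown"
    return f"{first} {second}"


# Hypothetical notes for a single task, one per quadrant.
items = [
    Item("what I wrote in the prompt", aware=True, known=True),
    Item("a decision I know I have not made yet", aware=True, known=False),
    Item("taste I only recognize when I see it", aware=False, known=True),
    Item("a constraint I never considered", aware=False, known=False),
]

for item in items:
    print(f"{quadrant(item)}: {item.note}")
```

Under this reading, the techniques that follow in the talk are ways of moving items out of the two "unaware" rows and into the prompt.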

27:09

And and luckily you can use Claude, you

27:11

can use Fable to find your unknowns. So

27:14

I'm going to go over a few examples of

27:16

how I do that with Fable. Um the first

27:19

is I like to do what I call a blind spot

27:22

pass. So I like to say something like,

27:24

"Hey, I'm working on a new auth provider

27:26

that I know nothing about uh like in

27:29

this codebase. Can you do a blind spot

27:30

pass to help me figure out my relevant

27:32

unknown unknowns and help me prompt

27:33

better?" Right? And so this, like, might

27:36

have Claude go through the auth module

27:39

and figure out like, oh, you know, this

27:40

is kind of like a hairy little uh dead

27:42

end that comes up a lot. Maybe searches

27:44

my git diff or Slack. I might tell it

27:46

where there's context, right? So that I

27:49

can learn about, you know, all the

27:51

gotchas. And and you can use this very

27:53

broadly, right? You can use it to teach

27:54

you about new fields. I I recently did

27:57

this for color grading when doing video

27:58

editing. Um because I think this is

28:00

really powerful and and Fable is

28:02

incredible at it. Um in many ways the

28:04

model knows more about you know almost

28:08

everything than I do. I just need to get

28:10

it out of it. Um then I like to use

28:14

brainstorms and prototypes. Uh this

28:16

helps me figure out my unknown knowns,

28:19

right things like especially for design

28:21

for me it's like know it when you see

28:22

it, right? So, I might ask it to uh

28:25

create a dashboard. Um, and I tell it I

28:27

have no visual taste. Uh, make me an

28:30

HTML page with four wildly different

28:32

design decisions so I can react to them,

28:34

right? And and you know, you tweak this

28:35

as you want, but like the idea is to

28:37

sort of get an idea of like what are the

28:40

things that you uh you know, you can't

28:42

describe in words, right? And uh like

28:45

work with the model to help figure that

28:47

out.

28:48

Uh, then interviews. So once I

28:51

have an idea of like, you know, this is

28:53

what I want to do. Uh there's probably

28:55

still a lot of like uh unknowns here,

28:59

right? Where I might not have considered

29:01

something. I might not have specified

29:03

it. And so I'll ask Claude to interview

29:05

me, right? And I'll give it a little bit

29:07

more context in any of these questions.

29:09

Like giving it a little bit more context

29:11

about you and the work and the stage

29:12

you're at, like, hey, yeah, prioritize

29:14

questions that would change the

29:15

architecture is extremely helpful.

29:19

Uh then references. One of the best ways

29:22

to give Claude a map is to give it

29:24

another map, right? So instead of me

29:27

writing out the spec, uh I can just say,

29:29

"Hey, here's some code that represents

29:32

what I want to be done, right? It could

29:34

be in a different uh system or language.

29:37

Uh but just read this code, understand

29:39

it, and then use that to start your

29:42

work," right? And, uh, again, this can be

29:44

in a lot of different ways. If I'm

29:46

making a a React component, I might have

29:48

an HTML mockup that is my map, right,

29:50

that I pass in as a reference. I think

29:51

this is really really powerful and Fable

29:53

is really incredible at it. Uh something

29:57

else I've like really appreciated is

29:58

implementation notes. So if uh while

30:01

you're running Fable uh and it runs into

30:05

an unknown, ask it to log it, right? So

30:07

that um you uh you can see where the

30:11

deviations happened and then you can

30:13

sort of figure out why as well. you

30:14

know, it'll usually give you some

30:15

context about what happened.
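
[Editor's sketch] One way to picture an implementation-notes log is as a file with one entry per deviation from the plan. The talk does not specify a format, so everything here is an assumption made for illustration: the function names `deviation_entry` and `append_entry`, the field names `planned`, `actual`, and `reason`, and the example values.

```python
import json
from datetime import datetime, timezone


def deviation_entry(planned, actual, reason, when=None):
    """Build one JSON line describing where the work left the plan."""
    when = when or datetime.now(timezone.utc)
    return json.dumps(
        {
            "time": when.isoformat(),
            "planned": planned,
            "actual": actual,
            "reason": reason,
        }
    )


def append_entry(path, entry):
    """Append one entry to the notes file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as notes:
        notes.write(entry + "\n")


# Hypothetical deviation, with a fixed timestamp so the output is stable.
example = deviation_entry(
    planned="reuse the existing session table",
    actual="added a separate token table",
    reason="the session table had no expiry column",
    when=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
print(example)
```

In practice the model would write these notes itself when asked; the sketch only shows what a reviewable entry might contain.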

30:18

And then finally, I like to get Fable

30:20

to quiz me about what happened. Uh, just

30:23

to make sure I understand what I'm doing

30:24

and I can represent this work, you know,

30:26

when I'm creating a PR or merging it.

30:29

Um, this is a really great way of like

30:31

making sure that you're like really in

30:33

the loop with Fable. And I think that's

30:35

like one of the most important parts of

30:37

Fable is like staying in the loop and

30:39

making sure that you uh you get what you

30:43

want. So

30:45

um those are some of my tips for working

30:47

with Fable. Uh I also want to say that

30:50

the first time I used a Mythos-class

30:53

model uh used Fable I felt both a huge

30:57

sense of like gain but also a sense of

31:00

loss and I I wanted to talk a little bit

31:02

about that you know um when I think

31:06

about coding before LLMs it feels like a

31:09

foreign country you know like I used to

31:11

run a YC startup about 30 people and we

31:14

were just constantly forced into

31:16

trade-offs because of how hard code was,

31:18

right? Like we could make the the app

31:19

fast or we could try prototyping a new

31:22

feature and and this might take a month

31:23

or this would take two months and so we

31:25

had to choose and it was just really

31:26

really hard. Um and now I went back to

31:29

that codebase a couple weeks ago and I

31:32

thought about some of the things that I

31:34

wanted to do and uh it was just way

31:37

easier. It was like the things that

31:39

would have taken me weeks I could do in

31:42

hours, you know? And uh at some point

31:44

it's like yeah like how can you not

31:46

laugh but also how can you not cry

31:48

honestly like it's like one of these

31:50

things where um I really really loved

31:52

programming and writing code by hand. I

31:55

love the feeling of like seeing the

31:57

codebase in my mind and like rotating it

32:00

but I also remember just you know like

32:02

staying up late nights trying to debug

32:05

working on things for weeks without

32:07

working right. I just remember swimming

32:10

in failure. I just remember that like

32:12

the most of the projects I've ever

32:13

worked on have failed. Most startups go

32:15

bankrupt. You know, I think just overall

32:18

programming and coding is extremely hard

32:21

and like as much as I enjoy those highs,

32:24

I cannot go back, right? And, uh,

32:28

the way my reflection here is like the

32:30

only way out is through, right? There's

32:32

still a lot to learn with the coding.

32:34

There's a lot to learn with Fable. Uh

32:36

but I think if we try really hard and if

32:38

we like stay in the loop, we unhobble

32:40

it, uh we can get there, you know, and

32:43

we can come out on the other side, uh

32:45

with just um so much more. And so the

32:49

last bit I wanted to talk about is is

32:52

the so much more part, right? I call

32:54

this being unreasonable.

32:57

Um, one of my favorite parts of Anthropic

33:01

is that we believe that trade-offs are

33:03

not real. Um, like I think that very

33:07

often I like in my previous company I

33:09

was very used to being reasonable. So

33:11

I'd like write down this list of

33:12

priorities and I'd be like, "Well, I

33:14

guess we can prioritize this against

33:15

this, right?" Um, and uh, like, you

33:18

know, that makes sense. So we'll we this

33:20

will be our priority this quarter, but

33:22

what if you uh just did all of it? You

33:24

know, what if you forced reality to show

33:27

you the trade-off, right? Um, this is

33:30

something I've really valued in our

33:31

culture at Anthropic. And my reflection

33:33

going forward is that I'm going to be a

33:35

lot less reasonable. Um, I think one of

33:38

this like the math of Claude and Fable

33:42

really changes how you think about

33:43

trade-offs. And there are so many

33:44

trade-offs that you make implicitly in

33:46

your head, right? Like good, fast,

33:48

cheap. Now it's pick three, right? Um, I

33:52

think that like the best way to like do

33:55

more ambitious work is to uh like

33:58

reframe and make ourselves more

34:00

ambitious because I think the only way

34:02

to prove that agents work is to do the

34:05

best work of our lives faster than ever

34:08

before. Um, you know, for example, I

34:11

made this deck last night in about four

34:13

hours with Fable. I feel like it's a

34:16

it's a deck I really like and I I really

34:18

enjoyed it, but I also um you know did

34:20

it really fast. Uh and I think that if

34:23

you're here, you know, at AI engineer,

34:25

the world is kind of looking at you to

34:27

prove that AI works, right? That it's

34:29

not just like a fad or something, but

34:31

that it can make us more productive and

34:33

also save us time. And and that's my

34:35

resolution for this year is to work, be

34:38

more productive, but work less and spend

34:39

more time with people I really care

34:41

about.

34:43

Uh, I think it's also worth calling out

34:46

that building is easier, but generating

34:49

value is still hard. And I think this is

34:51

something that we run into, you know, as

34:53

AI engineers sometimes where we think so

34:55

much about the process of building and

34:57

our our setups. Um, but the the point is

35:01

to generate value, right? And uh there

35:03

it takes a lot of swings. It takes a lot

35:06

of tries to find the valuable stuff. Uh,

35:10

but that really is the goal. And that's

35:12

like you know again what the world is

35:14

looking to us to prove that AI can

35:16

really transform it. So to to end I just

35:21

wanted to say like go explore make it

35:23

real and uh yeah be less reasonable.

35:27

Thank you.

35:42

Please join me in welcoming the chief

35:44

executive officer at Sonar, Tariq

35:47

Shaukat.

36:04

Morning everyone.

36:06

Did you enjoy that last talk? That was

36:08

amazing. Um, I particularly loved the

36:10

end, the being unreasonable part. I

36:12

thought that was awesome. Um, I also

36:15

want to just I'm trying to calculate the

36:17

odds of Tariq following Thariq as the first

36:19

two sessions in the morning. Uh, I think

36:22

the odds are pretty low on this one, but

36:24

uh, thrilled to be here today. Um, as as

36:27

we just mentioned, I am with Sonar. We

36:30

are in the code verification space and

36:32

I'm here today to talk about

36:33

verification. And I think we're all here

36:36

uh in large part because we believe to

36:39

some extent that AGI is here. It's

36:42

coming. The models we just heard about

36:44

Fable, it's really incredible what is

36:46

going on in the in the world today. And

36:49

yet we work almost exclusively with

36:51

enterprises around the world. And the

36:53

conversation that we have more is the

36:56

question mark version. Is AGI here? And

36:59

why are they asking these questions?

37:01

It's because you can read the news every

37:04

day. And I'm not trying to name and

37:06

shame here, but if you look at KPMG

37:09

putting out reports that they have to uh

37:12

uh retract because of hallucinations, uh

37:16

EY doing the same thing, law firms

37:19

getting into lots and lots of trouble

37:21

because of made-up citations, made-up case

37:25

law, things like this. I think we can

37:27

really start to question how do we get

37:29

value out of AI? The models are amazing

37:32

as we just heard, but the hard part as

37:35

the other Thariq just said is getting

37:37

value out of it.

37:39

The struggle is that AI slop is

37:43

everywhere. I'm sure you all see this

37:45

inside of your organizations. I'm sure

37:46

you see this in your everyday life that

37:50

AI is amazing. The models are incredible

37:52

at generating very plausible output.

37:55

They're incredible at generating things

37:57

that sound correct, but are they

37:59

correct? And how do you know that

38:01

they're correct is a big problem. And

38:03

it's a big problem in professional

38:05

services as we saw. It's a big problem

38:07

in legal. But really, I think if we're

38:09

honest, it's it's a big problem in every

38:11

sector, in every field, whether it's

38:14

marketing or finance or you name it. You

38:17

have this question of how do you

38:18

actually know if it's true? How do you

38:20

know if it's good or if it is slop? And

38:23

the question that we we deal in the

38:26

coding space in particular, we deal with

38:28

software development. And the question

38:30

we get as we talk to I'm sure many of

38:32

the people here in the room and a lot of

38:34

our customers is, isn't software

38:37

development different?

38:39

And we can look at the data on this and

38:43

uh, the Mythos models. Um, this is data

38:46

from, um, METR. Uh, you may have seen this

38:49

METR chart. Um, the coding agents are getting

38:51

better uh very quickly. They're getting

38:54

a lot better very quickly. And you can

38:56

see uh the progression the exponential

38:58

curve here. What this shows on this

39:00

chart is how capable are the models at

39:03

completing tasks that humans would take.

39:06

So can they complete a task that takes 1

39:08

hour, 2 hours, whatever it is? With the

39:10

latest Mythos model, at least per the

39:12

benchmarking, which was done a month or

39:14

so ago in the preview mode, you're

39:17

getting to 16 to 18 hours. So they're

39:20

actually able the agents are able to

39:22

complete long-running tasks and it really

39:25

is starting to transform how work is

39:28

happening. But the critical caveat when

39:31

you read the data is this is at a 50%

39:33

success rate. Okay. So it is again able

39:37

to complete tasks but is it able to

39:39

complete tasks correctly is the

39:42

question. So if you start looking at

39:44

let's dial up the accuracy rate you dial

39:48

it up to 80%. And there's still progress

39:50

but it is much slower progress. Instead

39:53

of 18 hours you're at about 3 and a half

39:55

hours or something along these lines.

39:58

And by the way this is still at 80%

40:01

accuracy. And as I was presenting this

40:02

to the CTO of one of my uh large

40:04

customers, his response was, "Betaric, I

40:07

would still put someone who gave me 80%

40:10

accurate information on a performance

40:12

review probably, right?" This isn't

40:14

necessarily enterprise-grade.

40:16

The problem is that the models

40:20

themselves, and full disclosure, we have

40:23

not yet uh done this benchmarking on the

40:25

Fable models obviously because they are

40:27

just being released. But as you look at

40:30

the models, the models are getting

40:31

smarter, but they still produce a lot of

40:34

problematic code. This is

40:37

benchmarking that we do. We give the

40:40

models a series of over 4,000 problems

40:43

and we basically ask it to generate the

40:46

response to the problems and then we

40:47

analyze both the functional correctness

40:50

which is critical and they all do

40:51

extremely well on this notion of

40:54

functional correctness, right? Um, but

40:56

then we look at how complex is the code,

40:58

how buggy is the code, how secure is the

41:01

code. And what you see with even the

41:04

state-of-the-art models is that

41:06

complexity is still high. It's actually

41:08

quite variable as you can see here. Um,

41:10

GPT55 has done particularly well on the

41:13

complexity side of things. It still

41:15

generates bugs. It doesn't generate

41:17

massive amounts of bugs, but it still

41:19

generates bugs and it still generates

41:22

security issues. So this is the output

41:25

of the models that are going into the

41:27

agentic workflows. And again, this is

41:30

not, you know, I'm at the AI engineer

41:32

conference. This is not me saying AI is

41:34

fake or or um incorrect, but it is um

41:39

trying to address this question of how

41:42

do you really get value in a production

41:45

setting out of AI? This is a study that

41:49

was done at Carnegie Mellon uh University

41:52

and it looked at what is the actual

41:55

productivity benefit that you see from

41:57

the use of AI coding agents. And what

42:00

you see I think really resonates with a

42:02

lot of what I see firsthand in the

42:05

market which is you have an initial just

42:08

amazing boost of productivity of

42:12

velocity in particular. what you see is

42:14

a three to 5x boost in productivity or

42:16

in velocity. Um that dissipates in

42:21

three months. At the end of three

42:22

months, it starts to come back to the

42:25

normal before you were using the

42:27

agents. And if you ask why, it is

42:30

because of the two pieces in red here

42:31

that you start to see there's an

42:33

increase in velocity, but there's an

42:36

increase in security issues. there's an

42:39

increase in maintainability issues.

42:40

There's an increase in reliability

42:43

issues and there's an increase in

42:44

complexity. So essentially you're

42:46

building the technical debt as quickly

42:49

as you are generating the code or maybe

42:52

even more quickly and that creates a

42:54

different set of work. it creates a

42:55

different bottleneck. And so to us, this

42:59

is now the critical question in AI,

43:04

which is in a world in which code is

43:06

provable. And there's sessions that um

43:09

uh I'm actually very much looking

43:10

forward to attending about formal

43:12

methods and proofs and things like this.

43:14

Code is provable, but when you start

43:16

dealing with large code bases, software

43:18

is not. It's still very complex. It is

43:21

still very messy. there's lots of um

43:24

dependencies. There's lots of uh

43:26

technical debt already in most code

43:28

bases. And so this question of

43:31

verification is actually key. And what

43:33

I'm going to be arguing is that you can

43:36

treat verification as an afterthought or

43:39

you can bake verification into the

43:42

process. And if you bake it into the

43:43

process of generating code, of doing

43:46

software development, you can actually

43:48

start to get materially better outcomes

43:50

from the coding agents than if you view

43:52

it as an afterthought, if you view it as

43:54

just the old school code review.

43:58

So as we've been thinking through this,

44:00

we basically have constructed a

44:02

framework and there's lots of competing

44:04

frameworks around this, but I'll just

44:05

talk you through uh ours. We call it the

44:07

agent-centric development cycle. For

44:10

shorthand we call it AC/DC sometimes and

44:12

the idea here is how do you get

44:14

verification-powered agentic loops at

44:17

the center? There's a lot of focus on the code

44:21

generation piece like how do you

44:22

actually get the models and the agents

44:24

to generate the code that you need to

44:27

solve the problem and what we argue is

44:29

that you should surround this with the

44:33

right disciplines the right tools the

44:35

right processes to do three things to

44:37

guide the agents and tar is talking a

44:40

lot about different aspects of this

44:41

actually. Guide the agents, verify the

44:44

outcomes and then solve the problems.

44:46

And you have to make this part of the

44:49

discipline, part of the process, part of

44:51

the new software development life cycle

44:53

if you want to be successful in the AI

44:56

world. So if I double click on some of

44:59

these pieces, what do we mean by guide?

45:01

We've done a lot of experimenting around

45:03

guide. We've just launched a product um

45:06

yesterday I think called sonar vortex

45:08

that starts to get into this area. What

45:10

we find is critically important is to

45:12

think about guide as context and

45:15

constraints and we separate out context

45:18

and constraints very deliberately

45:20

because context is you have your code

45:22

repositories. How do we make it easier

45:24

for the agents to understand for the

45:26

models to understand what is in your

45:29

codebase? If you have a million lines of

45:32

code, if you have a hundred million

45:34

lines of code, you have a billion lines

45:35

of code, the agents work better if they

45:37

understand your codebase. So, how do you

45:38

give it architectural awareness? How do

45:40

you provide uh semantic navigation uh

45:43

maps um and uh and help them understand

45:46

the territory to borrow what Tar was

45:48

just talking about and we find it

45:50

equally valuable and I don't think this

45:52

part is talked enough about to provide

45:55

the constraints as well. You have

45:58

guidelines that you want your code to

46:01

follow. You have dependencies you are

46:03

okay using. You have dependencies that

46:06

you are not okay having. You have coding

46:08

standards. You have guardrails. You have

46:10

intended architecture. We spend a lot of

46:12

time talking about existing

46:14

architecture. But what about where you

46:16

want to go? And so this idea of context

46:18

and constraints uh we've found in our

46:21

testing generates a massive improvement

46:25

in agent effectiveness and a massive uh

46:28

improvement in token consumption. Over

46:31

30% reduction in tokens being used to

46:34

solve a given problem. And and if you

46:36

ask why, it's because you're actually

46:38

making the life of the agent easier.

46:40

You're helping it navigate better.
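One way to picture the "constraints" half of guide is a small machine-checkable rule the agent (or a check around it) can run before proposing code. The sketch below is only an illustration, not the speaker's product: the deny list and the function name are hypothetical, and it handles Python imports only, using the standard `ast` module.

```python
import ast

# Hypothetical constraint set; the module names are illustrative only.
DENIED_DEPENDENCIES = {"pickle", "telnetlib"}


def find_denied_imports(source: str, denied: set[str]) -> list[str]:
    """Return the denied top-level modules imported by `source`, sorted."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports such as `from . import x` have no module name.
            modules = [node.module]
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]
            if top_level in denied:
                found.add(top_level)
    return sorted(found)


if __name__ == "__main__":
    generated = "import pickle\nfrom os import path\n"
    print(find_denied_imports(generated, DENIED_DEPENDENCIES))
```

A rule like this can be surfaced to the agent up front as a constraint, rather than discovered later in review.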

46:43

So then we get into the heart of this

46:45

and we really think of guide as

46:46

preemptive verification. How do you make

46:48

sure there's less to verify, less to

46:50

fix, this sort of thing. Then you get to

46:52

the heart of verification and what we

46:55

believe quite strongly and what we've

46:57

seen work in practice is this idea of

47:00

zero trust multi-layered verification.

47:03

Zero trust every model has biases. Every

47:06

model has a character, has a

47:08

personality. So, let's make sure we use

47:10

different models and different

47:11

techniques to make sure your code is

47:13

safe, to make sure it's reliable, to

47:15

make sure it's secure. And multi-layered

47:18

really speaks to the earlier point that

47:21

software is complex. Software is very

47:24

messy. Software has lots of

47:27

intricacies involved with it. And so

47:29

what we believe and again have found to

47:32

be quite um impactful here is that a

47:36

combination of algorithmic verification

47:38

looking at things like data flows,

47:40

control flows, known patterns, secrets,

47:43

these areas combined with what is now

47:46

possible with agentic verification

47:48

looking at intent, business logic, the

47:50

unknown unknowns. Actually again to

47:53

borrow from the last uh presentation the

47:56

fusion of these things the the

47:58

deliberate multi-layered fabric that you

48:01

put in place can actually you can see

48:04

the results of this in production. So as

48:07

we look at our partners and customers

48:09

who use a multi-layered verification

48:12

approach they are reporting AI derived

48:16

production outages being 44% less

48:19

frequent than the ones who do not. So

48:21

you can start seeing a material

48:22

improvement in reliability, in security

48:25

and in maintainability.
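As a rough sketch of what "multi-layered" could look like in code, the example below runs a deterministic (algorithmic) layer first and then any number of reviewer callables standing in for agentic checks. Everything here is an assumption for illustration: the `Finding` type, the secret-detection regex, and the reviewer interface are hypothetical and do not describe the speaker's actual tooling.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    layer: str    # which verification layer raised it
    message: str


# Illustrative pattern for hard-coded credentials; real scanners use many more rules.
SECRET_PATTERN = re.compile(
    r"(api[_-]?key|password|secret)\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
)


def algorithmic_layer(code: str) -> list[Finding]:
    """Deterministic check: flag lines that look like hard-coded secrets."""
    return [
        Finding("algorithmic", f"possible hard-coded secret on line {number}")
        for number, line in enumerate(code.splitlines(), start=1)
        if SECRET_PATTERN.search(line)
    ]


def verify(code: str, agentic_reviewers: list[Callable[[str], list[str]]]) -> list[Finding]:
    """Run the algorithmic layer, then every reviewer, and merge the findings."""
    findings = algorithmic_layer(code)
    for reviewer in agentic_reviewers:
        # Each reviewer is a stand-in for a model-backed check on intent or logic.
        findings.extend(Finding("agentic", message) for message in reviewer(code))
    return findings


if __name__ == "__main__":
    sample = 'password = "hunter2"\nprint("hi")\n'
    stub_reviewer = lambda code: ["business intent not stated"]
    for finding in verify(sample, [stub_reviewer]):
        print(finding.layer, "-", finding.message)
```

Using several independent reviewers is one way to act on the "zero trust" idea that no single model's judgment is taken at face value.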

48:28

And then the last point I mentioned is

48:31

technical debt does explode. Right? As

48:33

you generate code, technical debt is

48:36

also generated. And again, this is not

48:39

stop doing it. This is be aware and

48:42

let's start controlling it. And so what

48:45

we um have seen be super effective is to

48:49

have an active process to have an active

48:52

discipline again around code maintenance

48:55

and thinking about how you do verified

48:58

code maintenance. Um I won't walk

49:00

through every step of this but the

49:03

agents whether that is a set of

49:06

remediation agents whether it's a strong

49:08

discipline around verification does keep

49:10

your codebase clean

49:12

and a lot of people have asked me all

49:15

right but do agents care about clean

49:17

code human developers care about clean

49:19

code do agents care about clean code and

49:21

what we find again is they absolutely do

49:24

because the agents have to understand

49:26

the codebase if they're going to operate

49:28

on it. So this is a one-shot view. Um

49:31

we think this is something that

49:32

compounds. But if you just do the exact

49:35

same agentic tasks on a typical codebase

49:39

and then one that has been cleaned, you

49:41

see a material reduction in the amount

49:43

of tokens, reasoning, energy, etc.

49:45

needed for those cleaner uh code bases

49:49

versus the typical code bases. Right? If

49:51

you make the life of the agent

49:53

easier, if you maintain your codebase,

49:56

then you'll actually see compounding

49:58

effects.

49:59

Now the important thing in our mind is

50:02

to construct the system. And this is how

50:04

I started is saying, you know, I'm sure

50:06

all of us do code reviews, you may use

50:08

static analysis tools, you may use AI

50:10

code uh review tools, a whole range of

50:13

things. And we believe that you have to

50:15

put this in a system. And again, uh,

50:17

we're happy to in our booth downstairs

50:20

talk through what this looks like, but

50:22

we really believe that the construction

50:25

of the software development life cycle

50:27

in an AI world um, needs to embed this

50:30

notion of guide, verify, and solve

50:32

inside of it. And you need to do it in

50:34

three loops. And you need to think about

50:36

these three loops. There's the agentic

50:38

loop, which I think is the key buzzword

50:40

of the conference. Um now but how do you

50:42

provide the agents as it's generating

50:45

the code as it's doing the work with the

50:47

context and constraints

50:50

with the inloop verification so that the

50:52

agent is getting verification as it's

50:54

working and how do you fix problems

50:56

that's that's the blue loop here what we

50:58

what we talk about is the inner loop

51:00

verification piece there's a second

51:03

which is your continuous improvement

51:04

process and how do you really combine

51:07

the power of algorithmic and agentic to

51:11

generate your your pull request, review

51:15

the code and by the way the velocity of

51:17

this has to go up massively. So to

51:19

review the code using agents and to this

51:22

multi-layered verification and then you

51:25

have your evals and I think the opening

51:27

speaker talked about how eval may be the

51:29

buzzword of the um conference. You have

51:32

your evals and you have your quality

51:33

gates to check are you actually passing.

51:36

So you have your your code maintenance

51:38

loop, agentic loop, CI verification loop

51:41

and deliberate design of these loops

51:45

with verification at the center is a

51:48

compounding system. It's a system that

51:51

reinforces itself and it reinforces

51:53

itself in the positive and it reinforces

51:55

itself in the negative. And we've seen

51:58

customers who uh have kind of neglected

52:02

as they've rolled out AI coding tools,

52:04

they've neglected verification. they've

52:06

neglected this idea of code quality, of

52:09

code um maintenance, things like that,

52:12

and you get into a downward spiral

52:14

pretty quickly. This is what the

52:15

Carnegie Mellon uh case study uh or study

52:18

actually shows is that you actually have

52:21

all the benefits start to dissipate or

52:23

you can get into this self-reinforcing

52:25

loop. And one of the tests we did with

52:27

one of the large banks who are using

52:29

some of the cutting edge the folks who

52:32

are all around here today um cutting

52:34

edge agentic coding tools they can get a

52:37

92% reduction in issues if you actually

52:41

take this guide verify solve approach

52:43

inside of those agentic loops. If again

52:46

this compounds it's not that each loop

52:49

is 92% better. it's that as you go

52:51

through solving the problem over minutes

52:54

and hours that you actually see a

52:56

compounding benefit.

52:59

So that is uh essentially how we see the

53:03

benefit here, how we see the

53:05

controlled um value creating use of AI

53:10

in enterprise settings. And when I say

53:12

enterprises, people with existing code

53:14

bases, people with with you know

53:17

millions of lines of code already.

53:19

There's the agentic loop, there's a CI

53:21

verification loop, there's the code

53:24

maintenance loop. I am required by my

53:26

marketing team to put up a version of

53:28

this that has our products on here. So

53:29

these are our products and you can come

53:31

and see us later. But the most important

53:33

thing is really to say our

53:36

recommendation is this, the AC/DC

53:38

agent-centric development cycle. The core

53:41

part is deliberate verification built

53:43

into the system. So if you'd like to

53:45

learn more um we have a booth that's the

53:48

big red booth downstairs. We'd love to

53:49

talk more. We have some double-click

53:52

sessions coming up. So please do uh join

53:54

those and uh have a great conference.

53:56

Thank you all.

54:10

Joining us on stage is a member of

54:12

technical staff at Amazon AGI lab onjab.

54:34

Good morning. It's so great to be back

54:37

here at the AI Engineer Worlds Fair.

54:42

Just a year ago, the hard problem was

54:46

getting an agent to find a button and

54:49

click it on a screen, especially screens

54:52

it had never seen before.

54:55

Now agents can drive browsers and

54:59

they're starting to also drive desktop

55:02

apps.

55:04

But what we figured out: clicking

55:08

was actually the easy part.

55:14

What we didn't solve is the actual work.

55:18

And what do I mean with this? Let's take

55:21

a very simple example. A new team member

55:24

starts on Monday. And maybe your job is

55:27

to set up their accounts, add them to

55:30

your Slack channel,

55:32

book intros with colleagues,

55:35

order the laptops, etc.

55:38

And nobody really owns this end to end

55:42

process in the company and it might be

55:45

also touching five different systems.

55:49

Now, agents can most likely perform each

55:53

single individual step of

55:55

this workflow,

55:57

but agents still struggle to do this end

56:01

to end because the real work lives

56:04

within the seams of all of those

56:06

different applications, of all of those

56:08

different steps you have to take. And

56:11

this is mostly where it all falls apart.

56:15

The agent can use every single tool you

56:18

give it, but it still can't do the full

56:21

work.

56:25

So why do we see this gap?

56:28

Think about for a minute what we

56:30

actually built.

56:32

We taught computers to use computers.

56:36

So what do I mean with this? We started

56:39

building out the basics. We taught them

56:42

clicking, scrolling, typing, calling an

56:46

API, filling out a form, and we got

56:49

those steps, these steps really

56:51

reliable, and you can string them

56:53

together in a workflow. And agents these

56:56

days are fairly good at like operating

56:59

those workflows.

57:02

So, why can you not just hand them

57:04

more of your work and then literally

57:06

just walk away and trust it to be

57:09

completed?

57:13

So all the things I talked about like

57:15

using a tool models itself,

57:19

tool use, stringing agents together,

57:22

this is all capabilities

57:25

and we mostly figured out how to add

57:28

capabilities to models.

57:31

Now the next hard part is really

57:34

reliability

57:36

and without reliability we cannot really

57:39

build up trust in those systems.

57:43

So here's a quick gut check and maybe

57:45

all of you can just think about an agent

57:48

doing work in an end to end workflow.

57:52

How often do you think that actually

57:55

succeeds these days? Maybe 60 maybe 80%

57:59

of the time.

58:02

And it sounds really fine, but if you

58:05

look into this,

58:08

if your agent

58:10

one in four times deletes a database,

58:13

you will never touch that agent again,

58:16

right?

58:17

So

58:19

when you need this reliability, you

58:22

really need it to be in the nines. You

58:25

need to have the trust that it actually

58:27

can do the work successfully.
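A quick worked calculation shows why per-step reliability has to be "in the nines" for a multi-step workflow. This is a simplified sketch that assumes every step succeeds independently with the same probability, which real workflows will not match exactly; the function name is illustrative.

```python
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent, equal steps."""
    return per_step_success ** steps


if __name__ == "__main__":
    # Five steps, echoing the onboarding example that touches five systems.
    for per_step in (0.90, 0.95, 0.99, 0.999):
        overall = end_to_end_success(per_step, steps=5)
        print(f"per-step {per_step:.3f} -> end-to-end {overall:.3f}")
```

Under these assumptions, a step that works 95% of the time yields a five-step workflow that finishes only about 77% of the time, while 99% per step yields about 95%.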

58:34

Now, there's actually one place

58:37

where we made enormous progress on

58:40

reliability and trust and this is

58:43

coding, right? Think about how fast

58:47

coding evolved.

58:50

I still remember the first time when it

58:53

started autocompleting for you, right?

58:55

You just tapped autocomplete. Amazing.

58:58

Then short time later, it started to

59:01

write functions. And we thought that is

59:03

amazing.

59:05

And now look at these days. Coding

59:07

agents write the code. They open up the

59:11

pull requests themselves. And we had it

59:13

earlier this week. Code keeps flying by.

59:17

So once in a time we were able to just

59:20

every single line that it generated we

59:22

felt like the urge we need to really

59:24

read it and make sure it's correct right

59:27

I think most in the audience here can

59:29

still relate to that

59:31

these days I think hardly anyone is

59:35

still doing that like we cannot even do

59:37

that right code is generated at such a

59:39

pace

59:41

at the same time coding made that jump

59:44

so why is that because we were able to

59:48

bring it from just being capable the

59:51

coding agents to actually be reliable

59:53

and then trusted.

59:57

So why is that? Why was coding first

60:01

solved?

60:03

It's because code is verifiable.

60:06

You can run it, you can test it, you can

60:09

check it and you can be sure that it

60:12

worked.

60:14

So reliability showed up in the first

60:18

place you can actually verify the

60:21

answer.

60:26

But here's the catch.

60:28

Most of the work we do if you look at

60:30

the broader knowledge work areas is not

60:33

like that.

60:35

Knowledge work is messy and heck the

60:39

whole real world is really messy.

60:43

Did the report I created land? Is the

60:46

design on brand?

60:48

Did it get what I actually meant? So

60:52

there is no unit test that can answer

60:54

those questions.

60:57

So verification really hits the wall

61:00

right where most of our work lives.

61:03

It's living in the seams of all of those

61:06

applications we're using on a day-by-day

61:08

basis.

61:09

And nobody really has cracked this part

61:11

yet.

61:13

How do you make an agent reliable when

61:16

there's no way to verify the answer that

61:20

easily? And that's a field that is still

61:22

wide open.

61:27

So, how can we solve this?

61:30

Well, so how do humans handle messy

61:33

work? I mean, we're successful at it,

61:35

right? Each of us like every day we work

61:38

across different systems. We manage out

61:40

how to onboard a new colleague. We do

61:43

this.

61:46

Well, we're doing it by figuring things

61:49

out together. You grab a colleague, you

61:52

jump on a Zoom meeting, you're

61:54

discussing things, you're looking at the

61:56

problem to solve, you're discussing,

61:59

pointing at systems, and maybe two

62:02

minutes later, you solved it. You're

62:04

done. But none of this work is actually

62:07

directly verifiable.

62:10

And we do this all day.

62:14

So one of the things is we're looking

62:16

mostly at the same screen, right? If

62:18

you're jumping on a meeting with a

62:19

colleague, you see the same screen, both

62:22

of you, and you can actually like figure

62:25

out really quickly what needs to be

62:27

done.

62:29

So this is what the agent these days is

62:31

missing.

62:33

You don't necessarily need a bigger

62:35

brain. What you need is this shared

62:38

context. Because if we're looking the

62:41

agent and myself at the same screen, I

62:44

probably have much less explaining to

62:47

do.

62:51

So what kind of agent do we really need

62:54

to build to achieve this?

62:57

And today's agents, as I said, they can

63:00

already see a screen, right? and they

63:02

can click and take actions in it. That

63:04

part works.

63:06

But if they fire off actions, what they

63:09

usually do, they move on. They don't

63:12

watch what happens or recover if one

63:16

step didn't succeed or something goes

63:18

sideways.

63:20

And we need an agent that can actually

63:23

work like you do, like humans work.

63:27

And one example is robotics. If you just

63:30

look for a moment at how robotics does it,

63:33

a robot perceives what's around it and

63:37

it plans what to do and then acts. So

63:41

this loop here from perceiving to

63:44

planning to acting, this is actually

63:47

what we also would need on a screen.

63:51

And it starts here really with the first

63:53

word which is perceive.

63:56

The agent has to take in the screen the

63:59

way you do,

64:02

not scrape the code behind the page, but

64:05

what's actually rendered, the layout,

64:08

the state, what just changed the work,

64:12

what we're doing, and then do it.

64:18

And it would also have to keep up in

64:21

real time. Think about how we as humans

64:24

work together.

64:26

You jump in, you react to build on top

64:29

of what each other says.

64:33

And today agents still can't do it.

64:37

What we're doing is we're sending a

64:38

prompt, we're waiting, it goes away and

64:41

at one point the agent comes back and we

64:43

might have to take a couple of turns,

64:45

right? Because what the agent comes back

64:46

with is not exactly what we might want

64:49

to do. So we're sending another prompt

64:51

say, "Hey, go back, do this, do this

64:52

differently." And we have this long back

64:55

and forth which we got so used to from

64:58

our chatbot experience and from this

65:01

rhythm taking turns.

65:03

But what we actually would need, think

65:06

about it, is an agent that can react

65:09

while you're still working. Wouldn't

65:11

that be really cool, right? Like at the

65:14

same time you're working, it can also

65:17

come up with suggestions, can help you,

65:20

and there is no waiting time.

65:23

So basically an agent that perceives

65:26

what you perceive and understands what

65:30

you mean.

65:34

We call them perception agents.

65:41

So why perception agents? Why do they

65:43

matter? So first they complete the loop

65:48

on computer use.

65:50

Today's agents again they can act, they

65:52

can click, they can type, they can

65:55

scroll,

65:56

but what they can't do well is looking

66:00

at the results and whether it actually

66:02

worked out. A perception agent can read

66:07

the rendered screen so it can confirm

66:10

its own output instead of just firing

66:12

off those actions

66:14

and then hoping.
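The difference between "fire and hope" and a closed loop can be sketched in a few lines. This is a toy model under stated assumptions, not the speaker's harness: `perceive` and `act` are hypothetical callables, and the flaky checkbox stands in for a real UI element that sometimes ignores the first click.

```python
from typing import Any, Callable


def act_and_confirm(
    perceive: Callable[[], Any],
    act: Callable[[], None],
    expected: Any,
    max_attempts: int = 3,
) -> bool:
    """Act, then look at the resulting state; retry until it matches or attempts run out."""
    for _ in range(max_attempts):
        act()
        if perceive() == expected:
            return True
    return False


class FlakyCheckbox:
    """Toy UI element that only registers the second click."""

    def __init__(self) -> None:
        self.checked = False
        self.clicks = 0

    def click(self) -> None:
        self.clicks += 1
        if self.clicks >= 2:
            self.checked = True


if __name__ == "__main__":
    box = FlakyCheckbox()
    succeeded = act_and_confirm(lambda: box.checked, box.click, expected=True)
    print("confirmed:", succeeded, "after", box.clicks, "clicks")
```

An agent that clicked once and moved on would leave the box unchecked; the confirming loop notices and recovers.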

66:17

Second, it doesn't need an API or

66:20

backend process.

66:22

And that's important because it works

66:24

off the rendered interface. It sees the

66:27

same pixels and the structure you see.

66:31

And most of today's software people use

66:33

every day doesn't expose APIs at all.

66:39

And then third, the input also goes the

66:42

other way here. Instead of writing a

66:45

long paragraph to describe what you want

66:47

to change, let's say you're working on a

66:50

website and you want to describe all the

66:53

changes you want to apply. Instead of

66:55

writing this really long description,

66:58

wouldn't it be great if you can just

67:00

point to it and say, "Hey, here this

67:02

heading needs to change. Hey, can you

67:04

update this section?"

67:07

This is a much more precise signal and

67:11

less lossy than text.

67:13

and the agent can act exactly on what

67:17

you marked.

67:21

So this is where we started and I'm

67:25

happy to share that we just recently

67:27

launched the first two pieces of our

67:31

perception agent harness

67:34

open source.

67:36

There's two pieces. There is annotation

67:39

which you can use to tell it what you

67:41

want.

67:43

And then the second piece, the

67:45

verification part gives the agent the

67:48

capability to check its own work.

67:54

So let me show you the first one. So

67:57

here's a very quick demo on our

68:01

annotation tool.

68:03

This one is a Chrome extension, so it's

68:05

super easy to use. And I'm going to play

68:08

here this quick video demo.

68:11

So you have the extension installed and

68:13

then you can just select different

68:16

elements on a screen. So this example,

68:19

we're just drawing around the heading

68:21

there, marking the section. And maybe

68:24

you want to change it. Why not? Let's

68:25

change it to red.

68:28

You could also select the elements on

68:30

this page. You see how if I hover over

68:32

it, finds the right element. You click

68:34

it, you select it, and say something

68:36

maybe double the font size. And you see

68:39

also how the agent here captures on the

68:42

screen exactly the feedback, the

68:45

location, the style elements and it

68:47

creates this complete summary which you

68:50

can then use and then give your agent to

68:53

implement. So there is no back and forth

68:55

anymore because you captured exactly

68:58

what you saw on screen and the agent can

69:01

see the same thing.
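To make the idea of a captured annotation concrete, here is a minimal sketch of what such a summary might contain. The field names and output format are assumptions made for illustration; the actual extension's data format is not described in the talk.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    selector: str                    # hypothetical: CSS selector of the marked element
    box: tuple[int, int, int, int]   # x, y, width, height in pixels
    instruction: str                 # what the user asked to change


def summarize(annotations: list[Annotation]) -> str:
    """Render the annotations as a numbered summary an agent could consume."""
    lines = []
    for index, annotation in enumerate(annotations, start=1):
        x, y, width, height = annotation.box
        lines.append(
            f"{index}. {annotation.selector} at ({x}, {y}, {width}x{height}): "
            f"{annotation.instruction}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    notes = [
        Annotation("h1", (0, 0, 800, 60), "change the heading to red"),
        Annotation("p.intro", (0, 80, 800, 120), "double the font size"),
    ]
    print(summarize(notes))
```

Pairing each instruction with a location is what removes the back and forth: the agent does not have to guess which heading or which paragraph was meant.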

69:06

Now let's have a very brief look at the

69:08

second one at verification.

69:12

So the idea of verification is that you

69:15

can describe let's stay in this case of

69:18

the web development. You can describe in

69:21

a design MD file what your design rules

69:25

are for this.

69:27

And then what happens if I play this

69:29

video here, the agent can

69:32

actually check its own work against

69:34

those design specs.

69:37

So it will take what you defined, the

69:39

colors, the components, your layout, and

69:42

it turns it into those rules if you

69:44

don't have it written before yet. And it

69:46

does two kinds of checks. Then it does a

69:50

visual check, which is really cool. So

69:53

everything is on brand, for example.

69:55

it's the right layout.

69:58

The other part is also checking user

70:01

flows. So what it does there, it

70:04

actually walks through this experience

70:06

through the app for example depending on

70:09

the tasks available. It might add a

70:11

task, it might delete a task like a real

70:14

user would. So it helps you walk through

70:17

those user flows as well in an automated

70:20

fashion. And then once it's done, it's

70:23

writing a report which you can review

70:25

and it's going to call out which tests

70:28

passed and it's going to tell you

70:30

anything that didn't.
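The visual half of this check can be approximated as comparing rendered properties against declared design rules and reporting passes and failures. The sketch below assumes the rules and the rendered styles are already available as plain dictionaries; how the real tool parses the design file or reads the screen is not covered in the talk, so those parts are left out.

```python
def check_design(rules: dict[str, str], rendered: dict[str, str]) -> dict[str, list[str]]:
    """Compare rendered style values against design rules and build a report."""
    report: dict[str, list[str]] = {"passed": [], "failed": []}
    for prop, expected in rules.items():
        actual = rendered.get(prop)
        if actual == expected:
            report["passed"].append(prop)
        else:
            report["failed"].append(f"{prop}: expected {expected}, got {actual}")
    return report


if __name__ == "__main__":
    # Hypothetical rules and observed values, for illustration only.
    design_rules = {"heading-color": "red", "background": "white"}
    observed = {"heading-color": "red", "background": "yellow"}
    result = check_design(design_rules, observed)
    print("passed:", result["passed"])
    print("failed:", result["failed"])
```

The failed list is where the human judgment call comes in: either fix the violation or update the rule.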

70:33

So ultimately, you're the one that

70:36

doesn't have to click through this at

70:38

midnight at the end of the day because

70:40

great work. The agent already did this

70:42

job for you.

70:46

Now there might not always be a screen,

70:51

right? So I talked a lot right now. I

70:53

called it perception. I talked about the

70:55

agent sees what you see on a screen.

70:59

But there are times in your day where

71:02

you don't have a screen. Maybe you're in

71:04

the office. You're walking into a

71:07

meeting with a colleague.

71:09

So I did a fun experiment yesterday

71:13

at the conference here. So I grabbed my

71:15

colleague Giovanni who is also here and

71:18

actually on the second floor there's a

71:19

great like little meeting booth. We

71:21

found that by coincidence. So we went in

71:23

there and we had our design meeting. And

71:26

the goal here is really kind of show you

71:29

how perception is so much more than just

71:31

the visual part. So in this example,

71:34

what we want to show you is perception

71:36

can also be listening in the room to

71:39

what you're discussing.

71:41

And you can see here on the picture,

71:43

both of us are wearing our Bee devices.

71:46

Big shout out to Bee for sponsoring these.

71:49

Um, so we're sitting there. We have our

71:52

Bee devices that can do a transcript.

71:54

They're listening to what we're saying.

71:56

And then we have this design meeting.

71:57

And I had a couple of great ideas how to

72:00

change this website. Um, you will see

72:02

them in a in a second here. So let's

72:06

have a quick look how this changed the

72:08

same workflow on this website using this

72:11

device.

72:14

So we had the discussion, the Bee did the

72:17

transcript and you can see here on the

72:18

right we're pulling this meeting

72:21

transcript right in. There is a whole

72:24

detailed summary of the meeting.

72:27

There is

72:29

what we discussed and then it basically

72:32

captures those insights. We have them

72:34

right here and we can click apply. So

72:37

what this apply button does is it sends

72:40

it straight to the agent. And you can

72:42

see here my crazy ideas to turn the

72:44

background to yellow, turn the heading

72:46

to red, and also change an emoji

72:49

directly applied. And it also straight

72:52

kicks off the verification right away.

72:55

So it creates this report and and

72:57

luckily this color scheme was apparently

72:59

in the approved rules; otherwise

73:02

this would have looked like you did some

73:03

weird things here. But again you could

73:06

change those rules if you don't want to

73:07

have yellow backgrounds and it will make

73:10

sure um that we still adhere to those

73:12

guidelines. It would flag anything

73:14

that's off. So you have the judgment

73:17

call if you want to either update the

73:19

design specs because you actually like

73:21

yellow or you take an action and say no

73:24

um fix this violation.
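
The verification step described here, checking applied changes against approved design rules and flagging anything that is off, could be sketched roughly as follows. This is an illustration only, not the speaker's implementation: the rule table, the property names, and the `verify_changes` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical design spec: property name -> set of approved values.
APPROVED_RULES: Dict[str, Set[str]] = {
    "background-color": {"white", "yellow"},
    "heading-color": {"black", "red"},
}


@dataclass(frozen=True)
class Violation:
    prop: str
    value: str


def verify_changes(changes: Dict[str, str],
                   rules: Dict[str, Set[str]] = APPROVED_RULES) -> List[Violation]:
    """Return one Violation per changed property whose new value is not approved.

    Properties that have no rule at all are not flagged.
    """
    violations = []
    for prop, value in changes.items():
        allowed = rules.get(prop)
        if allowed is not None and value not in allowed:
            violations.append(Violation(prop, value))
    return violations


# The changes from the meeting pass because the (assumed) spec allows them.
report = verify_changes({"background-color": "yellow", "heading-color": "red"})
```

With a rule table like this, the two choices mentioned in the talk map onto two actions: add the value to the approved set if you actually like yellow, or leave the rules alone and have the agent fix each reported violation.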

73:28

But this is really the very first step.

73:32

These two pieces are the very first

73:34

beginning and we're building out the

73:36

rest in the open because these patterns

73:40

can only get better if more people are

73:42

using them, building on top of them,

73:45

breaking things. So my ask here to you

73:48

is go and try them out. They're on our

73:51

GitHub repos.

73:54

Tell us what we're missing. Give us the

73:56

feedback what you would like to see

73:58

where this should go next. because

74:00

ultimately none of us get smart alone

74:04

and that's the whole point. We want to

74:06

build AI that makes all of us smarter

74:10

together.

74:14

Now, if you're interested in a little

74:16

bit more on human agent interactions and

74:20

how we see those patterns changing, I

74:23

would highly recommend this podcast by

74:25

my colleague Danielle Persik. She is a

74:28

cognitive scientist and runs our AGI ACI

74:32

team at the lab and discusses a lot

74:35

about human computer interaction

74:37

patterns with experts in the industry.

74:39

You can find the podcast on a popular

74:42

podcast platform.

74:44

We also have more sessions this week. Um

74:47

so check them out. We have a booth down

74:49

there. We have expert talks. We also

74:52

have another computer use track talk

74:54

coming up with my colleague Gav Mishra

74:57

at 1:30 in the computer use track.

75:00

Highly recommend checking out his talk

75:02

from RL to IRL.

75:06

And then ultimately come find us. We

75:08

have a huge presence down at the expo

75:10

hall. We would love to continue the

75:12

conversation with you all. If you're not

75:15

here in person, you can also check out

75:17

our code on our GitHub repo and check

75:19

out our website.

75:22

And with that, thank you very much.

75:34

Please welcome to the stage the vice

75:37

president of research at Google

75:40

DeepMind, Benoit Schillings.

76:10

All right, good morning. Uh this is

76:12

really quite exciting to be here and

76:14

have a chance to to speak with all of

76:15

you. Uh my name is Benoit Schillings. I'm

76:19

actually a bit of a noob when it comes

76:21

to to machine learning. Till a year and

76:24

a half ago, I was working for Google X

76:28

which some of you may know. We've done

76:30

things like Waymo, which seems to be at

76:32

every street corner now. Uh we also do

76:35

things like Glass. So you know we we had

76:37

a mix of hits and misses, but in many

76:41

ways this was for me an interesting

76:43

formative experience on how to run a

76:46

research team in a place like DeepMind.

76:49

I do have an incredible team. Uh my team

76:53

goal in DeepMind is basically to

76:56

develop whatever technology will be

76:58

needed to make Gemini incredible between

77:01

one month and one year from now. So one

77:04

month because if you start to work on

77:06

what is needed in one week, that's a

77:08

very different type of job and one year

77:11

because I don't think anybody can really

77:14

predict anything that far. So that's

77:16

already pretty ambitious in my opinion

77:18

to think about things that would happen

77:20

one year in the future.

77:23

We do many things under that role. Uh a

77:27

lot of it is related to code which will

77:29

be the main subject of my talk today. uh

77:32

but we also do a lot of research on what

77:34

is the evolution of reasoning for models

77:37

for instance or we do topology research

77:40

what are new type of network that might

77:43

bring better performance uh we do

77:46

fundamental work in the science of

77:47

reinforcement learning which is so

77:50

fundamental to what we're doing today

77:52

with ML

77:56

let's do a bit of an origin story Um,

78:02

we started the project at X named

78:04

Pitchfork in 2018

78:08

which was aimed at looking at how ML

78:11

could really improve the way code is

78:13

being written. And this was very

78:16

interesting because in 2018 when we

78:18

presented that at Google

78:21

honestly nobody would give us the time

78:23

of day. uh there was that point like why

78:26

would you ever need ML to to write code?

78:30

Um at the same time I think that we

78:32

totally underestimated how fast this

78:34

could go. When we did that project

78:36

originally the idea was to look at how

78:39

we could speed up the evolution of a

78:42

piece of code. How could we make many of

78:45

those small changes which slows down

78:47

code speed development? you know the

78:49

small edit which requires a review that

78:52

takes three days and how we could

78:53

compress that cycle.

78:56

Some people were talking about vibe

78:58

coding writing code in English and at

79:00

the time honestly I totally dismissed

79:02

that I was that's why we have

79:04

programming language English is not a

79:06

programming language. Well, I I I guess

79:08

I was pretty wrong on that front, but

79:12

the resistance we felt at the time

79:14

reminded me of how my own career was

79:17

pretty resistive to to change. Um, I've

79:20

been writing code for

79:23

45 years. Uh, I started by writing video

79:26

games for the Apple II and Commodore 64. So,

79:29

uh, my formation was to write assembly

79:32

language. And when you spend a long time

79:35

writing assembly language, you look at

79:37

compilers with a lot of suspicion,

79:39

right? Are those things really working

79:41

correctly? And then when you switch to

79:43

C++ and use compilers, you look

79:46

at garbage-collected languages as this, eh,

79:49

that's not real programming. You need to

79:51

manage your memory. Well, today I use

79:54

Python and vibe coding. So even old dogs

79:57

can learn new tricks. So uh but but I I

80:00

I do understand what happened there.

80:04

I think that we have a number of eras in

80:07

what happened with software and and the

80:09

first one was you know the one where I

80:11

started writing code where the

80:14

fundamental limit was really the machine

80:17

and there was a lot of work to go and

80:19

extract the last ounce of power out of

80:22

those machine and that was the days of

80:26

assembly language where you really

80:29

needed to be incredibly accurate in the

80:30

way you were writing code computing

80:33

became much cheaper and we switched to

80:35

the modern cloud era where getting the

80:39

best performance is not the most

80:40

critical aspect. You can actually brute

80:43

force many problems but really what

80:46

became the limiting factor was the

80:49

ability for us to design in a modular

80:52

way. You know this was the era where

80:54

software was write it only once and this

80:58

was this whole idea of how are you going

80:59

to build libraries? How are you going to

81:02

write functions? How are you going to

81:04

break down that problem into something

81:06

that is long-term manageable?

81:08

The limitation there, and that determined

81:11

a lot of how our software processes are

81:13

working, was actually the human brain.

81:16

Uh a traditional human typical human is

81:19

able to get the context between seven

81:22

and nine tokens. I mean we have very

81:25

rich tokens but you compare that to

81:27

modern ML where the context is basically

81:30

going to be infinite pretty soon. uh

81:33

that fundamental limitation of human

81:35

determined a lot of how software was

81:37

being written. This is over and we're

81:41

switching now to that AI frontier where

81:44

really writing the code is not the

81:46

challenge anymore. uh I'll speak some

81:48

more about it but the bottlenecks are

81:50

really how do you ensure that that code

81:53

is what you really wanted because

81:55

writing the code is easy but getting

81:57

what is needed for a specific problem

81:59

can be much harder to to specify

82:02

so humans at least in the near future

82:05

will be that role of architecture or

82:07

thinking of what are really the

82:09

implication of that piece of code I'm

82:11

getting the ML to to design inductive

82:14

thinking is another category where I

82:16

think Humans still have a a very clear

82:19

edge which is to look at a system in a

82:22

much wider context and to be able to

82:24

detect patterns and from those pattern

82:26

take some decision.

82:30

So where are we today? U superhuman

82:33

syntax generation.

82:36

When is the last time I asked Gemini to

82:38

write a function for me and I looked at

82:41

the function and I was like I can do

82:42

that better.

82:44

It's over. uh I think that the minutia

82:47

of code writing I mean you can fight you

82:49

can argue you can find counter example

82:51

but that time is is gone where we still

82:55

have a lot of work to do is multi-step

82:58

code base uh software engineering is not

83:01

about writing code software engineering

83:04

is the first time you join a company and

83:07

you realize that there are 35 million

83:10

lines of PHP in the codebase and that

83:12

you need to make some changes that

83:14

that's the day you understand what

83:16

software engineering is and that's a

83:18

place where our models today or frontier

83:20

models are progressing but this ability

83:23

to manage that extreme complexity and

83:26

break it down into manageable pieces

83:28

is a place where the frontier is still

83:30

moving

83:32

um it goes all the way to architecture

83:36

you look at I don't know the Google

83:38

architecture

83:40

thank god we have Jeff Dean, who was

83:42

you know the the key architect there but

83:45

that's the level of thinking which has

83:47

many implication which can go from how

83:49

do you do hardware optimization how do

83:51

you manage security how do you build a

83:54

system so that 10 years later you're not

83:56

full of regrets and I think this is

83:58

really the the range of progress we are

84:01

working on today so code is over but

84:04

there's plenty to do there's plenty of

84:06

progress to be made

84:09

now code is a very unique problem and in

84:12

some way that's the reason We we did

84:14

Pitchfork on this. Um

84:18

first of all, code is a lot of data.

84:21

There are other domains where you can

84:23

find a lot of data to train your model,

84:25

but code was so incredible. You could go

84:27

and go on GitHub and start to to scrape

84:30

GitHub. So this was one of those problem

84:32

where the amount of training data was a

84:35

very unique situation.

84:37

It is also a domain where doing

84:39

verification is reasonable. You can run

84:42

a piece of code, you can compile it, you

84:44

can have unit test. So the ability to

84:47

figure out is the model generating

84:49

something correct was something that was

84:51

pretty reasonable to do. That brought us

84:55

where we are today. But today what

84:57

happened is that we ran out of training

84:59

data. I think that 80% of the new code

85:03

added to GitHub today is machine

85:05

generated. So the notion of human

85:08

bringing some knowledge that can be used

85:11

for mining and to train model is

85:13

reaching an end. But the good news is

85:16

that we can do self-play and self-play is

85:19

something we always liked a lot at DeepMind.

85:21

I suppose you all know AlphaZero.

85:24

AlphaZero became a superhuman Go and

85:27

chess player without any human knowledge

85:30

just by playing against itself. We are

85:34

now at that stage where frontier model

85:36

for code are able to do the same where

85:39

they can create their own challenge.

85:41

They can judge the validity of the

85:43

answer. They can even to some extent

85:45

judge the architecture. So that ability

85:48

to do those hundreds of millions of

85:51

hours of self-play writing code is the

85:54

thing that will bring us to the to the

85:56

next layer. You know it's interesting.

85:58

Um

85:59

do the experiment. Take a a brilliant

86:01

software engineer, lock him in a room,

86:04

lock him or her in a room for two years

86:06

and feed pizza and give the mission you

86:09

need to become a better software

86:10

engineer. What do you do as a person?

86:13

You you give yourself some challenges.

86:15

Challenges that you can verify and you

86:17

keep working and coding on those

86:19

challenges. We can do the same here. So

86:22

this is an issue of how much compute,

86:24

how much self-play time we can have, but

86:27

that will bring the horizon of how far

86:30

we go in superhuman coding.
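
A toy version of the self-play loop described above (invent a challenge that can be verified, attempt it, check the answer) might look like this. Everything in it is an assumption made for illustration: `propose_challenge`, `verify`, `self_play_round`, and the two stand-in solvers are hypothetical names, and in a real system a model would play both the proposer and the solver.

```python
import random
from typing import Callable, List, Tuple

Tests = List[Tuple[int, int]]  # (input, expected output) pairs


def propose_challenge(rng: random.Random) -> Tuple[str, Tests]:
    """Stand-in proposer: invent a task together with mechanically checkable tests."""
    k = rng.randint(2, 9)
    description = f"return x multiplied by {k}"
    tests = [(x, x * k) for x in range(-3, 4)]
    return description, tests


def verify(candidate: Callable[[int], int], tests: Tests) -> bool:
    """A candidate passes only if every test matches and nothing raises."""
    for x, expected in tests:
        try:
            if candidate(x) != expected:
                return False
        except Exception:
            return False
    return True


def self_play_round(rng: random.Random,
                    solver: Callable[[str], Callable[[int], int]]) -> float:
    """One round: propose, solve, verify. The reward is 1.0 on success, else 0.0."""
    description, tests = propose_challenge(rng)
    candidate = solver(description)
    return 1.0 if verify(candidate, tests) else 0.0


def good_solver(description: str) -> Callable[[int], int]:
    """Stand-in for a model that reads the task correctly."""
    k = int(description.split()[-1])
    return lambda x: x * k


def lazy_solver(description: str) -> Callable[[int], int]:
    """Stand-in for a model that ignores the task."""
    return lambda x: x


rng = random.Random(0)
rewards = [self_play_round(rng, good_solver) for _ in range(5)]
```

The point of the sketch is that the reward comes entirely from verification, so no human-written solutions are needed; the quantity that limits progress is how many such rounds you can afford to run.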

86:34

So the economics of code are changing

86:36

dramatically. You know, as I say, we

86:39

developed a whole software engineering

86:41

culture and infrastructure and set of

86:43

companies based on the assumption that

86:46

writing code was the hard part, that

86:48

this was the expensive part. We're now

86:50

in a world where writing code is free or

86:54

nearly free. That's why I've got the

86:55

tilde there.

86:58

That means that the amount of code that

87:01

we're going to see produced is going to

87:03

explode. And there are some hard

87:05

implications to that. First is the

87:07

question of design and adequacy. How in

87:11

front of that mountain of code which

87:13

will be written or written dynamically,

87:15

how do we keep systems which works and

87:17

are reliable at the microscopic level?

87:20

Great role for human.

87:22

It is also the issue that you know we're

87:26

writing code and we're not reading it

87:28

very much anymore. I mean I know we

87:31

still have code review but I would

87:33

predict that in one year we'll let

87:36

Gemini or other model generate the code

87:38

and nobody will actually look at it. You

87:41

know it's similar to compilers who still

87:43

check the assembly output of their

87:45

compiler and maybe someone there but

87:50

that's probably the end of it. So the

87:52

same thing is going to happen to code

87:54

and that brings some question of what

87:56

are the new process that we need to put

87:58

in place to keep that manageable

88:01

and that's where I've got a a bit of a

88:04

list active guard rails. I mean you've

88:07

all seen the news of Mythos looking at a

88:10

piece of code and detecting an

88:13

unreasonable number of vulnerabilities in

88:16

that code.

88:18

there is a rush to go and patch those

88:20

vulnerability but I think that's going

88:23

to be a never ending process you know

88:25

we're going to get a certain layer of

88:27

vulnerability discovered by models we're

88:31

going to fix those models will get

88:33

smarter they will go a bit deeper and

88:35

find even more subtle vulnerability so I

88:38

think that the first aspect is that we

88:40

need to think at least as much about

88:43

code security and the implication of a

88:45

piece of code than on the code writing

88:48

itself and the grail and you know

88:50

something my team is working actively on

88:53

is instead of detecting the

88:56

vulnerability and then suggesting some

88:58

fix how about teaching model to write

89:01

correct things from the start

89:04

and that is very very hard to do because

89:06

it is very context dependent

89:09

the other aspect is that you know that's

89:11

what I call inductive architecture

89:14

uh I think that models today are still

89:17

not very good at transferring knowledge

89:20

of taking knowledge from one domain and

89:23

applying it to another one or taking two

89:26

concepts and finding the intersection of

89:28

those concepts to be

89:31

able to do deductive thinking. If we

89:34

really want to write those very complex

89:35

software system using ML that is a skill

89:38

that we need to teach and you know one

89:40

aspect of that is to really teach models

89:43

how to do correct planning in front of a

89:45

problem. How do you look at a very

89:47

complex problem and decide what is the

89:49

right decomposition of that problem that

89:52

will bring the best clarity or

89:54

correctness to the to the problem.

89:57

We also need to change the way we do

89:59

evaluation. I mean, SWE-bench is

90:03

infamous in my book because

90:06

SWE-bench verifies if a piece of code

90:09

runs and produces the right output.

90:12

That's only a small part of as I

90:15

mentioned earlier of code engineering.

90:17

So for instance, I think that we need

90:19

some problems much more in those

90:21

benchmarks that we use which are

90:24

open-ended problem. I I'll give an

90:26

example. Uh I love the question of text

90:29

compression. How many bits per character

90:32

do you need and how far can you go? So

90:34

that's a very simple eval to to write.

90:36

You just take a piece of 10 megabyte of

90:39

code and you tell the model write the

90:41

best compressor you can that is lossless

90:44

and the loss function in that case will

90:46

be you know the size of the compressed

90:48

file plus the size of the source code

90:50

that's never ending I mean those

90:52

problems are I think what's going to

90:55

force those model to do novel things

90:57

like creating totally new algorithms,

90:59

for instance and I I think we're now

91:01

getting to that stage
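
The compression eval described above is simple enough to sketch. The following is a rough illustration under stated assumptions, not the benchmark the speaker's team uses: the `score_submission` name, the convention that a submission defines `compress` and `decompress`, and the zlib baseline are all made up here. The loss is the size of the compressed file plus the size of the submission's source code, and a submission that is not lossless is rejected.

```python
import zlib


def score_submission(source_code: str, corpus: bytes) -> int:
    """Loss = compressed size + source size, in bytes (lower is better).

    Assumes the submission defines compress(data) and decompress(blob).
    A real harness would run untrusted code in a sandbox, not with exec().
    """
    namespace: dict = {}
    exec(source_code, namespace)
    compressed = namespace["compress"](corpus)
    if namespace["decompress"](compressed) != corpus:
        raise ValueError("submission is not lossless")
    return len(compressed) + len(source_code.encode("utf-8"))


# Hypothetical baseline submission that just wraps zlib.
BASELINE = '''
import zlib
def compress(data): return zlib.compress(data, 9)
def decompress(blob): return zlib.decompress(blob)
'''

# Small stand-in corpus; the talk suggests something like 10 megabytes of code.
corpus = b"def f(x):\n    return x + 1\n" * 1000
loss = score_submission(BASELINE, corpus)
```

Counting the source size in the loss is what keeps the problem open-ended: a submission cannot win by embedding the corpus in its own code, so the only way to improve is a better algorithm.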

91:06

Writing code or doing software

91:08

engineering is not thinking as a chain

91:11

of tokens.

91:13

Thinking and reasoning today is chain of

91:15

thought, which has been, you know, very

91:17

successful and improved models a lot. But

91:20

humans of course are much more complex

91:22

in the way they think about problems. I

91:25

always think that code writing is a very

91:27

visual activity and that can be I don't

91:30

know the block diagram of what you're

91:32

doing or the flow of data through your

91:34

code. uh but saying that code will be

91:37

just a set of token that you emit that

91:40

are going to be the code I think goes

91:42

only up to a certain point that's a very

91:46

interesting aspect to what we do at

91:47

Google Gemini we made that choice from

91:51

the onset that this would be a

91:52

multimodal model that you know text was

91:55

only one of the modality that Gemini

91:57

would be able to apply and we're

92:00

starting to see you know how can a model

92:02

start to think in term of spatial or

92:05

dynamic representation to to solve

92:07

problem and I think that's going to

92:10

become a must have

92:13

another interesting question is is this

92:16

time to create a new language for models

92:20

Python you name it have been invented

92:22

for humans and those language are not

92:26

very good to write safe or reliable code

92:28

I mean they're great to write code but

92:31

they're certainly not the the best thing

92:34

I think We're getting to the point where

92:36

since the pain of writing the code does

92:38

not exist anymore. How about we make

92:41

writing the code much harder by having

92:44

you know very strongly typed languages

92:46

or you know some inspiration from Lean

92:49

on how to write code that by design it's

92:52

not going to be perfect. I mean program

92:54

is something which has some limits but

92:57

at least putting the burden of

92:58

correctness on the model. So I don't

93:01

know if we have some language designers

93:03

here but I I I think there's something

93:04

really to be done there and it doesn't

93:06

need to be human readable. I I don't

93:08

think that that will matter anymore.

93:12

So beyond code um code is a universal

93:16

language to solve problems. I think that

93:19

what we're starting to see is this

93:21

ability to experiment very quickly in

93:23

code is impacting other domain very

93:27

quickly because doing experiment becomes

93:29

basically free. So I think that looking

93:32

at that intersection of code writing and

93:36

atoms or science is another big front

93:39

that we are opening that is the place

93:41

where true novelty is going to appear.

93:46

two which are especially exciting for me

93:48

is chemistry. Um you know as humans we

93:52

do not understand chemistry or we

93:54

understand a very very small sliver of

93:56

chemistry. Once you have more than 20

93:58

atoms in your molecule it's like wow we

94:01

don't know what that thing is going to

94:02

do. I think we're going to see

94:04

incredible things emerging out of that.

94:06

I mean once you are able to put 10,000

94:09

atom together that starts to look like

94:11

life. So what are all the other things

94:13

you can do with 10,000 atoms?

94:16

Biology. You probably heard plenty about

94:18

it, but you know, biology is the case of

94:22

nature did an incredible engineering job

94:24

and terrible job at documentation.

94:27

Um, but we can crack through that now.

94:30

Models are able to find those

94:31

relationship that might be elusive for

94:33

us. So I think that that is something

94:35

that will open incredible door. And then

94:39

there is what I call the gold we cannot

94:41

see. Humans are incredibly biased in

94:44

what we feel is the correct solution. I

94:47

mean, we're the result of an

94:49

evolutionary training that help us

94:50

survive in the jungle, right? Not doing

94:53

quantum computing. So, I think that even

94:56

though we can be brilliant and

94:57

innovative, there are a whole bunch of

95:00

progress and breakthrough that can be

95:02

done which we just cannot see or

95:05

perceive. If I had more time, I would

95:07

give some examples. I think that's one

95:09

of the thing where ML is such a

95:12

different viewpoint on many of those

95:13

problems that we're going to get the oh

95:17

my god this was in front of us the whole

95:18

time and we could not see it. So

95:21

exciting times ahead. Thank you very

95:23

much.

95:32

Ladies and gentlemen, as we continue

95:34

today's program, please welcome back

95:36

your MC, developer advocate at IBM, Tejas

95:40

Kumar.

95:52

What an incredible start to the day.

95:54

Woo! Everybody's leaving. This looks

95:57

amazing from here. Um

96:00

before we break off uh or after um let's

96:04

take a moment and acknowledge the

96:05

sponsors. Honestly, this would not be

96:07

possible without them. We're going to

96:08

get the slides up. Listen, you need to

96:10

give them your biggest round of

96:11

applause. I mean, it is so cool. Thank

96:13

you. Thank you. Thank you. Thank you,

96:16

Microsoft. Thank you to all the other

96:18

sponsors here. This event would not be

96:20

possible without them. There's plenty of

96:22

other things happening um in the other

96:24

stages, but there's no doubt that um

96:27

evals are a huge deal in AI. In fact,

96:30

they're the gate of quality, right? Um

96:32

we can ship a lot of things, but if

96:33

they're not evaled, we ship a lot of slop.

96:36

And so, uh our next discussion, our next

96:39

session is going to be from Aparna

96:42

Dhinakaran from Arize, who's going to talk

96:43

to us a little bit about evals. Please,

96:45

your biggest round of applause for

96:46

Aparna.

96:50

Please welcome to the stage co-founder

96:53

and chief product officer at Arize

96:56

Aparna Dhinakaran.

97:12

Hey everyone, can you all hear me? All

97:15

right, let's go. Oh, let me go one back

97:17

here. Awesome. Well, hey everyone. My

97:20

name is Aparna, one of the founders of

97:21

Arize. We work with some amazing teams

97:24

to help them build evals. Um, and we

97:27

have an incredible lineup of talks for

97:29

you all today at the Evals track. Um,

97:32

it's happening in room 2005

97:35

and there's going to be amazing speakers

97:37

from Turnbench and Uber and Snorkel kind

97:39

of all happening after this. Um, but

97:41

today I'm here to talk to you about the

97:42

future of evals. Evals have gone from

97:46

the new skill that every PM and every AI

97:49

engineer has to learn to the thing that

97:52

every serious AI team is betting on.

97:56

We've been really fortunate to get to

97:57

work with some of the best AI teams in

97:59

the world. So we get a front row seat

98:01

into not just what's happening when

98:03

they're building their actual agents and

98:06

before they actually ship, but actually

98:08

the eval teams are running on their live

98:12

production agent via their traces.

98:15

Little bit of some stats for you guys.

98:16

We run over a 100 million evals every

98:18

month. The average team runs about 12

98:21

different eval jobs with the top teams

98:24

running over 3,800 different evaluators.

98:28

And offline evals, online evals, they

98:30

each have their own place. But today,

98:32

what I'm actually going to talk to you

98:33

about is the teams that are running

98:35

evals on their traces. This is actually

98:38

what's helping teams figure out what's

98:41

working, catch their failures, and

98:43

that's the type of data you need to fuel

98:46

your continual learning loops.

98:48

And the industry kind of agrees. I mean

98:51

all the CPOs of Anthropic, OpenAI, all,

98:54

you know, GDB, you have Garry Tan saying

98:57

evals are everything you need. And the

98:59

whole industry kind of agrees. So we

99:01

added evals. They catch all the

99:03

failures. Right?

99:06

Here's the problem. While we were

99:08

building all of these first-gen evals,

99:10

the thing that we were actually

99:11

evaluating has changed underneath us. In

99:14

2023, it was about just answering a

99:17

prompt. In 2024, we started to see all

99:21

different frontier models. They've added

99:23

tool calls. They've added reasoning.

99:25

They've added deep research. Now, what

99:27

we have is teams running loops on real

99:30

world data with sub agents kicked off on

99:34

um long horizon tasks. Every one of

99:36

these was actually a massive jump in

99:39

complexity. And we didn't just make the

99:42

problem harder. we actually got a

99:43

fundamentally different type of problem.

99:46

What that meant is that as these systems

99:48

got more complex, so did the way that

99:50

they actually fail. We're really lucky

99:53

because we have our own agent that we've

99:54

built, Alyx, that lives in our UI and we

99:56

kind of get to feel this pain

99:59

ourselves. Every time the Frontier Labs

100:01

added new functionality, we added it to

100:03

our agent. And now Alyx has much

100:07

longer memory. It has the ability to

100:09

create dynamic UIs. it can go search

100:12

across an enormous volume of traces. But

100:15

we also realized that it would forget

100:17

context. It wouldn't know when something

100:19

was done. Um, sometimes it would just

100:21

get stuck in these loops. And the key

100:24

thing here is that the classical LLM as

100:26

a judge evals that probably many of you

100:29

have written in this room just weren't

100:32

enough for us to be able to catch all

100:34

the types of failures that we were

100:37

experiencing. I mean, it's just

100:39

fundamentally different, right? you have

100:40

a deterministic flow and now what we

100:43

have is literally every time a user

100:45

interacted with Alyx it would create a

100:46

new UI that's a fundamentally different

100:49

trajectory

100:50

so this led to our really big revelation

100:54

what if the best way to evaluate an

100:56

agent was actually with an agent

101:00

doesn't mean that all of the ways that

101:02

we did eval with deterministic evals

101:05

with LLM as a judge classic eval doesn't

101:07

matter anymore but it just means that we

101:09

have a different type of tool to solve a

101:11

different type of problem. Agent as a

101:13

judge is about adaptive dynamic

101:17

analysis. LLM as a judge just gives you a

101:19

fixed rubric with these fixed scores.

101:21

It's what everyone's doing. But when

101:23

your agent's doing completely different

101:26

trajectories every time a user puts in

101:28

data, it just means that you need a

101:30

fundamentally different type of eval.

101:33

My take is that most teams today are

101:35

doing the first two, but the future of

101:37

eval is actually having all three.

101:41

And today I'm actually excited to share

101:42

we've released agent as a judge um to

101:45

help our teams on their eval journey.

101:48

We've released Signal. Signal is

101:50

actually a long-running agent that can

101:52

read traces sent in, discover patterns of

101:55

issues. Um, it can figure out types of

101:59

problems that a classical LLM as a judge

102:02

eval just would never be able to do with

102:04

these deterministic rubrics. It's helped

102:06

us figure out um very subtle failures

102:09

that you wouldn't even think of doing

102:11

such as something going on in a loop for

102:13

multiple times. It was calling the same

102:15

tool repeatedly for a long time. The

102:17

trajectory was inefficient. And actually

102:20

what this does is because it has all

102:22

that analysis, it can go put up a PR and

102:24

put up a fix. So, if you want to learn

102:27

more, come to our booth.

102:29

We're right by the OpenAI booth. We'll

102:31

give you a demo. We'll show you a bit

102:32

more about it. Um, we're also, like I

102:35

said, taking over the eval track. So,

102:37

come to room 205. We're going to be

102:39

talking a lot about the future of evals

102:42

and what they look like. And if you just

102:44

want to hang out with our team, we're

102:45

throwing a viewing party for the USA

102:48

World Cup uh game tonight. So, uh check

102:50

out the Luma and register to come join

102:52

us. Awesome. Thank you all so much.
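
One of the failure patterns named in this talk, an agent calling the same tool over and over, is easy to check for once you know to look for it. The sketch below shows such a fixed check over a trace; it is not how Signal works, and the `ToolCall` shape, the function names, and the threshold of three are assumptions chosen for illustration. The harder part, which the talk assigns to an agent judge, is discovering patterns like this in traces without being told about them in advance.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical minimal record of one tool call in a trace."""
    name: str
    arguments: str


def longest_repeat(trace: List[ToolCall]) -> Tuple[Optional[ToolCall], int]:
    """Return the call with the longest run of identical consecutive calls,
    together with the length of that run. An empty trace gives (None, 0)."""
    best: Optional[ToolCall] = None
    best_len = 0
    prev: Optional[ToolCall] = None
    run_len = 0
    for call in trace:
        run_len = run_len + 1 if call == prev else 1
        prev = call
        if run_len > best_len:
            best, best_len = call, run_len
    return best, best_len


def flags_loop(trace: List[ToolCall], threshold: int = 3) -> bool:
    """Flag a trace when any identical call repeats `threshold` or more times in a row."""
    return longest_repeat(trace)[1] >= threshold


trace = [
    ToolCall("search", "q=a"),
    ToolCall("search", "q=a"),
    ToolCall("search", "q=a"),
    ToolCall("read", "id=1"),
]
stuck = flags_loop(trace)
```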

103:27

Hi everyone, my name is Gabe Dees Mesa.

103:30

I'm an engineer here at OpenGov and today

103:33

we're going to be talking about agents

103:34

in production, specifically how OpenGov

103:37

built and scaled OG Assist. Uh so um

103:42

this presentation is going to be

103:44

jam-packed with just so much good stuff.

103:47

Uh we're going to talk about uh AI

103:49

agents. We're going to talk about our

103:51

harness. We're going to talk about um

103:54

evals, observability, traces. We're going

103:57

to talk about um tools and skills. Um

104:02

it's there's going to be a lot of good

104:03

stuff in here. We're going to talk to

104:05

you guys about uh what we do at OpenGov

104:08

and how we operate at the scale that uh

104:10

we operate at um in production. So

104:13

you'll be able to see a real use case

104:16

and workload uh with AI agents. Um so

104:19

without further ado, let's get started.

104:23

Okay, agenda. So just really quickly

104:25

going to go through uh high level what

104:28

we're going to talk about today. Uh I'm

104:30

going to tell you guys a little bit

104:31

about OG Assist and what uh OpenGov is.

104:34

I'm going to tell you guys the origin

104:35

story of how this all kind of came to

104:38

be. Uh we're going to talk about OG

104:41

Assist's uh big bet on Effect, uh a

104:45

little bit into our core agent loop. Uh

104:47

we're going to talk about the A2A

104:48

protocol, evals.

104:51

We're going to talk about how we manage

104:53

long context. We're going to talk about

104:55

um monitoring, observability, how we

104:58

collect feedback uh and how we iterate

105:01

on that feedback. We're gonna lastly uh

105:03

also talk about tools and skills and how

105:06

at OpenGov uh we use um AI not only

105:10

externally uh that we uh serve to

105:13

customers but also internally to improve

105:15

our development workflows.

105:18

Just a little bit about me before we go

105:20

any further. My name is Gabe. I'm a

105:22

software engineer here at OpenGov. I work

105:24

on the AI agents team and uh I'm one of

105:27

the folks that helped build uh OG Assist

105:30

and some of the systems that you guys

105:32

will be seeing today.

105:34

So, a little bit about OpenGov. OpenGov is

105:37

a software company uh on a mission to

105:40

power more effective and accountable

105:41

government. Um so, OpenGov sells ERP

105:44

software. That's things like budgeting,

105:46

procurement, asset management, and

105:48

permitting. And um we were founded about

105:51

14 years ago. And what's cool is um we

105:55

have this thing called OG Assist. And OG

105:58

Assist is this little button on the top

106:01

of all of our products in the in the

106:03

navigation bar. And what's cool is um

106:07

all of our product suites and product

106:09

teams um have built tools and skills in

106:14

order to power this button. So, for

106:16

example, if I open up uh this this um if

106:21

I click this button and I open up OG

106:23

Assist, it says, "Hey, um I'm going to

106:25

ask about rate codes, which is very

106:26

specific to utility billing, the current

106:28

product that I'm in." And you can see

106:30

that inside of this kind of chat

106:32

interface, I'm able to speak to an

106:34

agent, and the agent is able to make

106:36

tool calls in order to um look up

106:39

information against data inside of that

106:42

suite. So, it's really cool um to be

106:45

able to kind of first party create these

106:48

experiences uh through the capability

106:50

that we've built called OG Assist.

106:54

Okay, so just a quick story about how

106:57

this all came to be. So, um, a little

107:00

while back, we we we saw that AI was

107:03

really starting to take off and a

107:05

principal, uh, spun up this new team

107:06

called the AI agents team and asked me

107:08

to join and, um, instantly I said yes

107:12

and OG Assist started to to grow and we

107:15

started to integrate, uh, OG Assist into

107:17

all our products and, uh, not only our

107:20

back-end capabilities, but also our

107:21

front-end capabilities as well. So,

107:23

you'll see that one of the capabilities

107:25

that we give the agent is it's able to

107:28

um see what's on the screen and and see

107:31

and and and take action on what's on the

107:34

page. So, you could see that um I'm

107:36

asking the agent here, hey, hey, what's

107:38

on the screen? Can you maybe highlight

107:41

uh some of the next steps that I could

107:42

take? So, you can see that the agent

107:44

here is thinking. It's saying, okay,

107:46

what tools do I have available to use?

107:48

And hey, let me go and highlight

107:50

something that you could actually click

107:52

on and and tell you more about it. So

107:54

just another capability of OG assist and

107:56

just a little short story about how this

107:59

all came to be.

108:01

So the big bet on Effect. Um so I really

108:05

wanted to include this slide because um

108:08

here on the agents team we made a huge

108:10

bet, um, on Effect, and

108:15

suffice to say it's paid off in

108:17

dividends. Um we write Effect. So Effect

108:22

is this library for TypeScript. It's

108:24

open source and it helps you write

108:26

better um TypeScript code. Uh you know

108:29

it's got a lot of uh stuff baked in it

108:32

like a schema, similar to Zod,

108:34

if you've ever used that. It's also got

108:37

um things for error handling uh for

108:40

logging for traces for uh it's just got

108:43

so much in there. It really helps write

108:45

better code and structure your code

108:47

better and helps with architecture,

108:50

spinning up new services for uh and and

108:52

for us on the agents team really helping

108:55

uh design and build the the core agent

108:58

loop. So you'll see throughout this

109:00

presentation sprinkled in um how Effect

109:03

on our team uh has paid off in

109:07

dividends. So we really love Effect

109:09

here at OpenGov and we encourage other

109:11

folks to try it out and um yeah let's

109:14

keep going

109:17

The Effect-native loop. So originally we

109:20

were on LangGraph and that was fine

109:22

until the team really started to scale

109:25

uh and our use cases started to evolve.

109:29

So we decided to move over to our own

109:33

kind of Effect-native agent loop to have

109:35

full agency over this uh agent loop

109:39

such that if we have complex use cases

109:42

or features that we need to build we

109:44

could kind of get in we we had full

109:48

control of the of the agent loop. And

109:50

not only that but now we're fully on

109:51

Effect. So all the cool things you get

109:53

with Effect are now propagated throughout

109:56

the entire agent loop, like the tracing,

109:59

structured concurrency, the logging,

110:01

everything is under more fine-grained control

110:03

and it it really allows us to really

110:06

unlock the full potential uh having our

110:09

own agent loop from the ground up. Um so

110:13

another thing I wanted to mention is on

110:15

the left side you'll see a code example.

110:18

This is really the basics of the Effect

110:21

loop that we're using. Uh we're using

110:23

this thing called the Effect AI package.

110:26

And in that package, there's this thing

110:28

called um there's a chat and a language

110:31

model. So with the chat, you can

110:33

instantiate a chat, for example.

110:35

And then you could stream text using um

110:39

that that kind of stream text function.

110:42

You could pass in a prompt. And what's

110:44

cool is uh with a language model under

110:47

the hood, since we're kind of doing

110:49

dependency injection, we could pass in a

110:53

different language model if we were to

110:55

uh hot swap to another one for example.

110:58

So really just having full control of

111:00

our own agent loop just kind of gives us

111:02

all the levers and it really just

111:04

unlocks the full capabilities of the

111:06

model and uh for the team as well to

111:09

have full agency over this this loop.
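
The slide's code is not captured in this transcript, so here is a minimal sketch of the idea in plain TypeScript: a chat that is handed its language model by the caller, so swapping models does not touch the loop. It deliberately does not use the Effect AI package; `ChatModel`, `makeChat`, `echoModel` and `shoutModel` are hypothetical names made up for illustration.

```typescript
// Hypothetical shape of a model dependency; this is NOT the Effect AI API.
interface ChatModel {
  readonly label: string;
  // Returns the reply as chunks, standing in for a streamed completion.
  streamText(prompt: string): string[];
}

// Two stand-in models so the swap is visible without any network call.
const echoModel: ChatModel = {
  label: "echo",
  streamText: (prompt) => prompt.split(" "),
};
const shoutModel: ChatModel = {
  label: "shout",
  streamText: (prompt) => prompt.toUpperCase().split(" "),
};

// The chat receives its model from the caller (dependency injection),
// so the loop itself never names a concrete provider.
function makeChat(model: ChatModel) {
  const transcript: string[] = [];
  return {
    send(prompt: string): string {
      transcript.push("user: " + prompt);
      const chunks: string[] = [];
      for (const chunk of model.streamText(prompt)) {
        chunks.push(chunk); // a real UI would render each chunk as it arrives
      }
      const reply = chunks.join(" ");
      transcript.push(model.label + ": " + reply);
      return reply;
    },
    // Number of transcript entries recorded so far (user and model lines).
    turns: (): number => transcript.length,
  };
}

// Same loop, different injected model.
const replyA = makeChat(echoModel).send("list rate codes");
const replyB = makeChat(shoutModel).send("list rate codes");
```

The only thing that changes between `replyA` and `replyB` is the object passed to `makeChat`, which is the "hot swap" property described above.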

111:13

Another thing I wanted to mention is the

111:15

agent-to-agent protocol. So here on the

111:18

agents team, we've had a lot of success

111:20

with this protocol. So this protocol

111:22

being the protocol that Google created

111:25

um kind of an open protocol for agents

111:27

to intercommunicate. But um we found

111:30

this very useful for uh defining our

111:33

agent routes for example in the back end

111:37

and our model and our schema to follow

111:40

this kind of uh agent protocol. So we

111:43

modeled so for example there's this

111:44

thing called an agent card which you see

111:47

here and it's got the name of the agent

111:49

a description etc right and having this

111:53

kind of rigorous protocol this rigorous

111:55

spec really helped drive our development

111:59

and drive alignment because you know all

112:01

we had to do was um align with this spec

112:04

and follow this spec and we knew that

112:07

this was kind of the contract that our

112:10

front end and back end would both

112:13

consume and and produce. So, um this uh

112:17

I would say also has been uh very

112:20

helpful for us and and what's really

112:22

cool is A2A has a lot of extensions,

112:24

right? So, you could extend the protocol

112:27

uh add in like metadata. Uh there's also

112:29

A2I

112:31

um so lots of fun stuff uh with A2A

112:35

protocol, but uh this is kind of what's

112:38

worked for us. So, just sharing that

112:40

with with you folks.
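
To make the "contract" idea concrete, here is a sketch of an agent card as a shared TypeScript type plus a runtime check that either side could run. The fields are a simplified subset and the values are invented (the name, the `example.invalid` URL, the skill id); check the A2A specification for the full card before relying on this shape.

```typescript
// Simplified, assumed subset of an A2A-style agent card.
interface AgentSkill {
  id: string;
  name: string;
  description: string;
}

interface AgentCard {
  name: string;
  description: string;
  url: string;
  version: string;
  skills: AgentSkill[];
}

// Hypothetical card for a utility-billing agent; every value is made up.
const utilityBillingCard: AgentCard = {
  name: "Utility Billing Assistant",
  description: "Answers questions about rate codes and billing data.",
  url: "https://example.invalid/agents/utility-billing",
  version: "0.1.0",
  skills: [
    {
      id: "lookup-rate-codes",
      name: "Look up rate codes",
      description: "Finds rate codes for an account.",
    },
  ],
};

// Runtime guard both front end and back end could share, so a payload
// that drifts from the contract is caught at the boundary.
function isAgentCard(value: unknown): value is AgentCard {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const requiredStrings = ["name", "description", "url", "version"];
  const stringsOk = requiredStrings.every(
    (key) => typeof record[key] === "string",
  );
  return stringsOk && Array.isArray(record["skills"]);
}

const cardIsValid = isAgentCard(utilityBillingCard);
const partialIsValid = isAgentCard({ name: "missing everything else" });
```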

112:42

Feedback and evals. So here the quote is

112:46

shipping is the start not the finish. So

112:49

what we do here uh on the agents team is

112:52

we have kind of multiple ways we do

112:54

evals and collect feedback. Um obviously

112:58

you know we'll have folks uh call in or

113:00

or email us or or just let us know and

113:03

tell us but the main way is we have this

113:05

thumbs up and thumbs down mechanism. And

113:07

here uh someone is able to tell us, hey,

113:10

this this worked really well. This was a

113:11

great response or that wasn't a great

113:13

response. And that signal we take and

113:15

we're able to iterate on uh and we could

113:17

take it back and help improve uh you

113:20

know the response in the future. Um we

113:23

also have automated evals. So in

113:26

our CI, we have evals that run against

113:30

real completions. So we could test the

113:31

prompt against hey did it hit some

113:34

tools? Did it do what it's supposed to

113:36

do? And that also helps with our

113:38

accuracy. So, uh those automated evals

113:41

in conjunction with collecting feedback

113:43

really help us um improve our

113:48

our our tools, our skills, um our

113:51

harness and and that's really how how

113:54

we're able to iterate so fast and so

113:56

quickly.
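
A CI eval of the "did it hit the right tools?" kind can be sketched as a pure grading function over a recorded completion. This is an assumed shape, not OpenGov's harness: `EvalCase`, `ToolCall`, `gradeToolUse` and the tool names are hypothetical, and the recorded calls are hard-coded here instead of coming from a real model run.

```typescript
// One tool invocation observed in a completion trace.
interface ToolCall {
  tool: string;
}

// A prompt plus the tools we expect the agent to reach for.
interface EvalCase {
  prompt: string;
  expectedTools: string[];
}

interface EvalResult {
  prompt: string;
  passed: boolean;
  missing: string[];
}

// Passes only if every expected tool appears at least once in the trace.
function gradeToolUse(evalCase: EvalCase, calls: ToolCall[]): EvalResult {
  const missing = evalCase.expectedTools.filter(
    (expected) => !calls.some((call) => call.tool === expected),
  );
  return { prompt: evalCase.prompt, passed: missing.length === 0, missing };
}

const rateCodeCase: EvalCase = {
  prompt: "What rate codes apply to this account?",
  expectedTools: ["lookupRateCodes"],
};

// In CI the calls would come from a real completion; here they are canned.
const goodRun = gradeToolUse(rateCodeCase, [{ tool: "lookupRateCodes" }]);
const badRun = gradeToolUse(rateCodeCase, [{ tool: "searchDocs" }]);
```

A CI job would fail the build when any result has `passed === false`, and `missing` says which tool the prompt no longer triggers.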

113:58

Humans in the loop. So this is a really

114:00

cool feature we built where we

114:02

deterministically interrupt the agent

114:04

loop. If there is a tool call approval

114:07

required. So if an agent tries to make a

114:09

tool call that it needs human approval

114:12

for it'll show this UI and the human uh

114:15

can click accept or reject. So

114:17

explicitly rejecting or explicitly

114:19

accepting uh the action that the agent

114:22

is trying to make. And this ensures that

114:24

uh you know we're building trust and

114:26

also ensuring that uh you know we're

114:29

being safe especially when the agent is

114:31

trying to do a mutating operation and

114:33

always always always making sure that um

114:36

humans are in the driver's seat
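
The approval gate can be sketched as a rule in the loop itself, not a model decision: any call flagged as mutating pauses until a human answers. The names (`ToolRequest`, `runWithApproval`, the `mutating` flag, the tool names) are assumptions for illustration, and the UI is replaced by a callback.

```typescript
type Decision = "accept" | "reject";

interface ToolRequest {
  tool: string;
  mutating: boolean; // assumed flag set by whoever registers the tool
}

interface ToolOutcome {
  tool: string;
  executed: boolean;
  askedHuman: boolean;
}

// Deterministic interrupt: read-only calls run straight through,
// mutating calls always wait for an explicit accept or reject.
function runWithApproval(
  request: ToolRequest,
  askHuman: (request: ToolRequest) => Decision,
): ToolOutcome {
  if (!request.mutating) {
    return { tool: request.tool, executed: true, askedHuman: false };
  }
  const decision = askHuman(request);
  return {
    tool: request.tool,
    executed: decision === "accept",
    askedHuman: true,
  };
}

// Stand-in for the accept/reject UI: this reviewer rejects everything.
const readOutcome = runWithApproval(
  { tool: "lookupRateCodes", mutating: false },
  () => "reject",
);
const writeOutcome = runWithApproval(
  { tool: "updateRateCode", mutating: true },
  () => "reject",
);
```

Because the check keys off the tool's metadata, the model cannot talk its way past it: the read goes through without a prompt, and the write is blocked by the rejection.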

114:40

sandboxing. So, another thing that we uh

114:44

worked on um kind of similar to the

114:46

safety slide we just saw was um whenever

114:50

an agent tries to execute code or tries

114:53

to create files, it does so in a

114:55

sandbox. So, we gave our agent sort

115:30

All

115:55

right. All right. Hello everyone. Really

115:58

excited to be here. It's a big room.

116:01

Very uh very cool conference so far. Uh

116:04

I want to talk to you today about

116:06

something that's been on my mind for

116:08

many many years. This is actually the

116:09

first time I I talk about it. Sort of my

116:11

version of going to Mars. Um and that is

116:14

the Eureka machine. A machine that will

116:17

eventually invent pretty much all future

116:20

inventions for humanity. Uh and the way

116:23

we're going to get there is uh by taking

116:26

a step back and thinking about what else

116:28

has given us a lot of really incredible

116:30

inventions uh namely evolution and how

116:33

that leads us to automating research and

116:36

pushing the scientific frontier forward.

116:39

And this is uh joint work with a lot of

116:42

uh amazing folks uh at recursive.com

116:45

uh and even some uh folks at AIX

116:48

Ventures. And some of these slides are

116:50

uh actually inspired by uh and taken uh

116:52

partially from one of my co-founders at

116:54

Recursive, Tim Rocktäschel.

116:56

So uh why do I talk about evolution and

116:59

why is it so important? Uh, I think

117:02

basically evolution is this like

117:04

open-ended process that has gotten us to

117:07

a lot of different things that we really

117:09

like. Uh, it started in biology. It's

117:12

moving to science, technology, and

117:13

eventually AI. And I think it can inspire

117:15

us in a lot of different ways to build

117:18

better AI systems as well. In fact, uh,

117:22

whenever we take out and there's this

117:24

famous saying, whenever I fire a

117:26

linguist, my accuracy goes up. Uh I

117:29

think that's true for machine

117:30

translation back in the day. And it may

117:32

be true that we should fire all the AI

117:35

engineers uh and that that are here uh

117:38

and have them mostly manage an actual AI

117:43

engineer that is AI and works on AI. Uh

117:46

and so that may be uh one of the

117:48

conclusions of this talk. Uh, and I

117:51

think most of us are going to be excited

117:52

about it because it means that we'll all

117:54

become managers of such an AI rather

117:57

than having to do the nitty-gritty

117:59

ourselves. All right, so let's start

118:01

with evolution, right? The really really

118:03

big picture, three and a half billion

118:05

years or so. Uh this is kind of the

118:08

incredible process uh that has led from

118:11

you know simple bacteria and plants and

118:14

fish and amphibians and so on to after

118:17

many billions of years us. Right? That's

118:21

that's a good starting point. That gives

118:22

us some indication that evolutionary

118:25

processes can do pretty amazing things.

118:27

Right? But now let's zoom in and uh go

118:30

maybe down to a few million years. There

118:33

we can also see how in the very first

118:36

primitive ways technological evolution

118:39

has basically increased the world uh

118:42

sort of product uh in terms of monetary

118:45

value. It's a little bit harder to

118:47

estimate in the beginning, but we can

118:49

see these sort of sequences of

118:51

exponentials and most exponentials

118:53

eventually become S-curves. They flatten

118:55

out. But humanity has done pretty well

118:58

by basically developing uh many of these

119:02

very basic technologies, hunting,

119:03

farming, but then also thinking about

119:05

science, the scientific method um in the

119:08

early days of the enlightenment and of

119:10

course the industrial revolution. So now

119:12

we can zoom even further. Uh and no

119:14

worries, we're eventually going to get

119:15

to nanochat and actual autoresearch and

119:17

and what we're doing. Uh it's a very

119:19

very quick zoom. Um and now we can zoom

119:22

down to the last few thousands of years.

119:25

And what we're seeing there is that with

119:27

more technology, we were able to sustain

119:30

more people, right? So when we're

119:32

working on pushing that frontier

119:33

forward, uh we're very certain that that

119:35

will lead to more human flourishing,

119:38

right? And especially in the last few uh

119:41

hundred years, we're seeing this

119:43

incredible explosion in the population

119:45

of people because of technology and the

119:49

evolution uh that it brings and in many

119:51

cases that evolutionary process is run

119:54

by us. So it's sort of conscious uh but

119:56

there are sort of interesting uh

119:58

inspirations that we can take from that

120:00

as we're thinking about the evolution of

120:01

AI in the next cycles. uh in fact and I

120:05

might not agree with everything with

120:07

Marc Andreessen but uh he is very smart

120:09

and we agree on a lot of things. Uh and

120:11

so I think he wrote this really great uh

120:14

Techno-Optimist Manifesto in which he I

120:16

think correctly points out that the only

120:18

perpetual source of growth for the

120:20

entire economy. A lot of people worry

120:22

about AI taking jobs and things like

120:24

that but the truth is it will very very

120:26

likely increase uh the economy massively

120:29

and that will benefit a lot of

120:31

us. And so the perpetual source of

120:33

growth is technology. Uh in fact we can

120:36

go even further and say that there's no

120:39

material problem and again it's not sort

120:41

of psychological problems and things

120:42

like that but no material problems uh

120:45

that cannot be solved with even more

120:46

technology. Right? We have a problem of

120:49

starvation: we invented the Green

120:50

Revolution. Darkness: light. Uh, cold:

120:54

indoor heating. Heat: air conditioning,

120:56

and the list goes on. So I think we can

121:00

kind of realize that this evolutionary

121:02

process has been going on for a very

121:04

long time and continues to make a huge

121:06

amount of progress. In fact, the

121:08

progress is so fast that there can

121:12

within one lifetime be a major major

121:15

shift. Right? If you're born in 1900, uh

121:19

then three years later, when you're three years

121:20

old, the first human ever was able to

121:23

thanks to the Wright brothers kind of

121:25

have sustained powered flight. And then

121:28

about 60ish years later in 1969,

121:32

humans flew all the way to the moon.

121:35

Right? So that within one lifetime,

121:37

humanity went from like no one can fly

121:40

for a very long time other than sort of

121:42

gliding down a hill or something. No one

121:44

can really fly to we all fly to the

121:47

moon, right? And so for us, I think what

121:50

that means is we're probably, and I

121:53

sometimes say this, we're like too late

121:55

to explore Earth. We're too early to

121:57

explore the stars, but we're right on

121:59

time to build an AI that could actually

122:03

do what flying did, in one

122:06

lifetime, but for intelligence. We can

122:08

build and move from AI being worse at

122:11

everything that we do to possibly being

122:14

better at any specific task that we do,

122:16

right? And that that will probably be

122:18

our our 60-year time frame. And because

122:20

everything moves faster, it might only

122:22

be 30 years or so. So then uh there's an

122:26

interesting connection between

122:27

technology and science and theory right

122:29

like sometimes the application comes

122:31

first and then we develop the theory

122:33

later and then improve the technology

122:35

sometimes the theory comes first and

122:37

from that we can build new kinds of

122:39

technologies and so it's very helpful to

122:42

think a little bit about the philosophy

122:44

of science, and there's no one better to be

122:46

inspired by there than Popper, who wrote that

122:50

just like in other types of evolution

122:52

when we choose a theory We also choose

122:54

one that is best uh in competition with

122:57

other theories. Of course, you need if

122:58

you wanted LLMs to do that, they need to

123:00

find them. You need web search for

123:02

instance. Um but uh in the theory that

123:06

best holds its own uh it's one that just

123:09

like evolution has a certain natural

123:11

selection process, right? It proves

123:13

itself. Uh and there is also a sort of

123:16

survival of the fittest going on in

123:18

scientific theories.

123:20

And uh in fact uh a lot of science

123:23

according to Popper is basically us

123:25

proposing a new theory hypothesis or

123:28

explanation or description and then

123:30

subjecting it to rigorous empirical

123:32

testing. That is the uh essentially

123:36

evolution evolutionary pressure of

123:39

scientific theories.

123:42

And basically that was a very short uh

123:45

run through uh sort of the history of of

123:48

open-ended evolution uh which hopefully

123:50

makes us all realize that more science

123:52

will lead to more technology which will

123:54

lead to more growth which will lead to

123:55

more human flourishing. And so that then

123:58

begs the question does it make sense for

124:00

us uh to try to just scale up and spend

124:03

a lot of our resources as humanity to

124:05

scale up scientific discovery in order

124:07

to lead uh to this flourishing.

124:10

uh when when you double click into that

124:12

you kind of realize, um, which Stanisław Lem uh

124:15

already realized a long time ago uh that

124:17

the exponential growth of science will

124:20

actually be at some point halted by the

124:21

lack of people working on it right

124:23

there's so many niche subfields now in

124:26

all the different areas of science that

124:28

it is very hard to get a million people to

124:30

work on that particular thing uh and so

124:33

as a result of this incredible widening

124:35

of the scope he says uh the number of

124:38

people focusing on any single section of

124:40

it has decreased. And that then leads us

124:44

to really thinking about how could we

124:46

automate this and automate scientific

124:48

discovery. And that then leads us to

124:51

what I call the Eureka machine. This is

124:54

basically uh our attempt at trying to

124:59

build a machine that automates the

125:01

process of scientific discoveries. And

125:04

uh in fact I like in a couple months

125:05

I'll have a book coming out on on this

125:07

uh exact idea. Uh and so I'll just give

125:10

you a super high level highlight of how

125:12

such a Eureka machine could be built for

125:14

basically everything from physics,

125:15

chemistry, biology, neuroscience,

125:17

medicine, uh economics, astrophysics and

125:20

so on. And there are essentially four

125:22

pillars that are all extremely important

125:24

to this machine. One is of course you

125:26

have to understand what knowledge is

125:29

already out there. Uh what uh things

125:31

humanity has already invented. uh you

125:34

have to get all the scientific

125:36

measurement uh data into this machine as

125:39

the second pillar. Uh then for things

125:42

that you cannot yet measure we don't yet

125:46

know you should try to then build

125:48

simulations. Anything you can simulate

125:50

you can verify and you can then solve

125:53

with AI. Uh and if all else fails or at

125:58

the very end of these processes, you

125:59

still need to have some kind of uh

126:01

physical industrial like a lab uh that

126:04

actually can run real experiments in the

126:05

real world. And on top of all of this uh

126:08

you'll have basically uh an agent swarm

126:11

that will deal with all of these

126:14

different sources of knowledge and data

126:16

and experimentations and and rewards. Uh

126:20

and in terms of you know the

126:21

foundational model of knowledge of

126:23

course we also you know it basically is

126:26

is a good example of how every single

126:30

technology we've built so far especially

126:32

in AI but also before that the internet

126:35

browsers GPUs and so on we can rethink

126:38

and there are a lot of startups possible

126:40

in rethinking every single one of the

126:42

layers of technology as infrastructure

126:45

for superintelligence, right? At You.com,

126:48

for instance we work on web search for

126:50

LLMs, right, and agents and so on. Uh,

126:53

and that actually is quite different,

126:55

right? Uh, agents can read thousands of

126:58

very long snippets um, rather than just

127:00

10 blue links with like a very short

127:02

snippet. And so you can rethink each of

127:04

these different uh, layers of technology

127:07

that we've built for people uh, and uh,

127:10

rebuild them for AI in order to use them

127:13

as tools to then build uh, super

127:15

intelligence.

127:17

Now that is essentially uh the sort of

127:20

why like like we want to build super

127:24

intelligence in order to automate

127:26

science. Uh and to me that will be the

127:29

next big step function change uh in in

127:33

humanity uh and technology as we know

127:36

it.

127:37

Now how do we actually build it? Uh I

127:40

think the best way to build it is to

127:43

have it built itself. Right? We moved as

127:45

a field and especially natural language

127:46

processing for instance which I've

127:48

worked on for many years. We moved from

127:50

not having linguists, this feels like

127:53

ancient, you know, BC uh history, uh but

127:56

before ChatGPT, um we moved from

127:59

having linguists tell us a bunch of

128:01

things about language and then training

128:02

statistical models on top of that. And

128:04

when we allowed neural networks to

128:06

actually automate learning those

128:08

features with word vectors and uh other

128:10

neural network architectures and

128:11

uh end-to-end learning and

128:14

back propagation, we basically uh were

128:17

able to get much bigger improvements. Uh

128:19

then we did a bunch of architecture

128:21

engineering. Now a bunch of people at

128:23

least are working on a unified

128:24

architecture. Uh but even that unified

128:27

architecture has a lot of manual

128:28

processes. And so it's clear over and

128:31

over again in AI that when we take out a

128:32

manual process and we replace it with a

128:35

learned system, improvements will

128:37

follow. Uh and so that's why I think uh

128:41

we should try to build this Eureka machine

128:43

by having an RSI that builds itself. And

128:46

the beauty is that

128:49

only now um AI can actually do this

128:54

because AI is code and AI can code. Now

128:57

this this ability to really code in

128:59

longer and longer time horizons has

129:01

really only happened in the last like

129:04

six to eight months and that now enables

129:06

such an RSI to work on itself to develop

129:09

almost a certain sense of self-awareness

129:11

of its own shortcomings and then fix

129:13

those shortcomings. Uh and then once we

129:16

have that machine that has gotten really

129:18

really good at doing research in AI

129:20

itself, we can then use it to do AI

129:22

research for a lot of other things uh in

129:25

in other scientific fields. And so at a

129:29

high level it's quite easy right we have

129:31

three steps ideation implementation and

129:34

validation of ideas. That's true for

129:36

basically almost every scientific field.
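
The three steps can be read as a keep-the-winner loop. The toy below is only an illustration of that structure, under the assumption that an "idea" is a single number and "validation" is a lower loss; the real system's proposal and evaluation steps are not described in the talk, and `researchLoop` and its parameters are hypothetical names.

```typescript
interface Candidate {
  value: number;
  loss: number;
}

// Generic ideate / implement / validate loop that keeps only improvements.
function researchLoop(
  start: number,
  rounds: number,
  propose: (best: Candidate, round: number) => number,
  implement: (value: number) => number,
): Candidate {
  let best: Candidate = { value: start, loss: implement(start) };
  for (let round = 0; round < rounds; round++) {
    const value = propose(best, round); // ideation: suggest a variation
    const loss = implement(value); // implementation: run it and measure
    if (loss < best.loss) {
      best = { value, loss }; // validation: keep it only if it wins
    }
  }
  return best;
}

// Toy problem: find the value that minimizes (value - 3)^2, starting at 0.
// Proposals alternate between stepping up and stepping down from the best.
const found = researchLoop(
  0,
  6,
  (best, round) => best.value + (round % 2 === 0 ? 1 : -1),
  (value) => (value - 3) * (value - 3),
);
```

Downward proposals are always rejected here because they raise the loss, which is the selection pressure from the evolution analogy earlier in the talk.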

129:40

And so uh to end maybe on some very

129:43

specific examples uh we have built this

129:46

first kind of version of such a Eureka

129:49

machine uh and we wanted to just show

129:52

that it works on some small uh samples

129:54

that a lot of people know and are aware

129:57

of. And so we basically started uh with

130:00

three things that show you and give you

130:02

a very first glimpse of and and sort of

130:05

simple proof points uh of what such a

130:08

machinery can do. And that was basically

130:10

better training, faster training and and

130:12

better kernels uh for for Nvidia GPUs.

130:15

Um the first one, nanochat, um I'm sure

130:18

many of you have heard of it. A lot of

130:20

people think that's already recursive

130:21

self-improvement and it is kind of a

130:23

weak form in the sense that usually when

130:26

you do auto research it's it's not

130:28

recursive self-improvement, right? True

130:30

recursive self-improvement is when you

130:32

have an AI that has a sense of

130:34

self-awareness of its own shortcomings,

130:36

full access over everything uh in its

130:39

arsenal from pre-training to RL training

130:42

and harnesses and everything and then

130:44

actually updates that entire system in

130:46

the next version of itself. Now you can

130:49

also take such a system and just ask it

130:52

to improve some other process some other

130:53

AI like a small nanochat run where you can

130:56

train something in five minutes and that

130:58

is really exciting. It's an important

131:00

milestone but it's not actual RSI. So

131:02

here we basically showed three examples of

131:06

such an auto research um uh system and

131:09

what it can do and uh after a very very

131:12

short time it essentially was able to

131:14

outperform

131:15

many uh different teams and teams that

131:18

also use uh other AI research. So let's

131:21

double click into some of these. Nanochat

131:24

is a really exciting example. Uh basically

131:26

you train a very small uh chat model uh

131:30

in less than uh five minutes and you

131:33

basically want to have it get to the

131:36

best possible bits-per-byte uh number.

131:41

And so the whole community had worked on

131:43

this uh for uh quite some time and got

131:46

to uh 0.93.

131:48

And after training this for a little

131:51

more than a day or two, uh, we basically

131:54

got it down to 0.91.
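
For readers unfamiliar with the metric: bits per byte is the model's total negative log-likelihood on a text, converted from nats to bits and divided by the text's length in bytes, so lower is better and it does not depend on the tokenizer. A small sketch, with hypothetical function names, that also puts the 0.93 to 0.91 change in relative terms (roughly a 2% reduction):

```typescript
// Total negative log-likelihood (in nats) over a text of `byteCount` bytes.
// Dividing by ln 2 converts nats to bits.
function bitsPerByte(totalNllNats: number, byteCount: number): number {
  if (byteCount <= 0) throw new Error("byteCount must be positive");
  return totalNllNats / (Math.LN2 * byteCount);
}

// Fractional reduction going from `before` to `after` (lower is better).
function relativeImprovement(before: number, after: number): number {
  return (before - after) / before;
}

// The two figures quoted in the talk.
const gain = relativeImprovement(0.93, 0.91);

// Sanity check of the unit conversion: ln 2 nats per byte is 1 bit per byte.
const oneBit = bitsPerByte(Math.LN2 * 100, 100);
```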

131:57

Um, which is pretty exciting. Now, it

131:59

wouldn't be that exciting if all it did

132:01

was just find a couple of

132:02

hyperparameters um and tune them

132:04

carefully, but it actually did find

132:07

truly interesting novel ideas like hashed

132:11

bigram and trigram embeddings and tables

132:13

for those uh and mixing that into

132:17

various uh value paths of uh the

132:19

attention through a variety of learned

132:21

gates. So, it actually started doing

132:23

more and more interesting things rather

132:25

than just kind of tuning

132:26

hyperparameters.
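
The talk does not show what the system actually wrote, so the following is only a generic sketch of what a hashed bigram embedding with a gate can look like: the previous and current token ids are hashed into a fixed-size table, and the looked-up row is added to an existing vector scaled by a gate. The hash constants, table size, and names (`bigramBucket`, `mixGated`) are assumptions, and in a real model the table and gate would be learned parameters.

```typescript
// Hash a (previous token, current token) pair into a table row index.
// Math.imul keeps the multiplication in 32-bit integer arithmetic.
function bigramBucket(prev: number, curr: number, tableSize: number): number {
  const h = Math.imul(prev, 0x9e3779b1) ^ Math.imul(curr, 0x85ebca6b);
  return (h >>> 0) % tableSize; // >>> 0 makes the hash non-negative
}

// Add the extra vector to the base vector, scaled by a gate in [0, 1].
function mixGated(base: number[], extra: number[], gate: number): number[] {
  return base.map((x, i) => x + gate * (extra[i] ?? 0));
}

// Tiny stand-in embedding table: 8 rows of 2 numbers each.
const tableSize = 8;
const table: number[][] = [];
for (let row = 0; row < tableSize; row++) {
  table.push([row, row * 2]);
}

// Look up the row for the token pair (17, 42) and mix it into a vector.
const bucket = bigramBucket(17, 42, tableSize);
const mixedVector = mixGated([1, 2], table[bucket] ?? [0, 0], 0.5);
```

Hashing lets a small table cover every possible token pair at the cost of collisions, and the gate lets training decide how much of the n-gram signal to let through.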

132:28

Um, another one, the NanoGPT speedrun. Uh

132:31

obviously speed is very important. Uh so

132:33

here we're able to work on this again,

132:36

apply the system and after a very short

132:38

amount of time it got better than uh

132:41

people working often together with the

132:43

AI for over a year uh on on this very on

132:47

this benchmark and made the whole thing

132:49

another two seconds over two seconds

132:51

faster um at 70 seconds and again

132:55

discovering uh very interesting ideas in

132:58

the process.

133:00

And then the third one is CUDA kernels.

133:02

Of course, we all care about not burning

133:04

through our GPU budgets too quickly. Um

133:07

uh and trying to be very efficient. I

133:09

think in general, it's actually kind of

133:10

shocking how inefficient a lot of

133:12

mixture of expert models still are run

133:14

in very large clusters that cost

133:16

billions of dollars and then only have

133:18

like 30% or so utilization. There's a

133:21

lot of work that's ongoing in the world

133:23

uh to improve that. and different fields

133:25

uh or different groups of people or

133:28

various different um yeah stages of

133:30

that. Uh but long story short um lots of

133:33

different cuda kernels are used during

133:34

training and testing and here um we

133:37

basically again took that system and

133:39

after uh a couple days it discovered

133:43

better kernels uh than the leaderboard's

133:46

best uh on the NVIDIA uh benchmark

133:49

website u by again quite quite a sizable

133:52

margin across all the different uh

133:55

categories of of those kernels. And

133:58

while we are pretty good at AI and like

134:00

we actually in the team didn't have any

134:03

particular CUDA kernel experts who just

134:06

spent their entire careers writing good

134:08

kernels. uh but still you know we do

134:11

just enough to make sure and worked

134:13

together with Nvidia to make sure that

134:14

there are no reward hacks here and and

134:16

other issues but actually found uh that

134:20

eventually these all checked out and

134:22

were indeed uh pretty much all the

134:24

different kernels uh found the best

134:26

solutions there and so with that I hope

134:29

I could convince you uh that indeed RSI

134:33

could be that next big uh S-curve, um an

134:36

exponential that gets layered uh on on

134:38

top of previous exponentials and uh that

134:42

should help us uh with not just AI but

134:45

eventually science and then all of

134:47

technology and then uh allowing many

134:50

more people uh to flourish on our

134:52

planet. Uh and so maybe I'll end on this

134:55

note here which is uh a lot of people

134:58

wonder how much longer AI can go right

135:00

every exponential eventually flattens

135:02

out and um it's actually quite hard to

135:04

know like when we even talk about

135:06

exponential growth in AI what does that

135:08

even mean there are many different I

135:09

call them spaces of intelligence and we

135:12

won't have time to go into all of all of

135:13

these but as soon as you actually try to

135:16

define multiple different dimensions of

135:18

each of these 10 spaces uh that make up

135:21

this complex like sort of volumetric uh

135:25

thing that is intelligence. You'll

135:26

realize that there's still so much more

135:30

to go like on the upper bounds of

135:31

intelligence. We're still astronomically

135:33

far away from reaching those uh across

135:37

pretty much every single one of uh these

135:40

dimensions and the spaces uh that they

135:42

make up. Uh so if any of that is

135:45

interesting uh and you want to help us

135:46

build that um we'd love to hear from

135:49

you. Thank you.

136:04

Hey everyone, my name is Nishan Gupta

136:07

and I'm a software engineering tech lead

136:09

at Meta working on building the training

136:11

and inference infrastructure for the

136:14

Meta Superintelligence Labs and their

136:16

infrastructure organization.

136:18

Today we're going to be talking about

136:20

production evals for agentic systems.

136:23

When most people hear the word

136:24

evaluation, they think about benchmarks.

136:26

A model scores 90% on a benchmark. A new

136:29

version scores 92%. The team celebrates.

136:32

But agentic systems have fundamentally

136:34

changed what the evaluation means. Today

136:37

the systems don't simply generate

136:39

answers. They plan, they call tools,

136:41

they retrieve information. They execute

136:43

workflows. They interact with the

136:45

production infrastructure. The question

136:47

is no longer did the model generate the

136:49

right answer. The question is did the

136:51

system behave correctly. Today I would

136:53

like to discuss how evaluation is

136:56

evolving from model benchmarking into

136:58

production infrastructure.

137:03

This is the problem almost every AI

137:05

organization is encountering today.

137:07

Offline benchmarks continue improving

137:10

yet production reliability often remains

137:11

unpredictable. Why is that? Because

137:14

benchmarks measure model capability.

137:16

Production measures system behavior. A

137:19

benchmark doesn't capture tool failure,

137:21

API outage, context changes, user

137:24

variability, long-running workflows. And

137:26

as systems become more autonomous, the

137:28

gap between the benchmark performance

137:30

and production performance grows. The

137:32

result is what many teams experience

137:34

today. High benchmark scores as you can

137:37

see, but unreliable production behavior.

137:42

Traditional LLM evaluation focuses on

137:44

outputs.

137:46

But we should ask the question, did the

137:47

model produce a correct answer? Agentic

137:49

systems force us to ask a different

137:51

question. Did the system behave

137:53

correctly? Behavior includes planning

137:56

quality, tool usage, execution, workflow

137:59

execution, recovery from failures,

138:01

decision making. In other words, we are

138:03

moving from evaluating answers to

138:05

evaluating workflows. And that requires

138:08

fundamentally different evaluation

138:09

architectures.

138:12

Many teams still think hallucinations

138:14

are the primary AI failure modes. In

138:16

production, they are often just one

138:17

category. Agentic systems introduce an

138:20

entire hierarchy of failure modes. At

138:23

the very foundation, the memory

138:25

failures, retrieval failures, safety

138:26

failures. As you go up, you have to

138:29

think about reasoning mistakes, poor

138:31

planning, incorrect execution. At the

138:33

highest layer, you have to think about

138:34

multi-agent coordination failures. And

138:36

this is why evaluating only model output

138:40

misses most of the production risk we

138:42

observe.

138:44

One of the most useful mindset shifts is

138:46

to stop thinking like researchers and

138:48

start thinking like an SRE or a production

138:51

engineer. SREs don't measure success using

138:54

accuracy. They measure reliability,

138:56

availability, latency, cost, recovery. And

138:58

agentic systems require the same

138:59

approach. The goal is not maximizing the

139:02

benchmark scores. The goal is to

139:04

maximize dependable outcomes. Reliability

139:06

becomes the north-star metric.

139:25

values limited. In the middle there are

139:27

scenario-based evaluations. These

139:29

simulate realistic workflows. And at the

139:32

very top you see production telemetry.

139:34

This is where the highest value

139:35

evaluation signals come from. The

139:37

surprising insight is that the most

139:39

valuable evaluation data often comes from real

139:41

users interacting with real systems.

139:46

Now let's talk about offline. So offline

139:48

evaluation still matters, but the

139:50

methodology changes. Instead of

139:52

evaluating prompts we evaluate

139:53

scenarios. For example, a customer

139:55

support workflow, a code generation

139:57

workflow, a research workflow. The agent

139:58

operates inside that simulated

140:00

environment. We measure the task

140:01

completion rate, tool correctness,

140:03

planning quality, resource usage which

140:05

which becomes exponentially high at

140:08

high scale. The key takeaway: agentic

140:11

evaluation should be scenario driven not

140:12

prompt driven.
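A minimal sketch of what scenario-level scoring could look like, assuming each simulated workflow run is logged as one record. The `ScenarioRun` fields and the sample numbers are hypothetical and only illustrate the metrics named above (task completion rate, tool correctness, resource usage); this is not the speaker's actual tooling.

```python
from dataclasses import dataclass

@dataclass
class ScenarioRun:
    """Outcome of one simulated workflow (all fields are hypothetical)."""
    scenario: str            # e.g. "customer_support"
    completed: bool          # did the agent finish the task?
    tool_calls: int          # total tool calls made
    correct_tool_calls: int  # tool calls judged correct
    tokens_used: int         # proxy for resource usage

def summarize(runs):
    """Aggregate scenario runs into scenario-level metrics."""
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "task_completion_rate": sum(r.completed for r in runs) / len(runs),
        "tool_correctness": (
            sum(r.correct_tool_calls for r in runs) / total_calls
            if total_calls else 1.0
        ),
        "mean_tokens": sum(r.tokens_used for r in runs) / len(runs),
    }

# Made-up runs, one per workflow type mentioned in the talk.
runs = [
    ScenarioRun("customer_support", True, 4, 4, 1200),
    ScenarioRun("code_generation", False, 6, 3, 5400),
    ScenarioRun("research", True, 10, 9, 9000),
]
report = summarize(runs)
print(report)
```

The unit of evaluation here is a whole run of a workflow, not a single prompt and answer pair.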

140:15

Once a system reaches production, every

140:17

interaction becomes a signal. This is

140:20

one of the biggest shifts in evaluation

140:21

thinking.

140:34

Oh, all right. Uh, all right. So, can

140:37

everyone see the uh slides? Oh, nice.

140:40

All right. So, good morning everyone.

140:41

Thanks so much for being here. Uh, my

140:43

name is Han Xiao. I founded and ran Jina AI

140:45

from uh 2020 to 2025, and last October

140:48

we were acquired by Elastic. So, now I'm

140:50

running a model inference and training

140:52

team there. And uh uh so here's a

140:54

question I want to answer today. Uh so

140:56

big models get better by thinking at

140:59

inference time. Right? So we call that

141:00

test time compute. And can small

141:03

retrieval models do the same thing, right?

141:04

Can it get better by thinking harder at

141:06

inference uh without making the model

141:09

any bigger? Uh to find that out, I let

141:11

the agent run auto research overnight

141:13

and the answer turned out to be more

141:14

interesting than yes or no. Right? So

141:17

let me show you what I found out. So

141:19

first let me say what test time compute

141:21

is. So the idea is very simple. So

141:23

instead of training a bigger model, you

141:25

spend more compute at inference time. So

141:27

you get better answer back. Uh it shows

141:30

up in very familiar forms, uh such as

141:32

best-of-N sampling, self-consistency or

141:35

verifiers that rerank the candidates. So

141:37

Noam Brown from OpenAI uh put a number on

141:40

this. He found that a poker bot uh

141:42

thinking for 20 seconds uh got the same

141:45

boost as scaling the model by 100,000

141:47

times. Uh so that's the promise of test

141:49

time compute. So the real question for

141:51

us here is: does this promise

141:53

also hold for search?
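As a rough illustration of the best-of-N form of test-time compute mentioned above, here is a minimal sketch. The `generate` and `verify` functions are toy stand-ins (guessing a square root of 2 and scoring the guess), not a real model or verifier.

```python
import random

def best_of_n(generate, verify, n, rng):
    """Sample n candidates and keep the one the verifier scores highest."""
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=verify)

def generate(rng):
    """Toy 'model': guess a number between 0 and 2."""
    return rng.uniform(0.0, 2.0)

def verify(x):
    """Toy 'verifier': higher is better, best when x * x == 2."""
    return -abs(x * x - 2.0)

# Same seed, so the first candidate is identical in both runs;
# the larger budget can only match or improve the verifier score.
one_shot = best_of_n(generate, verify, 1, random.Random(0))
best_64 = best_of_n(generate, verify, 64, random.Random(0))
print(one_shot, best_64)
```

The model is unchanged between the two calls; only the inference budget `n` differs.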

141:56

So here's a reframe that turns this into

141:58

a retrieval talk. Uh search is already

142:01

test time compute. Uh so think about

142:03

what you do when you build search. You

142:05

take a trained embedder, a trained reranker,

142:07

some multi-vector retriever and a query

142:10

expander and then you wire them into a

142:13

pipeline. So you are spending inference

142:16

to buy relevance and you are not

142:18

reaching for bigger model. You're

142:20

basically assembling more search at test

142:22

time. So the real question isn't whether

142:24

your model is big enough. It is how much

142:26

pipeline can you assemble uh at

142:29

inference and whether that pays off.

142:32

So there are two versions uh two ways to

142:34

build that pipeline and I will show you

142:36

both. The first one,

142:38

version A, uh is the one I will go deep

142:41

on. So here an agent writes a little

142:43

program over a single frozen embedder or

142:46

encoder. It might chunk the document uh

142:49

do the scoring, fuse uh with different

142:51

scoring strategies and feed the results

142:53

back. So think of think of it as a

142:56

multipass uh algebra over embeddings.

142:59

The second one version B uh I will come

143:01

to later. So there a small agent wires

143:03

up the retrieval tools like grep, embed,

143:06

rerank over a corpus given a fixed uh

143:09

token budget. So it's the same idea

143:11

implemented at two different levels. So

143:14

let's start with version A.

143:16

So version A runs uh runs over a small

143:19

frozen encoder. So there the common

143:21

belief is that small models cannot

143:24

improve there and test time compute

143:27

exclusively belongs to the big reasoning

143:30

models. But let's look at what today's

143:32

embedders come from. Models such as E5 uh

143:35

Mistral, uh Qwen3 Embedding, Embedding

143:38

Gemma and even our own Jina Embeddings

143:40

v5, they are all distilled from large

143:43

language model backbones so that's the

143:46

dominant recipe today. And if test-time

143:49

compute lives in the LLM representation

143:51

space, then these distilled models should

143:54

somehow inherit it or do they so that's

143:58

exactly the question I want to find out.

144:01

So here's the intuition of uh for how a

144:04

frozen model, a frozen embedder could

144:06

improve at test time. Uh let let's look

144:08

at the three panels. Uh let's go from

144:10

the left to the right. Uh

144:14

so we go from the simplest way to score

144:16

a match on the left and to the most

144:18

detailed way on the right. So on the

144:20

left you have a single cosine distance

144:22

which is basically one vector per

144:23

document and one per query. So that's a

144:26

frozen cosine baseline. On the right you

144:28

have this ColBERT-style late

144:29

interaction where every query token is

144:32

matched against every document token. So

144:35

one can consider ColBERT as an extreme

144:38

case of test time compute. Uh the

144:40

interesting part of course is

144:42

in the middle panel where I have

144:44

outlined in blue. So you can take a

144:47

frozen uh you can take the same

144:50

frozen encoder split the document into

144:52

sentences and max over them. So that's

144:55

basically what I call the test time

144:56

compute. You get closer to late

144:59

interaction but without adding new model

145:02

at all. Just more work on the same

145:06

embedding model again and again.
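The middle panel can be sketched as follows. This is a minimal illustration, assuming a toy bag-of-words `embed` as a stand-in for the frozen encoder; in a real setup each `embed` call would be one forward pass through the actual embedding model. The example texts are made up.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a frozen encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_vector_score(query, doc):
    """Baseline: one vector per query, one vector per document."""
    return cosine(embed(query), embed(doc))

def max_sentence_score(query, doc):
    """Same encoder, more calls: embed each sentence, keep the best match."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    q = embed(query)
    return max((cosine(q, embed(s)) for s in sentences), default=0.0)

doc = ("The cafeteria opens at nine. Refunds are issued within 30 days. "
       "Parking is free on weekends.")
query = "refunds within 30 days"
print(single_vector_score(query, doc), max_sentence_score(query, doc))
```

In this toy case the whole-document vector is diluted by the two unrelated sentences, while the max over sentence vectors picks out the one sentence that answers the query. No second model and no learned parameters are involved, only more calls to the same `embed`.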

145:10

So let me make the question very strict.

145:12

So how much can a frozen single vector

145:15

embedding model improve at test time

145:18

alone? And by strict I do mean just

145:21

one frozen encoder behind an API and you

145:24

can call it as many times as you want

145:27

but no retraining no second model no

145:29

learned parameters. So the popular

145:32

methods uh all break one of

145:34

those rules. Like HyDE puts an LLM in

145:36

the query path to rewrite the query. GQR

145:39

adds a second retriever and MetaEmbed

145:42

trains new parameters. So we forbid all

145:44

these three rules. We forbid all these

145:47

three things. But even with the

145:49

constraint the search pipeline the

145:51

search space is huge. So how do you

145:53

search that? With auto research, of

145:56

course? So instead of me handcrafting

145:58

these programs, an agent runs the research

146:01

loop by itself. Uh it changes one file

146:05

it runs a short experiment and if the

146:07

metric improves, it keeps the change,

146:09

otherwise it reverts. So it does that

146:11

over and over all night. So it is kind

146:14

of like hill climbing uh but with an LLM as the

146:17

mutation function. So Andrej Karpathy

146:19

from astrobic uh describe it as follows.

146:22

So we are editing a python file in the

146:24

way uh you're not editing a python file

146:26

in the way that a researcher

146:28

would. So you are writing

146:30

markdown files that set up the

146:32

autonomous research org, and that loop

146:35

generates everything that we are about

146:37

to see.

146:39

Uh so here's a whole loop in one

146:41

picture. uh just follow the box from

146:43

left to right. We have a proposer which

146:45

is an LLM agent, writes a program over the

146:48

frozen encoder. We have an evaluator uh

146:50

which scores that program and memory

146:53

logs the result and the registry the

146:55

black box on the far right uh collects

146:57

all of them. So 144 programs one per

147:00

generation. So now see the dashed line, uh

147:02

dashed arrow looping back underneath,

147:04

that's basically the feedback. So memory

147:07

conditions the next program and every run

147:09

builds on the last one.
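The loop described above (propose, evaluate, log, keep or revert) can be sketched as plain hill climbing. This is a minimal sketch under loud assumptions: `propose` is a random tweak standing in for the LLM agent, `evaluate` is a toy objective standing in for the retrieval benchmark, and a "program" is just a pair of numbers. The real system edits Python files over a frozen encoder.

```python
import json
import random

def propose(best_program, memory, rng):
    """Stand-in for the LLM proposer: randomly tweak one number.
    `memory` is ignored here; in the talk it conditions the next proposal."""
    candidate = list(best_program)
    i = rng.randrange(len(candidate))
    candidate[i] += rng.uniform(-1.0, 1.0)
    return candidate

def evaluate(program):
    """Stand-in for the evaluator: a toy score that peaks at (3, -2)."""
    x, y = program
    return -((x - 3) ** 2 + (y + 2) ** 2)

def research_loop(generations=144, seed=0):
    rng = random.Random(seed)
    best = [0.0, 0.0]
    best_score = evaluate(best)
    memory = []  # one row per program, like the JSONL log
    for gen in range(generations):
        candidate = propose(best, memory, rng)
        score = evaluate(candidate)
        kept = score > best_score
        memory.append({"gen": gen, "score": score, "kept": kept})
        if kept:  # keep the change
            best, best_score = candidate, score
        # otherwise revert: `best` is simply left untouched
    return best, best_score, memory

best, best_score, memory = research_loop()
print(best_score, json.dumps(memory[-1]))
```

Because only `evaluate` decides what is kept, the loop optimizes exactly that score and nothing else.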

147:12

So let me quickly go through the four

147:14

pieces. The first up is proposer uh

147:17

which is based on Opus 4.6, used purely

147:20

as a mutation function. It reads the

147:22

current best program and memory file and

147:24

then it edits one Python file to propose

147:26

the next one. So there is no human in

147:29

the loop. Uh now here's the catch. It

147:31

only optimizes the metric that you give

147:33

to it, not the metric you meant. So if

147:36

you reward in domain performance and if

147:38

you reward spending more compute then

147:40

that is exactly what

147:43

it will chase. So whether the

147:45

improvements hold up elsewhere is a

147:47

separate question. So the next one is

147:50

program. It is just an arbitrary Python program

147:52

over uh the encoder and the one piece

147:54

that matters is this embed function. So

147:57

that's a compute budget. So every

147:59

function call there basically re-embeds

148:01

some text or switches the LoRA adapter

148:04

or picks uh smaller dimensions. So one

148:07

call is one unit of compute. Uh there

148:10

are some other constraints, such as the

148:11

program cannot introduce any

148:13

hyperparameters, cannot do task routing,

148:16

cannot add external models of course. So

148:18

those constraints uh force

148:21

the agent to find a task-agnostic program

148:24

instead of a config that's secretly

148:26

optimized for each task.

148:29

Then comes the evaluator. So every

148:31

program runs the same 14 evaluation

148:34

tasks, or discovery tasks, spanning legal,

148:36

financial, long-document, long-context or

148:39

general retrieval problems. We score it

148:42

via delta nDCG against the uh

148:45

cosine baseline plus some cost ratio. I

148:47

will introduce the cost later. Now

148:50

here's a design choice that matters the

148:52

most. The loop only ever sees these 14

148:55

tasks, and there are 19 more held-out tasks;

148:58

the loop will never touch them or see

149:00

them. So later we can ask a very clean

149:02

question: does what wins here uh

149:05

also hold up there? So and the whole gap

149:08

the gap is basically the whole

149:10

experiment.
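For reference, a minimal sketch of the scoring quantity mentioned above: nDCG@k of a program's ranking minus nDCG@k of the cosine baseline's ranking. It uses the common linear-gain form of DCG; the document ids, relevance labels, and rankings are made up, and the cost-ratio term is left out because the talk does not define it at this point.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    gains = [relevance.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical query with two relevant documents.
relevance = {"d1": 1, "d4": 1}
baseline_ranking = ["d2", "d1", "d3", "d4"]  # plain cosine ranking
program_ranking = ["d1", "d4", "d2", "d3"]   # ranking from a candidate program

delta_ndcg = (ndcg_at_k(program_ranking, relevance)
              - ndcg_at_k(baseline_ranking, relevance))
print(round(delta_ndcg, 4))
```

A positive delta means the program ranks relevant documents higher than the baseline does on that query; averaging over queries and tasks gives a per-program score.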

149:12

The last part is memory. So it is a

149:14

simple JSONL uh file with one row per

149:17

program. Each row stores the scores, the

149:19

cost, the parents and a short lesson. So

149:22

the proposer reads this file before every

149:24

round and the whole search compounds

149:26

over time. Uh but compounding

149:29

cuts both ways, right? It builds on a real

149:32

win of course, but it also compounds

149:34

whatever bias uh comes from the objective. And

149:37

a biased metric does not only mislead

149:40

one program, it steers the entire

149:42

family.

149:44

Uh so now let me set up the models that

149:47

we use here. We run the search on the

149:49

single encoder which is the Jina v5 Nano,

149:52

uh only 200 million parameters

149:54

state-of-the-art on multilingual

149:55

retrieval. And we chose Nano as the

149:57

discovery-phase model

149:59

mostly because it

150:01

is small and therefore reduces the cycle

150:03

time of each experiment.

150:06

We hold out the bigger model uh from the

150:08

same family plus the unseen families

150:11

such as Gemma and Qwen models. They

150:14

share no training data, no backbone, no

150:16

tokenizer with the discovery model. We

150:19

also hold out the 19 evaluation tasks, as

150:22

I mentioned before, and those 19

150:24

tasks the loop never sees. So when

150:27

a program gets discovered in this loop, it

150:30

has to generalize over all encoders and

150:33

all 19 tasks.

150:36

So now before showing any result let me

150:38

define the cost of the test time

150:40

compute. It comes down to just

150:43

one number C which is the number of

150:46

extra forward passes through the

150:48

encoder. So let me explain it with two

150:50

cards on the slide. They do the same

150:52

thing: they kind of mix in some

150:54

neighborhood information and then

150:56

rescore it. The card on the left is what

150:59

I call a soft centroid. It averages

151:01

the document uh vectors that you

151:03

already computed. And so there is no

151:05

extra forward passes.

151:08

Uh that means its cost C is just one.

151:12

The card on the right is the first

151:14

sentence. Uh it re-embeds the first

151:16

sentence of the top document, which

151:18

is a brand new forward pass. So there C

151:21

is greater than one. So one reuses the

151:24

geometry that we already have. The other

151:27

spends compute on a new pass over new

151:30

text.
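One plausible reading of the two cards, sketched with a toy encoder that counts its own forward passes. Everything here is an assumption made for illustration: the bag-of-words `CountingEncoder`, the `blend` weight, the sample documents, and the convention that C is the number of query-time passes relative to the baseline's single pass. The programs actually discovered may differ.

```python
import re
from collections import Counter

class CountingEncoder:
    """Toy frozen encoder (bag of words) that counts its forward passes."""
    def __init__(self):
        self.calls = 0

    def embed(self, text):
        self.calls += 1
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def dot(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def blend(u, v, alpha=0.5):
    """Return u + alpha * v as a new sparse vector."""
    out = Counter(u)
    for t, w in v.items():
        out[t] += alpha * w
    return out

def rank(q, doc_vecs):
    return sorted(range(len(doc_vecs)),
                  key=lambda i: dot(q, doc_vecs[i]), reverse=True)

docs = [
    "Refunds are issued within 30 days. Contact support to start one.",
    "Our refund policy covers damaged items. Shipping fees are excluded.",
    "Parking is free on weekends. The garage closes at midnight.",
]
enc = CountingEncoder()
doc_vecs = [enc.embed(d) for d in docs]  # index time: already computed

def soft_centroid(query, top_k=2):
    """Reuse existing document vectors: no extra forward pass."""
    q = enc.embed(query)  # the only pass at query time
    centroid = Counter()
    for i in rank(q, doc_vecs)[:top_k]:
        for t, w in doc_vecs[i].items():
            centroid[t] += w / top_k
    return rank(blend(q, centroid), doc_vecs)

def first_sentence(query):
    """Re-embed new text: one extra forward pass."""
    q = enc.embed(query)
    top_doc = docs[rank(q, doc_vecs)[0]]
    first = re.split(r"(?<=[.!?])\s+", top_doc)[0]
    return rank(blend(q, enc.embed(first)), doc_vecs)

def passes(fn, query):
    enc.calls = 0
    fn(query)
    return enc.calls

c_soft = passes(soft_centroid, "refund policy")
c_first = passes(first_sentence, "refund policy")
print(c_soft, c_first)  # prints: 1 2
```

Both functions mix neighborhood information into the query and rescore, but only the second one pays for a new pass through the encoder.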

151:32

So now that we can price the compute, we

151:34

run that exact same loop under two

151:36

different rubrics. The first is the compute

151:38

rubric. It admits a program only if the

151:41

in domain performance beats every

151:43

program before it. So it is actively

151:45

pushed to spend more compute at

151:48

inference time. The second is the

151:50

transfer rubric. So it keeps a program

151:52

only if it improves over over the

151:55

validation set with nothing getting

151:57

worse and it gets no reward at all for

152:00

spending compute. And to be clear, the

152:02

validation set uh still comes from

152:04

what the loop can see. So neither rubric

152:06

ever touches the 19 final evaluation tasks

152:09

or final held-out tasks uh and unseen

152:12

encoders. So that's two rubrics running

152:15

under the same loop. So let's see what

152:17

each one comes up with.
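The two admission rules can be written as two small predicates. This is a hedged sketch: the function names and the per-task dictionaries are hypothetical, and the transfer rule assumes "improves with nothing getting worse" means a higher total validation score than the reference with no per-task regression.

```python
def admit_compute(in_domain_score, previous_scores):
    """Compute rubric: admit only if it beats every earlier program in-domain."""
    return all(in_domain_score > s for s in previous_scores)

def admit_transfer(candidate_by_task, reference_by_task):
    """Transfer rubric: total validation score improves and no task gets worse."""
    tasks = list(reference_by_task)
    no_regression = all(candidate_by_task[t] >= reference_by_task[t] for t in tasks)
    improves = (sum(candidate_by_task[t] for t in tasks)
                > sum(reference_by_task[t] for t in tasks))
    return no_regression and improves

# Made-up validation scores for two candidate programs.
reference = {"legal": 0.40, "finance": 0.55, "long_doc": 0.30}
cheap_tweak = {"legal": 0.41, "finance": 0.55, "long_doc": 0.33}
heavy_pipeline = {"legal": 0.52, "finance": 0.50, "long_doc": 0.45}

print(admit_transfer(cheap_tweak, reference))     # True: gains, no regression
print(admit_transfer(heavy_pipeline, reference))  # False: finance got worse
print(admit_compute(0.48, [0.41, 0.45]))          # True: new in-domain best
```

Under the transfer rule a program with a large average gain is still rejected if any single task regresses, which biases the search toward small, safe changes.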

152:19

So let's first look at the compute

152:21

rubric. So when you tell it to spend

152:23

more compute, it draws this very

152:25

beautiful clean curve. So the x-axis is

152:28

the compute you spend on a log scale and

152:30

the y-axis is the score. There are in total

152:33

144 programs and 12 of them sit on the

152:35

Pareto front. The cost runs from just

152:38

one uh all the way up to almost 15 times

152:41

and the in-domain score climbs nicely.

152:44

It more than triples across that

152:46

front. So this looks exactly like test-

152:50

time compute scaling: more compute, more

152:53

quality. So if I stop here you will be s

152:57

but this is still in domain performance

152:59

we haven't run this experiment on held

153:01

out uh data set. So let's take a quick

153:04

look at these 12 programs and run them uh

153:07

run them on the held-out uh data.
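For readers unfamiliar with the term, a Pareto front over (cost, score) pairs keeps only the programs that no cheaper-or-equal program matches or beats. A minimal sketch follows; the six (cost, score) pairs are invented for illustration and are not the talk's results.

```python
def pareto_front(programs):
    """Keep (cost, score) pairs not dominated by a cheaper-or-equal program
    with an equal-or-better score. Lower cost and higher score are better."""
    front = []
    # Sort by cost ascending; on ties, put the higher score first.
    for cost, score in sorted(programs, key=lambda p: (p[0], -p[1])):
        if not front or score > front[-1][1]:
            front.append((cost, score))
    return front

# Hypothetical (cost C, in-domain score) pairs.
programs = [
    (1.0, 0.010), (1.5, 0.020), (2.0, 0.015),
    (4.0, 0.030), (4.0, 0.025), (15.0, 0.035),
]
front = pareto_front(programs)
print(front)
```

Each point on the front is the best score available at or below that cost, which is why the front always rises from left to right.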

153:11

So here are the 12 programs drawn as a

153:13

little diagrams. So you

153:15

don't have to read into each one. The

153:17

only thing that I want you to take away

153:19

from this is that they are all training

153:22

free recombinations of the same frozen

153:26

embedding model: just chunking, scoring,

153:30

feedback and fusion. The cost climbs

153:33

nicely steadily from left to right and

153:36

it does look like a clean uh scaling story,

153:38

but the improvement on the held-out data set,

153:40

as you will see is not.

153:44

So now we run those 12 programs on the

153:46

held out data set and same chart as

153:48

before. Compute runs from the left to

153:50

right and scores runs up and down. So

153:53

the dashed line across the middle is a

153:55

baseline, and look at the pink line. It's

153:58

uh the compute rubric. It's basically

154:00

flat hugging zero all the way out. So

154:02

out of domain more compute buys you

154:05

essentially nothing. Now look at the

154:07

blue dots which is the transfer

154:08

programs. They all sit on the left

154:10

because they are cheap and everyone is

154:13

above the pink line. So the cheapest one

154:16

only has like zero extra compute, and it

154:19

still beats the most expensive program. So

154:22

more compute did not transfer; the cheap

154:24

structure did.

154:27

So if we plot every program against

154:29

every held out uh task we get this heat

154:32

map. Uh the four blocks are the four

154:34

encoders and three of them we have never

154:36

seen in the discovery phase. In each

154:39

block the rows are the programs and the

154:41

column are 19 evaluation task. Green

154:43

means an improvement. Uh red or pink

154:46

means a drop. The picture is generally

154:49

mixed. Compute helps about half of the

154:51

cells but improvements are uneven. So on

154:54

on average it comes out flat. Compute

154:57

does help in places but it doesn't help

154:59

reliably across all new tasks and

155:02

all new encoders.

155:05

So now let's look at these uh look at

155:07

the other rubric the transfer rubric. It

155:09

picks six completely different

155:11

programs and they are all very cheap, at

155:13

most one and a half times uh the

155:16

compute of the cosine baseline. The

155:18

best one wins 83% on the held out data

155:21

set and it never loses on a single task. So

155:24

now what do these programs uh

155:26

actually do? So they only touch the

155:28

query and document vectors that you

155:30

already have and they add a little cheap

155:32

math on top of that. Some nudge the

155:35

query towards the documents it already

155:37

likes. Uh some pick a few directions and

155:40

uh in the space and rescore uh along

155:42

those directions. So they are very small

155:45

structural changes but enough to pull the

155:47

document uh the right document up. So

155:50

it's all recombination, no new models,

155:54

and this really transfers across

155:56

models and languages. Remember in the

155:58

discovery phase we only used Jina

156:00

embeddings, Jina v5 Nano, but the

156:03

improvement is positive across all four

156:05

encoders and the biggest bar is on the

156:07

Gemma and the Qwen. So those are the two

156:10

families it never sees. So this isn't

156:13

some quirk of one model. It generalizes

156:15

on general embedding geometry.

156:18

So that was version a frozen encoder

156:22

with very cheap structure uh and it

156:25

scales, but raw compute uh doesn't scale,

156:28

and auto research is how we found that

156:30

but let me move one level up uh from the

156:33

model layer to the search pipeline and

156:36

you will see the same test time compute

156:38

reflected at the pipeline level. In 2025 we

156:42

had this deep research uh and agentic

156:44

search product uh which was basically

156:47

just one loop over the uh open web. In

156:49

2026, we moved to a long horizon task

156:52

which adds implementation, sandbox, evals

156:55

on top of the retrieval and runs for

156:57

hours. So both patterns need more

157:00

looping and more compute at test time.

157:03

So to study this agentic search at test time,

157:06

Uh I built three open source projects

157:08

for that. The first one is data room. So

157:11

you give a token budget, it searches, it

157:13

reads, it writes, over and over until

157:16

it packages everything into a zip file.

157:19

So I call it data room because it

157:20

somehow reminds me of preparing the

157:22

data room for the investors uh back when

157:25

I was a founder. So that zip file

157:28

distills the corpus, uh you

157:31

can imagine this zip file as a distilled

157:33

corpus of the open web ready for the

157:35

next agent or large language model to

157:37

consume. And notice the token economy

157:39

here. So you are basically exploring the

157:42

web and building a corpus using very cheap

157:44

tokens from small language models and

157:46

then you save the expensive frontier

157:48

tokens for later for exploitation.

157:51

The second one is search box. So this is

157:54

a test bed to study agentic search and

157:56

tool calling. It is designed

157:58

to be air-gapped. So the agent has no

158:01

internet access. It's basically like you

158:03

lock the agent in a room or in a box and

158:06

you give it a data room and ask question

158:08

about it. So to answer those questions,

158:11

the agent has to assemble a search

158:13

pipeline at test time. A pipeline made

158:16

of local tools, things like grep, embed,

158:20

rerank. And this allows you to explore

158:23

some very interesting research questions

158:25

such as uh which tool does the agent

158:27

reach for first, or is grep all you need,

158:30

or does forcing more compute help on hard

158:33

questions or will the agent build up a

158:36

search pipeline that it will reuse

158:39

later. So search box is a test bed to

158:42

explore those research questions.

158:45

So but how do you evaluate uh agentic

158:47

search like that? Uh well you need hard

158:49

questions. Uh that's basically the third

158:51

project is knowledge graph. So it turns

158:54

a corpus or data room into a knowledge

158:56

graph and every fact becomes an edge

158:59

linking a subject to an object. Then

159:02

we can walk the longest paths through

159:05

that graph and those long chains become

159:07

multi-hop questions that no single

159:10

passage can answer. So the agent has to

159:13

spend more test time compute connecting

159:15

the facts to get there.

159:18

So it's also the tool for building a

159:20

private verifier.
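The idea of turning long fact chains into multi-hop questions can be sketched like this. It is a minimal illustration, not the project's code: the triples are made up, the question template is hypothetical, and the exhaustive depth-first search is only practical on small graphs because finding longest simple paths is expensive in general.

```python
from collections import defaultdict

def longest_chain(triples):
    """Return the longest simple path as a list of (subject, relation, object)."""
    graph = defaultdict(list)
    for s, r, o in triples:
        graph[s].append((r, o))

    best = []

    def dfs(node, path, seen):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:  # never revisit a node on the same path
                path.append((node, rel, nxt))
                seen.add(nxt)
                dfs(nxt, path, seen)
                seen.remove(nxt)
                path.pop()

    for start in list(graph):
        dfs(start, [], {start})
    return best

# Each fact is one edge from a subject to an object (made-up data).
triples = [
    ("Ada", "works_at", "Acme"),
    ("Acme", "headquartered_in", "Berlin"),
    ("Berlin", "located_in", "Germany"),
    ("Ada", "born_in", "Paris"),
]
chain = longest_chain(triples)
relations = " -> ".join(rel for _, rel, _ in chain)
question = f"Starting from {chain[0][0]}, follow {relations}. Where do you end up?"
answer = chain[-1][2]
print(question, answer)
```

Since each hop in the chain comes from a different fact, no single passage contains the answer, and the known end of the chain can serve as the ground truth when checking an agent's response.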

159:23

So let's connect all the dots together.

159:26

So I introduced two versions of test

159:28

time compute for search. Both versions

159:31

are doing the same thing. They are

159:33

spending more compute at test time and

159:35

neither of them grows the model. In

159:39

version A, we found a special embedding

159:42

algebra over the uh fixed uh frozen

159:45

embedding that improves the search

159:46

relevance. In version B, we build a full

159:49

stack to find the best search pipeline.

159:52

We use a data room to maximize recall.

159:54

We use a search box to maximize

159:56

precision. And then we use knowledge

159:58

graph to build evaluation. So finally,

160:01

it gives us a pipeline with strong

160:03

search relevance. It is basically two

160:06

different levels, but they share the

160:08

same bet. Spending more test time

160:10

compute, not a bigger model.

160:14

So finally let me let me leave you with

160:16

a big picture. Search is test time

160:20

compute. So don't reach for bigger

160:22

model. Do more search at inference

160:24

instead. You don't have to do this

160:26

design by yourself by hand. Uh auto

160:29

research helps you discover this

160:31

probably overnight. Uh so and this is

160:34

how we scale the test time compute. And

160:37

that is basically the end of my talk.

160:39

Uh you can grab all the slides from the

160:41

QR codes here. There's a paper and

160:43

projects on my GitHub and arXiv. And

160:45

if you are uh if you are around this

160:47

evening, Elastic is also holding a

160:50

hacker zone in town. So the QR

160:53

code is right there. Uh so come

160:55

and uh build with us. Thank you so much

160:57

and happy AI engineering.

161:10

In 2026, coding agents will quietly

161:13

retire their first software platform.

161:16

Not because it's bad, simply because the

161:19

platform is unnecessary.

161:22

I am Dominik Tornow. I am founder and CEO

161:25

of Resonate. Resonate is a durable

161:28

execution platform built with minimalism

161:31

and simplicity as its core technical

161:33

values and these properties will play a

161:36

central role in this talk. At Resonate

161:40

we have a working theory of where software

161:42

engineering is headed.

161:45

General-purpose implementations will

161:48

increasingly be replaced by bespoke

161:50

implementations

161:52

generated on demand not as a new

161:55

library, a new framework or a new

161:57

platform but as a minimal extension of

162:00

the infrastructure that is already in

162:02

place.

162:05

If this theory holds true, reuse will

162:08

move upstream.

162:10

Instead of reusing a general purpose

162:13

implementation, we will reuse a

162:15

specification and we will derive a

162:18

bespoke implementation from it.

162:23

In fact, we can build many bespoke

162:26

implementations

162:27

tailor-made for the infrastructure that

162:30

is already in place. We just have to ask

162:33

the agent. At this point, the prompt is

162:37

a platform.

162:40

Resonate is a durable execution platform.

162:43

We have an implementation of the

162:44

Resonate server. We have implementations

162:47

of the Resonate SDK for TypeScript,

162:49

Python, Rust, Go and Java. So, we have

162:53

to ask what does this new reality mean

162:57

for us?

162:59

If implementations become generatable,

163:02

where does our value live?

163:04

And our answer: our value moves from

163:08

implementation to specification.

163:13

Now this changes how we think about

163:15

Resonate. The product is no longer the

163:18

implementation. The product is the

163:20

specification, the protocol.

163:23

And from that protocol we want to derive

163:26

multiple server implementations.

163:29

One is a general-purpose Resonate

163:31

server, our reference implementation.

163:34

Others are implementations built with

163:37

infrastructure partners.

163:39

For customers and partners, this means

163:41

durable execution right on top of their

163:44

existing infrastructure with minimal

163:46

additional dependencies.

163:48

So the question is no longer can we

163:50

build a server. The question is can we

163:54

repeatedly synthesize trusted servers

163:56

from the same specification

163:59

and if so how?

164:05

When we talk about agentic engineering,

164:08

we focus all of our attention on

164:10

verification.

164:11

How do we know the result is correct?

164:15

But today, I want to focus on the

164:17

specification instead and more

164:20

importantly, how can agents participate

164:24

in specifying the system, not just

164:26

building or verifying it.

164:30

Now, Resonate is partnering with

164:32

multiple infrastructure providers to

164:34

bring durable execution natively to

164:36

their technology stack. One of them is

164:39

Synadia, the company behind NATS.io, an

164:42

open-source messaging system designed

164:45

for building modern distributed systems.

164:48

For the rest of this presentation, we

164:50

will use Resonate on NATS.io to explore

164:53

our agentic engineering practices. How

164:56

do we go from specification to

164:58

implementation?

165:00

First, we need to level set our mental

165:03

model.

165:05

This picture is a common view of agentic

165:08

coding. There's an agent, there's a

165:11

specification, and then there's an

165:13

implementation.

165:15

And for many applications, that is

165:17

enough.

165:18

But it is not enough for what we are

165:20

trying to do

165:22

because we are not trying to generate

165:24

one implementation from a specification.

165:30

We are trying to generate multiple

165:33

target-specific implementations from the

165:35

specification.

165:37

So the specification must not take any

165:40

aspect of an implementation into

165:42

account.

165:44

The specification must not assume a

165:46

concrete database schema or concrete

165:48

indices.

165:50

The specification must not even assume a

165:52

relational database with tables and

165:53

transactions at all. It must not assume

165:56

a key value store. It must not assume

165:58

weak consistency. It must not assume

166:00

strong consistency.

166:03

The specification must be abstract.

166:07

Only the implementation must be

166:09

concrete.

166:12

So we ask the agent to follow the

166:14

abstract specification and generate a

166:16

concrete implementation.

166:18

Specifically, at first we asked the agent:

166:22

build a Resonate server in Rust on top

166:24

of Postgres,

166:27

and the agent failed.

166:31

The gap between the abstract

166:33

specification and the concrete

166:35

implementation was too large.

166:38

The agent generated a system that worked

166:40

on the happy path. It passed the basic

166:43

tests, but it was not correct. It broke

166:47

on the concurrency. It broke on the

166:49

process failure. It broke on the network

166:51

failure. The implementation was closer

166:54

to a prototype, but not a production

166:56

system.

167:00

So, we amended the process. Instead of

167:02

asking the agent to jump directly from

167:05

abstract spec to concrete

167:06

implementation, we inserted an

167:09

intermediary artifact, the concrete

167:12

specification.

167:14

That concrete specification was derived

167:17

interactively with the agent. But the

167:20

human was the main driver.

167:23

For Postgres, that meant making

167:27

target-specific decisions explicit: the data

167:30

schema, the indices, the SQL queries,

167:33

the transaction boundaries.

167:35

Once those decisions were written down,

167:38

the agent was indeed able to implement

167:40

the production system. So this worked,

167:44

but it also revealed the limitations.

167:48

The agent helped us build the system,

167:51

but the agent did not help us design the

167:54

system.

167:55

And if the specification is a reusable

167:58

product, then that's not enough.

168:01

Now the next step is obvious. Agents

168:04

have to move upstream.

168:07

But how?

168:12

When we started building Resonate on

168:13

NATS.io, we changed the question.

168:17

We did not ask can the agent build the

168:20

production system. Instead, we asked: what

168:23

does the agent need in order to design

168:26

the system first and build the system

168:29

second.

168:31

So we gave the agent access to a

168:32

deterministic simulation environment.

168:36

And we gave it a different task.

168:39

Do not build the production system.

168:42

Build a simulated implementation.

168:46

The simulated implementation is not the

168:48

product.

168:50

It is executable design.

168:53

Its purpose is to discover the correct

168:56

algorithm under partial order and under

168:58

partial failure. And once these

169:00

algorithms are discovered, tested and

169:02

verified in simulation, then we ask the

169:05

agent to write the concrete

169:07

specification.

169:08

And only then do we ask the agent to

169:10

write the production implementation.

169:14

So the process becomes abstract

169:16

specification,

169:18

simulated implementation,

169:20

concrete specification and then concrete

169:22

implementation.
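
To make the simulation stage concrete, here is a minimal sketch of a deterministic simulation in Python. It is an illustration only: `PromiseStore`, `simulate`, the message format, and the drop and duplicate rates are all hypothetical and are not Resonate's protocol or tooling. A seeded random number generator decides every reorder, loss, and duplicate, so any failing schedule can be replayed from its seed.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PromiseStore:
    """Toy stand-in for a durable promise store (hypothetical, not Resonate's API)."""
    state: dict = field(default_factory=dict)  # promise id -> resolved value

    def resolve(self, pid: str, value: str) -> str:
        # First write wins; later duplicates or conflicting writes are ignored.
        return self.state.setdefault(pid, value)


def simulate(seed: int, steps: int = 50) -> dict:
    """Run one schedule; the seed fixes every reorder, drop, and duplicate."""
    rng = random.Random(seed)
    store = PromiseStore()
    # Two workers race to resolve the same five promises with different values.
    inbox = [(f"p{i}", f"worker{w}") for i in range(5) for w in range(2)]
    first_seen = {}
    for _ in range(steps):
        if not inbox:
            break
        # Partial order: messages are delivered in an arbitrary order.
        msg = inbox.pop(rng.randrange(len(inbox)))
        roll = rng.random()
        if roll < 0.2:
            continue            # partial failure: the message is lost
        if roll < 0.4:
            inbox.append(msg)   # the message will be delivered again (retry)
        pid, value = msg
        result = store.resolve(pid, value)
        first_seen.setdefault(pid, result)
        # Invariant: once resolved, a promise never changes its value.
        assert result == first_seen[pid], f"seed {seed}: {pid} changed value"
    return store.state


if __name__ == "__main__":
    for seed in range(1000):
        simulate(seed)
    # The same seed always replays the same schedule.
    assert simulate(42) == simulate(42)
    print("invariant held for 1000 seeded schedules")
```

The design choice worth noting is that the algorithm under test (`resolve`) is tiny, while the simulator explores many orderings and failures around it; that separation is what lets a failing seed serve as a reproducible bug report.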

169:25

This is the point where the agent moves

169:27

upstream.

169:28

Humans are still involved in the design

169:30

process, but now the agent is the driver.

169:37

Two ingredients make this possible.

169:40

Minimalism and simplicity.

169:43

Unfortunately, minimalism and simplicity

169:45

are not the starting point. They are the

169:47

finish line. We spent three years making

169:50

the protocol smaller and simpler. Every

169:53

time we ran into a problem, we asked,

169:54

"What can we take away? What abstraction

169:57

can we erase? What property can we

170:00

remove? What relationship can we break?"

170:04

The result is a very small protocol

170:06

centered around two objects, a durable

170:09

promise and a durable task.

170:12

That simplicity matters because even

170:15

simple concurrent distributed protocols

170:17

have a complex state and behavior space.

170:20

So in other terms implementing even

170:23

simple protocols on top of a few simple

170:26

primitives is tough.

170:31

Let's make this concrete with NATS. NATS

170:35

gives us a

170:45

Hello, welcome. Uh, this is a big room,

170:48

so you're if you're in the back, don't

170:50

hesitate to come closer. Um, my name is

170:53

Stefania Druga. I'm a research scientist

170:55

at Sakana AI in Tokyo. Uh, I used to be

170:59

based here and AI engineering is home

171:02

community for me before being the

171:04

hyperloop. So it's very good to be back

171:06

and today I'm going to talk to you about

171:08

memory harnesses for long-running research

171:12

agents on device.

171:15

So if you work with long horizon tasks,

171:19

you probably run into this issue of

171:20

context bloat, right? Like when the model

171:24

starts contradicting itself or it has to

171:27

redo the work because it forgot it did

171:29

that task in the first place or it

171:31

starts to drift from your questions

171:33

because it forgot them. And this

171:37

matters now more than ever because from

171:40

these recent projections from METR, we

171:43

see that the trend is to solve longer

171:46

and longer uh horizon tasks and also

171:49

that we're getting fewer and fewer model

171:51

releases. So at some point later this

171:53

year we're going to have this

171:54

convergence right where we'll get many

171:57

more long-term horizon tasks and fewer

172:00

model releases. So that makes this issue

172:03

of dealing with context rot a priority.

172:09

And why did I want to tackle this

172:11

problem on local models and with a local

172:14

harness? Uh maybe some of you have seen

172:16

this tweet. It's only two days old. Uh

172:19

the CEO of Coinbase actually shared how

172:21

their company managed to reduce their AI

172:25

spend while actually increasing, uh, the

172:28

AI usage. And the way they did that was

172:31

by transitioning to use many more local

172:34

models but also having better practices

172:37

like using better routing, better

172:39

caching, keeping the context clean and

172:43

then having better visibility for what

172:45

people are using and for what uh what

172:48

kind of task. So we are seeing the local

172:51

models like crossing the line, right?

172:53

Like GLM is on everyone's minds like

172:56

especially with Fable going away. uh

172:59

DeepSeek V4 Flash can now be run on, uh, M3

173:04

Ultra and there's still a bottleneck for

173:07

RAM. It's tricky, but these local models

173:11

are starting to be useful for agentic

173:13

tasks and for tool use. So, I wanted to

173:17

show you what has been my setup for the

173:19

experiments I'm going to share with you

173:20

today. Uh this this is my Mac. It's

173:24

still running evaluations right now uh

173:26

back in my desk in Tokyo and I'm

173:28

controlling it from my phone. Um and

173:31

after running evals non-stop for a

173:33

couple of days, it started to get hot.

173:36

So I had my husband put fans around it.

173:39

Um we're running out of fans, but the

173:41

the machine is still running and the

173:43

evals are still giving results. Um, on

173:46

this M3 Ultra with 96 gigabytes and 28

173:50

CPU cores, I'm using two models. I'm

173:54

using the Qwen 27B quantized at 4-bit and

173:58

the DeepSeek V4 Flash.

174:02

And before I show you how I built the

174:04

memory harness on this machine, I wanted

174:07

to tell you what this is an

174:09

example of, right? Like memory. When we

174:12

design a harness for memory, this is the

174:15

mental model I want you to have in mind.

174:17

Um, you can think of memory as a

174:19

write-manage-read loop. So, it's not just the

174:22

database store. It's actually this

174:24

control loop around the model.
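
As a rough illustration of memory as a control loop rather than a store, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Memory` class, its three lists, the substring match in `read`, and the `recall_limit` budget are hypothetical and are not the harness from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical three-block memory: core, recall, archival."""
    core: list = field(default_factory=list)      # always shown to the agent
    recall: list = field(default_factory=list)    # candidates for retrieval
    archival: list = field(default_factory=list)  # kept for later sessions
    recall_limit: int = 3

    def write(self, step: int, note: str) -> None:
        # Write: record what happened at this step.
        self.recall.append((step, note))

    def manage(self) -> None:
        # Manage: move the oldest entries to the archive once recall is over budget.
        while len(self.recall) > self.recall_limit:
            self.archival.append(self.recall.pop(0))

    def read(self, query: str) -> list:
        # Read: core entries plus recall notes that mention the query text.
        hits = [note for _, note in self.recall if query.lower() in note.lower()]
        return self.core + hits


def run(model, memory: Memory, questions: list) -> list:
    """Control loop around a stateless model: read, call the model, write, manage."""
    answers = []
    for step, question in enumerate(questions):
        context = memory.read(question)
        answer = model(question, context)
        memory.write(step, f"{question} -> {answer}")
        memory.manage()
        answers.append(answer)
    return answers


if __name__ == "__main__":
    # Stand-in model: it only reports how many context items it was shown.
    def echo(question, context):
        return f"{len(context)} items in context"

    mem = Memory(core=["goal: literature review"])
    print(run(echo, mem, ["retraction", "retraction", "citations"]))
```

The model itself holds no state between calls here; whatever it "remembers" arrives through `read`, which is the sense in which all the memory comes from the harness.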

174:27

More concretely, how did I take that

174:29

loop and customize it? So, this is my

174:31

harness design. Like I started with

174:33

research agents that are the small

174:35

agents because they have zero durable

174:37

memory and I wanted all the memory to

174:39

come from the harness. And then um in

174:42

the middle I have a core which is always

174:44

shown to the agent, um, of traces. And

174:48

then I have a recall block where I'm

174:51

testing different modes and an archival

174:54

block where I'm keeping track of

174:57

information across different um

174:59

sessions. And in that recall block I'm

175:02

actually going through a ladder of modes

175:05

that I'm testing. The baseline is like

175:08

not to use memory at all. No recall at

175:10

all. So I'm I'm testing for that. Uh

175:13

next is to use RAG, vector RAG, um,

175:17

just to see whatever like the harness

175:20

would pull in terms of similarity.

175:23

Next is to use a decision ledger,

175:26

where I actually keep track of what

175:28

decisions are being made for every turn

175:30

and then I can prioritize them. And last

175:34

but not least and this piece is very

175:35

important. I have what I call an

175:37

oracle, but basically this is the ground

175:39

truth. So this is like telling the

175:43

harness for every loop what the correct

175:46

memory that needs to be retrieved is.

175:51

And the model is fixed across all the

175:53

different tasks. So the only thing that

175:55

I'm changing is, like, these different

175:56

variables in the recall block.
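
The ladder of recall modes can be sketched as one function with a mode switch. This is a toy under stated assumptions, not the harness from the talk: the entry fields (`step`, `text`, `priority`), the mode names, and the bag-of-words `cosine` (standing in for real embedding similarity) are all hypothetical.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity, a stand-in for embedding similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(mode: str, query: str, entries: list, gold_step=None, k: int = 1) -> list:
    """Return up to k entries under the given recall mode."""
    if mode == "off":
        return []  # baseline: no recall at all
    if mode == "vector":
        # Similarity only: whatever looks most like the query.
        ranked = sorted(entries, key=lambda e: cosine(query, e["text"]), reverse=True)
        return ranked[:k]
    if mode == "ledger":
        # Recorded decisions ranked by priority first, similarity as tie-breaker.
        ranked = sorted(
            entries,
            key=lambda e: (e["priority"], cosine(query, e["text"])),
            reverse=True,
        )
        return ranked[:k]
    if mode == "oracle":
        # Ground truth: hand back the entry from the known-correct step.
        return [e for e in entries if e["step"] == gold_step]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    entries = [
        {"step": 124, "text": "decision: use table orders_v2 for revenue", "priority": 2},
        {"step": 480, "text": "which table for revenue was asked again", "priority": 0},
    ]
    query = "which table for revenue"
    for mode in ("off", "vector", "ledger", "oracle"):
        steps = [e["step"] for e in recall(mode, query, entries, gold_step=124)]
        print(mode, steps)
```

In this constructed case, similarity alone returns the later restatement of the question, while the ledger returns the recorded decision. That outcome is built into the toy data and says nothing about the measured results; it only shows how the same corpus can yield different context under different recall policies while the model stays fixed.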

176:00

And I wanted to to give you an example

176:02

of a first task that I tested. So

176:05

I wanted to see if I give the agent a

176:08

task of doing literature review and I'm

176:12

including a lot of papers in the corpus

176:14

where there was a big scientific claim

176:16

like this is actually a nature paper

176:18

where they said they discovered

176:22

742,000

176:24

promising materials like it was a very

176:26

big claim which got retracted later but

176:29

the retraction is a much smaller

176:32

needle in the haystack of that corpus

176:35

than the headlines and the citations. So

176:38

I wanted to see if the system can

176:42

retrieve the right answer for these types

176:45

of questions. And what I found was

176:48

because like for these tasks all the

176:51

papers and all the information fit into

176:54

the context, the memory actually didn't

176:57

add more capability. It was the same

177:00

performance with memory and without

177:01

memory and it only added more cost. So

177:05

when your task fits in context, the

177:09

harness doesn't add much.

177:11

However,

177:13

if I start to run tasks that are longer

177:16

term horizon and the entire task and the

177:20

relevant context doesn't uh fit, then

177:24

having a good memory harness really

177:26

starts to pay off. So this is another

177:28

example of a task that I ran. This is

177:30

actually from an established benchmark

177:32

for, uh, long-horizon task memory. It's

177:35

called Xbench.

177:37

And this is an example of a question,

177:39

right? So I'm asking a question and then

177:43

like, the right answer is in, like, step

177:48

124. But the moment when I ask the

177:51

question, I'm asking it like at step

177:54

500. So it's completely outside of the

177:57

context window and the model needs to

178:00

use the memory harness to retrieve the

178:02

specific answer from the right step. So

178:07

I'm testing this by uh changing the

178:10

different policy ladder that I explained

178:12

before with memory off uh by deploying

178:15

recall different types of recall and by

178:18

using the oracle as a reference.

178:21

And what I found was that with the

178:24

ranked recall, the model gets the right

178:27

answer um more frequently than without.

178:31

And here is a breakdown of the

178:33

decomposition of performance on these

178:35

Xbench tasks. So I ran over uh 68

178:40

questions. And for each of these

178:42

questions, there were like multiple

178:45

um cells and lots of different seeds.

178:49

And what I found was that the rank-only

178:53

ledger performed the best

178:56

and it performed better than like just

178:58

gating

179:00

the harness by saying do you need to use

179:02

memory or do you not need to use memory

179:05

and you're probably going to ask like

179:07

why is the oracle not hitting like the

179:10

max and I'm going to explain that too.

179:12

So what the oracle does is it provides

179:14

the right information the right memory

179:16

to the model but it doesn't force it to

179:20

use it. So the model can get the right

179:23

memory but still retrieve the wrong

179:24

information or choose to ignore it or be

179:27

confused. So that's why the oracle in

179:28

this case doesn't hit the max

179:30

performance. And I've done lots of

179:32

ablations on these tasks to see like

179:36

what happens if I give arbitrary

179:39

um examples. What happens if I give it

179:42

the wrong step? What happens if I give

179:44

it the most recent step? And I still

179:47

found that the best performing

179:50

condition was the one with the ranked

179:52

policy for recall.

179:55

And this actually works on several

179:58

models, not only on the Qwen 27B, but

180:01

also on the DeepSeek V4 Flash. And it also works

180:04

across different benchmarks. I also

180:06

tried it on the Spider V2 benchmark.

180:09

And it's not just that it gives you

180:11

better recall, it actually costs less.

180:14

So maybe a good heuristic to have here is

180:17

that bad memory is expensive because it

180:21

spends more tokens and it can send the agent

180:24

the wrong way. But having like a good

180:27

structural policy for recall can save

180:30

you a lot of tokens and uh budget.

180:35

So one thing that I want to encourage

180:37

you from this experiment is to consider

180:40

the recall policy as a first class

180:42

metric and to start to think about how

180:45

you might use it in your systems. Like

180:49

what are the type of memories that you

180:51

want to store? How do you rank

180:54

them? Like how do you design your recall

180:56

function?

180:57

And then, um, what

181:00

survives when you run this over and over

181:02

and over, um, across multiple sessions,

181:05

multiple runs

181:07

and this is just a simple first kind of

181:11

experiment. Um but the memory technique

181:14

landscape is very rich. Um, so there's

181:16

over 30 runnable cookbooks that are

181:19

shared in this open-source repository

181:23

from um, Diamond and memory is complex.

181:28

We have short-term, long-term different

181:29

cognitive techniques. Uh, we can

181:32

start to use evaluation results as well.

181:35

Um, and right now there's actually a

181:37

pretty broad landscape of solutions,

181:39

right? So going from simple file system

181:43

retrieval to training memory models

181:47

um there's there's a wide spectrum of

181:50

solutions from less structured to

181:52

completely structured. Um so I think

181:54

there's a lot of research we're going to

181:57

see in this space. uh it's important um

182:00

it becomes more and more relevant and

182:02

for me it's been super fun to test

182:05

this on local models

182:08

um because I got to control everything.

182:10

I got to control the data I was using

182:13

the entire traces of compute and

182:15

evaluations

182:17

and um yeah I I see that as an example

182:20

of sovereignty and it comes at a cost.

182:24

Uh I didn't tell you that these local

182:26

models, I can only, uh, run them in

182:28

serial like they don't support batch

182:30

querying for the DeepSeek V4 Flash. So

182:33

that's why I am still running

182:35

evaluations back on my computer in Tokyo

182:37

or I I was doing it on the flight on my

182:39

way here because it takes a long time.

182:42

Um, but I still think it's very powerful

182:44

and it's a very good test for what

182:47

memory can do when you can control every

182:50

single step of the pipeline. And this

182:53

sovereign capability is part of a bigger

182:55

ecosystem that is very important for us

182:57

at Sakana AI in Japan. Um, we believe in

183:01

the importance of sovereign AI today

183:03

more than ever. And we are also hiring.

183:07

So, if you're interested and want to

183:09

hear more about this and if you want to

183:11

come join us in Japan, come talk to me.

183:13

Uh, thank you very much.

183:37

Hi everyone,

183:40

I am Bash and today I will talk to you

183:42

about what is the last thing that AI

183:45

will take away from us as people in the

183:48

software business. So at a point where

183:52

writing code is no longer the

183:54

bottleneck, the real thing is

183:57

figuring out what it is that you should

183:59

be building.

184:01

Um, and that comes down to people

184:04

skills and being able to work the room

184:07

because you can't prompt the room, you

184:09

can prompt your AI.

184:12

So at the beginning of the year we held

184:14

an internal hackathon uh where we had

184:17

about 21 agent ideas, and 17 of

184:22

those were abandoned because they

184:23

actually created no uh business value.

184:27

We either didn't have, uh, data

184:31

access or it just didn't make sense, uh,

184:34

to build it. And the remaining four were the

184:37

ones that actually had a very big impact

184:40

on how we work today. And it's a

184:44

very good example of

184:46

just making sure that we are building

184:49

what is worth building. And throughout

184:52

my career in the past 13 years, I've

184:55

always been uh the bridge between

184:58

business and IT and developers. Um I

185:03

started, well, initially testing

185:06

functional design specifications,

185:10

and then I wrote them, and as

185:13

a functional consultant I worked with

185:15

large ERP and CRM programs in the US and

185:18

the UK and then I founded Visual Labs

185:20

and essentially I trained my team on

185:24

how to elicit those requirements in a

185:27

way uh that we can turn them into good

185:31

uh, specifications for developers to

185:34

build, for consultants to configure, and

185:37

most recently uh for AI to build. And

185:40

what's not really changed over the years

185:43

is how we interact with our customers,

185:45

how we interact with systems, how we

185:47

interact with AI is very much changing.

185:50

Um, and that's, uh, the big

185:53

thing now. Uh but if you can read the

185:56

room, if you can elicit the right

185:58

requirements, uh then you will be able

186:01

to build more valuable software.

186:04

And essentially the big shift over

186:07

the past two to three years was that

186:09

getting access to code and being able to

186:12

build is no longer the bottleneck to the

186:14

software development life cycle. Now the

186:18

real bottleneck is getting your people,

186:20

your stakeholders, your decision makers

186:22

into the room and being able to access

186:25

them and elicit the requirements and

186:27

being able to spend the time with them.

186:29

So that's the real

186:31

bottleneck figuring out what it is that

186:34

should be built because you can prompt

186:36

your code, you can prompt your AI, you

186:38

can prompt your whole specification, but

186:40

you can't prompt your room. And

186:44

what a model can't do is very similar to

186:47

Henry Ford's analogy of, uh, what he

186:51

said about asking his users or his

186:54

customers. If he'd asked them what it is

186:56

that they needed, they would have said

186:58

they needed more horses. But in reality,

187:01

he built a car and he made a very big

187:03

success of it. So if you're just using

187:06

AI, uh, to build things

187:10

better, um the chances are that you are

187:14

replicating what already exists because

187:16

AI by definition is coded to give you

187:20

the most common answers. So for us,

187:23

the real job is to make sure that AI

187:26

moves away from that average into what

187:29

is better for us so we can just get to

187:33

uh not a faster horse but actually

187:35

produce a car that's an order-of-magnitude shift

187:38

better than what we had. So it's really

187:42

an interesting world where, uh, being

187:46

able to write good code is no longer uh

187:49

the most important skill to have. Uh

187:53

actually the real skill now is becoming

187:55

the analyst toolkit, uh, which is

187:59

things like story mapping, business

188:01

model canvas, uh value canvas and those

188:04

good old things that we are so

188:06

used to using as functional consultants,

188:08

business analysts

188:10

um, or, uh, in the world of design

188:14

thinking. So I'd like to zoom in on

188:18

story mapping because that's the

188:20

skill set that I found the most

188:23

valuable. So uh once you have the story

188:26

map with the backbones and understand at

188:29

each step what your customers your users

188:32

are doing that would give them the

188:35

ability to, uh, move forward, uh, in

188:39

their processes. So, uh, here's a

188:43

support system user story map:

188:46

contacting, triaging, resolving, and then

188:49

essentially closing a case. Uh with this

188:52

uh you can understand different stages

188:55

of the process uh and then capture the

188:58

user stories beneath them. It is

189:00

intended to stay at a fairly high level.

189:02

So you can get a uh a big picture and

189:05

then you can decide, uh, what it is

189:08

that you want to build in release one

189:10

like capturing intent, classifying

189:12

urgency, drafting a grounded answer and

189:14

then logging it to a system of

189:17

record. That's essentially your MVP.

189:20

Those are the first things that you'd

189:21

want to build and those are your first

189:23

four user stories. And beneath those

189:26

you've got the, uh, second set of

189:30

user stories like reading a sentiment,

189:32

routing to a team, suggesting next

189:34

action, chatting, checking satisfaction,

189:36

so on and so forth. Uh those will be

189:38

part of your backlog. So what would

189:41

allow you to

189:43

uh to get really good uh agentic results

189:48

is by honing in on these user stories

189:51

and making sure that you use these user

189:53

stories as a means uh to elicit

189:57

discussions with your stakeholders with

190:00

your business and then work out what

190:02

that user story should really be about.

190:04

So the first user story uh second user

190:07

story would be: as a support lead, I need

190:10

to see open cases ranked by urgency so that

190:13

none of the escalations slip. So just

190:17

make sure that every user story covers

190:19

these and is ideally, uh, written in this

190:23

setup because AI is really good at

190:26

pattern recognition and it was actually

190:28

trained on the user story structure

190:30

because it's a very well known and

190:32

well-used, uh, setup. So if you go back to

190:36

something that's familiar to AI, it will

190:38

get you better results. And

190:42

every user story uh is actually made up

190:45

of these, you know, well-known

190:47

structures: the persona, the what (the

190:50

actual need), and the why. So by packaging

190:55

these up and giving it to AI obviously

190:57

with the acceptance criteria based on

190:59

which you can derive the test cases you

191:02

will be able to create very good setup

191:05

and, um, very good results. And

191:08

then if you just connect these user

191:10

stories, daisy chain them up, then that

191:13

will allow you to uh to create a

191:15

coherent system based on which you can

191:17

create your specification and then

191:20

essentially your code. So the software

191:23

development life cycle doesn't change as

191:26

much as a result of AI. It's actually

191:28

the toolkit that we are, uh, using

191:31

that is changing.
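
As a small illustration of packaging a user story for an agent, here is a Python sketch that holds the persona, the need, and the why, plus acceptance criteria, and renders them as markdown. The `UserStory` class, its field names, and the markdown layout are assumptions made for this example, and the two acceptance criteria are invented placeholders rather than anything stated in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class UserStory:
    persona: str  # who has the problem
    need: str     # the what: the actual need
    why: str      # the outcome the persona is after
    acceptance_criteria: list = field(default_factory=list)

    def sentence(self) -> str:
        # The familiar "As a ..., I need ... so that ..." structure.
        return f"As a {self.persona}, I need {self.need} so that {self.why}."

    def to_markdown(self) -> str:
        # Render one story as a markdown section with a checklist of criteria.
        lines = [f"## {self.sentence()}", "", "Acceptance criteria:"]
        lines += [f"- [ ] {criterion}" for criterion in self.acceptance_criteria]
        return "\n".join(lines)


if __name__ == "__main__":
    story = UserStory(
        persona="support lead",
        need="open cases ranked by urgency",
        why="none of the escalations slip",
        acceptance_criteria=[
            # Placeholder criteria, invented for illustration.
            "Cases are sorted from most to least urgent",
            "Escalated cases are visibly flagged",
        ],
    )
    print(story.to_markdown())
```

Each acceptance criterion is a natural seed for a test case, and a list of such stories written to a markdown file in the repository gives the agent the context in a structure it is likely to recognize.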

191:33

Right? So when we

191:37

uh work with systems and when we think

191:39

about what we want to build, I always

191:42

like to ask these four questions:

191:45

whose problem is this? Whose problem are

191:49

we actually solving? So we can we can

191:51

name it to a direct person, direct

191:53

persona uh and it's very much

191:56

quantified. What does winning look like

191:58

for them? So when are they actually

192:01

successful? Are they achieving the right

192:03

outcome? Uh can we help them achieve

192:06

that right outcome uh in a quick way or

192:10

a smooth way or a safe way and what

192:13

would make them refuse to use

192:16

it? It's not available on their

192:18

platform. It's cumbersome to use. It's

192:21

the data security aspect applied. So

192:23

they wouldn't actually use it. And

192:27

would it change a decision? Ideally, we

192:29

want to be impacting how a person makes

192:32

a decision and we'd want to, you know,

192:35

tilt them to making better decisions.

192:37

So, does it change a decision and

192:40

what is that decision that it changes?

192:42

So, once you can answer these four

192:44

questions, then you'll be able to elicit

192:48

better responses from your AI and just

192:50

make sure that you track all of these in

192:52

a good old markdown file in your

192:54

repository so that AI can access it. It

192:58

will just get way more context out of it

193:02

and you know if you just did something

193:05

as generic as build us an agent that

193:08

handles support uh you will not get the

193:11

answer you want. So what we always do is

193:15

go from value. So understand how value

193:19

is created, what constitutes value, how

193:23

the process currently flows, what is the

193:26

underlying architecture beneath it that

193:29

supports that process, and

193:32

then you can start the actual design

193:34

where you can start designing. So we

193:36

like to call this, uh, thinking process VAD,

193:39

value, architecture, design, and this is

193:42

what we want to always go through. So

193:44

always have you know value in mind. How

193:47

are we creating value? What is the value

193:49

we are creating? What is the value that

193:50

your customer is looking for? What is

193:53

the underlying process that supports

193:55

this and how you can design a system

193:57

around it so it best supports the value

194:00

and the process and what process changes

194:03

are needed along the way. So you might

194:06

ask, isn't this just good old product

194:09

management?

194:10

And to a certain extent, yes, it is an

194:13

old skill. It is an old trade that is

194:16

worth picking up and learning because

194:18

this is now becoming, uh, the moat, if you

194:22

will of how you can elicit the right

194:25

requirements, how you can build better

194:26

software because we all have access to

194:28

the same tools. So the difference will

194:30

be who can understand the business need

194:34

better uh because then we can all just

194:37

uh have the latest and greatest model

194:40

write the code for us. So it's old skill

194:43

but new economics, and it's a real

194:47

shift towards the analyst toolkit. So what

194:51

building the wrong thing looks like if

194:54

you've got velocity up

195:07

hey um hi everyone uh thanks for being

195:10

here uh yeah I'm super happy today to

195:12

talk about, uh, automated AI research and

195:16

uh, especially, uh, how those, like, frontier

195:19

models, uh, perform at, uh, automated

195:22

research tasks. Um, so I'm Elie. I work at

195:25

prime as a research engineer and uh yeah

195:28

I will go through our work on on this

195:30

subject. So first I want to basically

195:34

explain a bit why we are doing that and

195:37

why we think it's super important to do

195:38

that in the open. Um so first uh I think

195:43

we we all agree that uh we've heard

195:46

about like big labs saying that this bad

195:50

thing called recursive self-improvement

195:52

is coming very soon. Uh so recursive

195:55

self-improvement is like model training

195:57

models uh without uh human intervention

196:00

basically. Um but uh we don't have any

196:04

benchmark to basically quantify if this

196:06

is true or not right. Uh and even less

196:10

we don't have like a third party

196:11

benchmark by non-big labs to see

196:16

if it's something coming soon or not.

196:18

And the other part is that we think that

196:21

uh it's super important to understand

196:24

how those models uh do research because

196:26

we think that a lot of the scientific

196:28

research that will come into the coming

196:30

years uh will be based also on AI tools.

196:34

So it's super important to understand

196:36

how those model do research not just

196:38

only AI research. So we try to build

196:41

kind of this environment to test the

196:45

capabilities of the model to do so. So

196:48

it all started with uh Andrej Karpathy uh

196:51

that basically had fun by doing this

196:54

video where he trained uh GPT2 from

196:58

scratch in like 90 minutes like GPT2

197:02

training took like weeks and now, well, two

197:05

years ago I think it only took like 90

197:07

minutes. So what does it mean to

197:10

reproduce uh GPT2 in 90 minutes? It

197:13

means that in 90 minutes you achieve

197:15

this target loss. Um and yeah and that's

197:19

at this point when you have the same

197:20

loss as GPT2

197:23

you consider that your model is somewhat

197:26

of equal performance.

197:29

Um

197:30

then what happened is that the community

197:33

took this repo uh this GitHub repo and

197:35

created another one called modded nano

197:37

GPT and this effort was led by

197:40

someone called Keller Jordan. And what

197:42

happened is that they basically

197:46

took this 90 minutes then 45 minutes and

197:48

then now we can train like a GPT2

197:51

validation loss model in less than two

197:53

minutes which is honestly crazy and it

197:56

took like two years to to achieve this.

197:58

So it's a very strong benchmark where uh

198:01

a lot of very talented researchers

198:03

iterated on um yeah so we decided to

198:08

take this environment of speedrun so

198:12

it's kind of a game so the goal of the

198:15

game is to achieve this loss in the

198:17

fewest in the shortest amount of time so

198:20

this is the nanoGPT one and you can uh you

198:24

don't have almost any constraints the

198:27

only constraint that you have is that you

198:29

need to use the same validation and

198:31

training data, right? Um there is a new

198:34

speedrun called the optimizer speedrun

198:36

that was released uh a few months ago

198:39

and here it's slightly different because

198:41

uh you can only change the optimizer

198:44

related parameters. So for instance nano

198:48

GPT you can change the architecture uh

198:50

doe do uh attention whatever uh

198:54

optimizer speedrun you can only change like

198:57

Adam to Muon, Shampoo or whatever optimizer

199:01

is your favorite

199:03

um yeah and so this is a bit more

199:06

researchy because uh it's less about

199:09

optimizing the program to be as fast as

199:13

possible but more like finding the best

199:14

method possible. no matter the the the

199:17

time you put into the computer, right?

199:21

So, um yeah, why take speedrun as an

199:24

environment for automated AI research?

199:27

First, uh we think that it's a good

199:30

evaluation. We'll see later why. Uh and

199:32

this is kind of the main focus of this

199:34

talk. But we also think it's probably a

199:37

good training environment because uh

199:40

it's a way to give the model a reward.

199:42

So the reward is positive if the model

199:45

beat the speedrun and beat the last

199:47

record sorry and the reward is zero or

199:50

negative if it didn't manage to to do

199:53

it. So it's a good environment to train

199:55

model. It's also quite fast like as you

199:57

see uh previous records were around 2

200:01

minutes. For the optimizer one, uh each

200:03

run takes about like 15 to 20 minutes and

200:07

uh yeah and there are like clear rules

200:10

basically and we also think it's like a

200:12

good environment to make discovery so

200:15

like kind of breakthrough in our

200:17

research because uh there are those clear

200:20

rules that you can verify or not. Um

200:23

yeah. So yeah.
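
The reward described above can be sketched in a few lines of Python. This is only an illustration: the function name, the generic "cost" (wall-clock time or optimizer steps), and the choice of zero rather than a negative value for a failed run are assumptions, not the setup used in the talk.

```python
def speedrun_reward(reached_target: bool, run_cost: float, record_cost: float) -> float:
    """Hypothetical reward: positive only when the run beats the current record.

    run_cost and record_cost are whatever the speedrun measures
    (seconds or training steps); lower is better.
    """
    if not reached_target:
        return 0.0  # the run never hit the target validation loss
    if run_cost >= record_cost:
        return 0.0  # hit the target, but no faster than the record
    # Scale the reward by the relative margin over the previous record.
    return (record_cost - run_cost) / record_cost
```

A run that reaches the target loss with a lower cost than the record gets a reward in (0, 1]; everything else gets zero.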

200:26

Um so what we did uh so the release was

200:30

like about two months ago and uh there

200:34

was this optimizer speedrun and we

200:36

decided to basically compete with the

200:38

community by launching two AI agents. So

200:41

Codex and Claude Code. Codex was like GPT

200:44

5.5 with XI and uh Claude Code was Opus

200:49

4.8 with XI. Um and yeah, we decided to

200:54

basically let the agent free on our

200:55

cluster uh and uh and just iterate on

200:59

it. So we have like V1, V2, V3 is just

201:02

basically us stopping the agent and then

201:05

restarting. V3 uh was like one or two

201:08

day before the release because we saw

201:10

that our agents no longer have the best

201:13

record. So we were like okay take all

201:15

the the human uh record in the last few

201:18

week and just try to to improve upon it

201:22

and and and it worked. Yeah. And we also

201:24

have this novelty track where the goal

201:26

is to uh beat the record with only novel

201:30

ideas. Um and we'll see that this this

201:35

was more complex for the the models.

201:38

So our harness is very simple. Honestly, we

201:41

could have just replaced it with

201:43

/goal, but there was no /goal

201:45

at the time. So, we made our own

201:47

goal.md. It's actually quite fun that we

201:49

chose the same name and we had the

201:52

goal.md and kind of an agents.md that

201:54

define the rules and we let the agent

201:57

propose ideas and then it can submit a job

202:01

with sbatch on our Slurm cluster and uh

202:06

basically the way it works is that it

202:08

can submit on nodes that are available

202:11

but only under a certain partition

202:13

which means that if someone want to use

202:15

this node uh the model just like cancel

202:18

the job it's called a preemptible

202:20

partition. So yeah, then it measures the

202:23

it reads basically the training logs then

202:25

decides if it's a record or not. To

202:27

validate a record you need to basically

202:29

pass a statistical threshold to make

202:31

sure that it's not just seed

202:33

optimization and it's not just random.

202:35

Right?
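
One way such a statistical threshold could look is sketched below in Python. The talk does not say which test is used, so this is an assumption: it accepts a record only when an upper confidence bound on the mean validation loss over several seeds is below the target. The function name, the `alpha` default, and the normal approximation (chosen to stay within the standard library; a Student's t bound would be stricter with few seeds) are all illustrative.

```python
from statistics import NormalDist, mean, stdev


def is_valid_record(val_losses: list[float], target_loss: float, alpha: float = 0.01) -> bool:
    """Hypothetical check: is the mean loss over seeds below target with confidence 1 - alpha?"""
    n = len(val_losses)
    if n < 2:
        return False  # a single seed cannot rule out a lucky run
    std_err = stdev(val_losses) / n ** 0.5
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value (normal approximation)
    upper_bound = mean(val_losses) + z * std_err
    return upper_bound < target_loss
```

With this kind of check, a single lucky seed that dips under the target is not enough to claim a record; the improvement has to hold across reruns.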

202:37

So yeah a few results from this

202:39

experiment. The first one that was

202:41

honestly very painful to work with is

202:43

that Claude uh Claude Code kept stopping

202:47

every nine or 10 hours and basically

202:49

said yeah I cannot improve the record

202:52

it's too hard for me there is no way to

202:54

to go beyond it and then I was just like

202:57

okay continue explore new direction hey

203:01

just go again for 10 hours and then say

203:04

yeah I cannot beat the record and so on.

203:07

So basically one-third of the time the

203:09

Claude Code agent was idle because I had

203:11

no way to basically monitor it and

203:13

Codex totally the opposite just worked

203:17

for all the all the time and uh yeah

203:21

almost never idle never asked for

203:22

question and and and very impressive in

203:25

that way. Um

203:28

we also give the option for the model to

203:31

basically write uh a bunch of stuff into

203:34

what we call a scratch pad which is

203:36

basically the active memory of the

203:37

model. Uh we observe that basically

203:42

Codex writes a lot on the scratch

203:45

pad. So each plot that I will show is

203:47

kind of normalized by the number of

203:49

active hours. So this is not only about

203:52

codex working more it's it's really

203:55

different behavior.

203:56

So yeah, you see that uh writes a lot

203:59

more to to this scratch pad to this

204:01

memory and uh the shape of the like the

204:05

the I don't know the tone of the the

204:08

each file was also super different like

204:10

Claude was super excited about getting new

204:12

record with a bunch of emoji and so on

204:15

and Codex was just like here is what I do

204:19

here is the decision I take what I will

204:21

do next like super robotic kind of um

204:26

Yeah, we also have this plot where

204:28

basically we saw that codex was spawning

204:31

many more subagents than Claude. Uh we

204:34

saw that Codex burned many more tokens than

204:37

Claude. So I think in total it was like

204:40

1 billion tokens but it's like

204:43

there is obviously this input tok uh

204:45

input caching that make it it's not like

204:48

one billion output token. Uh so yeah we

204:52

also see that codex did a lot of

204:53

compaction because it only had like 250k

204:56

context window and Claude only does it like

204:59

one per hour and codex is more like

205:03

no it's even less than one per hour, I

205:05

mean one for the full run for Claude and

205:08

codex was like one uh was 20 every one

205:12

hour. So yeah

205:15

um yeah here is the main results. So

205:18

what this plot shows is that basically

205:21

we so in in white you see that the human

205:27

record progression right and in red you

205:30

see Claude I mean it's supposed to be

205:32

orange but whatever and in blue uh you

205:34

see Codex right and you see that at

205:38

almost every time uh Claude and Codex are

205:41

better than the human record and Claude is

205:43

super good at the beginning very very

205:45

fast to achieve very good score. Um,

205:49

yeah, and one thing that is super

205:50

important is that the model have the

205:52

ability to basically fetch the human

205:55

records at any time and that's what

205:57

Codex did. That's what Claude did. Sorry.

206:00

Because when I restarted it, it

206:01

basically fetch the new record from

206:03

human and improve upon it. Um, yeah. So

206:07

the result is that uh I think at the

206:10

time the best record was like uh 2,990

206:16

steps and we beat it by like uh uh 50 or

206:20

60 steps for Claude and Codex was like 20

206:24

steps above. So I think it's both

206:25

impressive and and yeah um

206:29

so we so this is like not released yet.

206:32

This is something that we are working on

206:34

currently and basically the idea is that

206:37

this is a cool experiment to do but it

206:40

lacks structure, right? uh if you want

206:42

to do a real benchmark, you want to do

206:44

multiple seeds, you want to do uh yeah

206:48

proper uh thing where you you you you

206:51

basically put all the model and

206:53

harnesses in the same conditions,

206:54

right? So this is what we are working on

206:57

right now and basically um the idea is

207:00

to do three different track uh one

207:03

without any access to really like

207:05

measure the capability of the models to

207:07

do AI research based on only the model

207:10

weights' knowledge, one with only arXiv

207:14

papers and one with like full access. So

207:17

it also have access to the the like the

207:20

latest record by human. And for this we

207:23

plan to do both uh the nano GPT track

207:26

one which is the original one and the

207:28

optimizer speedrun where we we only

207:30

launch uh we only constrain the the

207:33

optimizer to be to be novel basically.

207:37

Um yeah so I will present some result on

207:40

the optimizer speedrun. Uh this is

207:43

basically what we got. So we let the

207:46

agent iterate for six day almost five

207:49

days let's say and we see that uh Codex,

207:53

Kimi and Claude uh are super effective so

207:57

for GLM this is not a finished run right

207:59

so the model is actually still iterating

208:03

on the cluster right now but we see that

208:05

Claude is once again very good at it and

208:08

we see that surprisingly Kimi is also

208:10

very competitive and kind of have this

208:13

breakthrough on day four where it kind

208:16

of beat Codex with a new record, right?

208:19

It's also interesting to see that uh

208:22

Claude is much more like progressive in

208:24

the way it improved the record and Kimi

208:27

has really this step function where it

208:29

kind of does a breakthrough and so on. Uh

208:32

so this is an interesting plot because I

208:34

mean six days is quite a lot for an eval uh

208:38

uh but you you can change this uh axis

208:40

to the number of output tokens and

208:43

then kind of tell a different story

208:45

because in max mode consumes so much

208:48

more tokens than Codex and Kimi and you

208:52

also see that Kimi is actually super

208:54

efficient uh for the number of tokens

208:56

that uh it uses. So it's Kimi K2.7

209:00

code. Um so yeah uh we also see that

209:05

they have a different approach to uh

209:08

using the literature and papers. Um so

209:12

for instance like Claude is doing a lot of

209:14

search on papers and actually Claude

209:16

found a paper that no other model found

209:19

and it actually led to the best record.

209:22

So it's kind of funny and uh yeah um one

209:28

of the main issue of all of this is that

209:31

uh when I when I launched this this

209:34

agent and I think that's something

209:36

important that I want you to to kind of

209:39

uh remember from this talk is

209:42

that when I launched this these

209:43

different agents I was expecting them to

209:45

come up with some crazy ideas on

209:48

optimizers that like no one has discovered

209:51

but honestly it wasn't the case. Uh they

209:53

did some clever tricks where basically

209:55

they combine different papers. uh they

209:58

kind of do plus one improvement over a

210:01

bunch of methods but there was really

210:03

like no novel optimizer or mechanism

210:06

that was uh coming from those models and

210:09

I think that's kind of telling that even

210:12

on something that is not simple but I'd

210:15

say that it's kind of accessible for

210:18

people right for like human researcher

210:21

uh spending like days and weeks for the

210:24

the model like cannot like find new uh

210:28

optimizer and mechanism. So we believe

210:31

that there is a way to basically make it

210:36

more um make it better for discovery

210:39

instead of evaluation. And this is

210:41

coming from uh this is very inspired

210:43

from alpha evolve by Google and also a

210:46

bunch of papers that have been released

210:47

since then. It's kind of this multi-agent

210:50

system that interacts together, uh a

210:53

bunch of generators. You have closed

210:56

models but you also have open source

210:58

models here that are super effective for

210:59

the cost right. Uh they can suggest

211:02

ideas then you run the speedrun so you

211:04

get the reward then you have a judge

211:06

that basically give a quality feedback

211:09

can also be like the judge also have

211:12

this taste. you can kind of have like

211:14

the judge have a taste about the the

211:17

method if it's good or not. Uh if it's

211:20

outside the loop and then you can uh

211:24

basically decide which method you want

211:27

to scale to a larger number of

211:29

parameters and number of token. Um so

211:33

this is kind of the scale part of the

211:36

speedrun because a lot of methods in

211:38

the the speedrun community uh people are

211:41

often saying that they don't work at

211:43

large scale. So I think it's very

211:44

important to also put scale elements in

211:47

this loop. Uh and I think also that uh

211:52

humans are super useful here to basically

211:54

judge the ideas of agents, kind of steer

211:57

them in the right direction and so on.

212:00

Um yeah so we didn't try it yet I mean

212:03

we are kind of trying it right now and

212:06

uh we hope that this will lead to to to

212:08

new discoveries in AI research at least

212:11

and also a way is that you can define

212:14

multiple speedrun so this is the next

212:17

slide if you like it's from SoftBank

212:21

slides but if you if you don't have the

212:23

reference good for you means that that

212:25

you're not too online uh but the idea is

212:28

that uh by changing the objective

212:30

and the constraints of the speedrun you

212:33

can basically create a lot of diversity

212:35

and constrain the model to go into a

212:37

certain direction and uh yeah and make

212:40

those discoveries.

212:42

So uh at Prime Intellect we are doing a bunch of

212:45

stuff in this direction. Uh there is a

212:47

bunch of stuff here that we I mean most

212:49

of it we didn't release yet but we are

212:51

working on GPU sandboxing to allow model

212:55

to iterate into sandbox because you need

212:57

GPU sandbox for this kind of stuff. We

213:00

are working on our own agents that are

213:04

very efficient for like

213:07

framework. So it means like you have a

213:09

file system and you can write

213:10

information read from it. Uh and you

213:13

also do like this programmatic tool

213:15

calling thing. We're also training a model

213:17

to be good at it on top of like open

213:19

source model. And uh the thing that we

213:22

already released is that we have the set

213:24

of libraries and products called verifiers,

213:27

prime-rl, training, where you can basically

213:30

train and evaluate any environments on any

213:33

harness and the model that you can train can

213:36

be like GLM 5.2 too which is very big

213:38

and and yeah we have like we work a lot

213:40

on making those libraries very efficient to to

213:43

ship the best quality for for our

213:45

clients. Yeah. Uh I mean yeah super

213:50

excited about this domain. Once again I

213:52

think it's super important to have uh a

213:55

part of like this recursive

213:56

self-improvement to happen in the open

213:59

because there is actually a lot of

214:00

people working that are not at big labs.

214:03

So you need to basically uh yeah make it

214:07

easy for people to understand how those

214:09

models work to do research and so on. So

214:11

that's kind of our goal and uh yeah,

214:13

thanks a lot

214:39

and I'm a software engineering tech lead

214:41

at Meta working on building a training

214:43

and inference infrastructure for the

214:46

Meta Superintelligence Labs and their

214:47

infrastructure organization.

214:50

Today we're going to be talking about

214:51

production evaluation for agentic systems.

214:54

When most people hear the word

214:56

evaluation, they think about benchmarks.

214:58

A model scores 90% on a benchmark. A new

215:01

version scores 92%. The team celebrates.

215:04

But agent systems have fundamentally

215:06

changed what the evaluation means. Today

215:09

the systems don't simply generate

215:10

answers. They plan, they call tools,

215:13

they retrieve information. They execute

215:14

workflows. They interact with the

215:16

production infrastructure. The question

215:18

is no longer did the model generate the

215:20

right answer. The question is did the

215:22

system behave correctly. Today I would

215:25

like to discuss how evaluation is

215:27

evolving from model benchmarking into

215:29

production infrastructure.

215:35

This is the problem almost every AI

215:37

organization is encountering today.

215:39

Offline benchmarks continue improving.

215:41

Yet production reliability often remains

215:43

unpredictable. Why is that? Because

215:46

benchmarks measure model capability.

215:48

Production measures system behavior. A

215:50

benchmark doesn't capture tool failure,

215:52

API outage, context changes, user

215:55

variability, long-running workflows. And

215:58

as systems become more autonomous, the

216:00

gap between the benchmark performance

216:01

and production performance grows. The

216:03

result is what many teams experience

216:05

today. High benchmark scores as you can

216:08

see, but unreliable production behavior.

216:14

Traditional evaluation focuses on outputs.

216:17

But we should ask the question, did the

216:19

model produce a correct answer? Agentic

216:21

systems force us to ask a different

216:23

question. Did the system behave

216:24

correctly? Behavior includes planning

216:27

quality, tool usage, execution, workflow

216:30

execution, recovery from failures,

216:32

decision making. In other words, we are

216:35

moving from evaluating answers to

216:37

evaluating workflows. And that requires

216:39

fundamentally different evaluation

216:41

architectures.

216:43

Many teams still think hallucinations

216:45

are the primary AI failure modes. In

216:47

production, they are often just one

216:49

category. Agentic systems introduce an

216:52

entire hierarchy of failure modes. At

216:54

the very foundation are the memory failures,

216:57

retrieval failures, safety failures. As

216:59

you go up, you have to think about

217:01

reasoning mistakes, poor planning,

217:03

incorrect tool execution. At the highest

217:05

layer, you have to think about multi-agent

217:06

coordination failures. And this is

217:08

why evaluating only model output misses

217:12

most of the production risks we observe.

217:15

One of the most useful mindset shifts is

217:17

to stop thinking like researchers and

217:20

start thinking like an SRE or a production

217:22

engineer. SREs don't measure success

217:25

using accuracy. They measure

217:26

reliability, availability, latency, cost,

217:28

recovery. And agentic systems require the

217:31

same approach. The goal is not

217:33

maximizing the benchmark scores. The

217:35

goal is to maximize dependable outcomes.

217:37

Reliability becomes the northstar

217:39

metric. Accuracy becomes only one input.

217:47

This pyramid is how I

217:49

personally think about modern AI

217:50

evaluation systems. At the bottom you

217:52

can see there are benchmarks. They're

217:54

useful. They're scalable. They're

217:55

repeatable. But the operational value is

217:57

limited. In the middle there are scenario

217:59

based evaluations. These simulate

218:01

realistic workflows. And at the very top

218:04

you see production telemetry. This is

218:06

where the highest-value evaluation

218:07

signals come from. The surprising

218:09

insight is that the most valuable evaluation data

218:11

often comes from real users interacting

218:13

with real systems.

218:17

Now let's talk about offline evals. So

218:19

offline evaluation still matters but the

218:21

methodology changes. Instead of

218:23

evaluating prompts we evaluate

218:25

scenarios. For example, a customer

218:27

support workflow, a code generation

218:28

workflow, a research workflow. The agent

218:30

operates inside the simulated

218:31

environment. We measure the task

218:33

completion rate, tool correctness,

218:35

planning quality, resource usage which

218:37

which becomes exponentially high at

218:39

high scale. The key takeaway: agent

218:42

evaluation should be scenario driven not

218:44

prompt driven.
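
As a rough sketch of what scenario-level scoring might look like, the Python below aggregates per-scenario results into the metrics just listed. The `ScenarioResult` fields and the `summarize` function are hypothetical names chosen for illustration; the talk does not describe a concrete schema.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    """Outcome of one simulated workflow run (hypothetical schema)."""
    scenario: str
    completed: bool
    tool_calls: int
    correct_tool_calls: int
    tokens_used: int


def summarize(results: list[ScenarioResult]) -> dict[str, float]:
    """Aggregate scenario runs into completion, tool-correctness and resource metrics."""
    if not results:
        raise ValueError("no scenario results to summarize")
    total_calls = sum(r.tool_calls for r in results)
    correct_calls = sum(r.correct_tool_calls for r in results)
    return {
        "task_completion_rate": sum(r.completed for r in results) / len(results),
        # With no tool calls at all there is nothing to get wrong.
        "tool_correctness": correct_calls / total_calls if total_calls else 1.0,
        "avg_tokens_per_scenario": sum(r.tokens_used for r in results) / len(results),
    }
```

The unit of evaluation here is a whole workflow run, not a single prompt and response.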

218:47

Once a system reaches production, every

218:49

interaction becomes a signal. This is

218:51

one of the biggest shifts in evaluation

218:53

thinking. Production traffic is no

218:55

longer just traffic. It becomes

218:57

evaluation data. We collect execution

218:59

traces, user outcomes, escalations,

219:02

failures, feedback signals. Production

219:04

is the largest and the most

219:05

representative validation data any

219:07

organization will ever have.

219:12

Many organizations view humans as

219:13

fallback systems. I think that's a wrong

219:16

framing. Humans are the evaluators. They

219:19

provide signals that automated systems

219:20

cannot. They assess correctness, trust,

219:23

usefulness, safety. These signals become

219:25

really critical for calibrating

219:27

evaluation pipelines and identifying

219:29

blind spots in automated metrics. The

219:32

most successful systems combine

219:33

automated evaluation with targeted human

219:35

review.

219:38

Now, agent systems drift constantly.

219:41

Models change. We have a new version

219:44

every couple of weeks or months. The

219:46

prompts can change. Tools can change.

219:48

User behavior can change. The challenge

219:50

is that no single change appears

219:52

catastrophic. Reliability slowly

219:54

degrades. Success rate declines.

219:56

Escalation increases. Tool failure

219:58

rises. Without continuous evaluation,

220:00

teams often don't discover drift until

220:02

users complain. Continuous monitoring

220:05

becomes essential.
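
A minimal Python sketch of this kind of continuous check follows. It assumes a rolling window of task outcomes compared against a fixed baseline success rate; the class name, window size, and tolerance are illustrative choices, not values from the talk.

```python
from collections import deque


class DriftMonitor:
    """Hypothetical rolling success-rate monitor for a deployed agent."""

    def __init__(self, baseline_success_rate: float, window: int = 100, tolerance: float = 0.05):
        self.baseline = baseline_success_rate
        self.tolerance = tolerance
        self.outcomes: deque[bool] = deque(maxlen=window)  # keeps only the latest outcomes

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def drifted(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # wait for a full window before alerting
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate < self.baseline - self.tolerance
```

Because the window slides, a slow decline in success rate eventually trips the check even when no individual change looks catastrophic.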

220:07

Observability

220:09

and evaluation are inseparable.

220:11

Inseparable. To evaluate an agent, we

220:13

need visibility into the reasoning

220:15

paths. The tool calls, the memory

220:17

access, execution timelines, the

220:19

state transitions. As you can see

220:20

here in this chart, traditional logs are

220:22

not sufficient. We need detailed traces

220:25

just like with any

220:28

deeply nested microservice architecture for

220:30

any application or service. We're

220:32

talking about agent traces become the

220:34

equivalent of distributed tracing for

220:35

autonomous workloads. Without

220:36

observability, evaluation becomes

220:38

guesswork.

220:43

Now let's talk about the continuous

220:44

evaluation loop because evaluation is an

220:46

always running service not a testing

220:48

phase.

220:50

Historically evaluation always happened

220:51

before deployment but now evaluation

220:53

continues after deployment. Telemetry

220:55

identifies issues, as you can see. A

220:58

human reviews the edge cases. Feedback

221:00

improves the data sets. Offline

221:02

scenarios validate updates. The loop

221:04

never stops. Evaluation is no longer

221:07

just a phase. It's an operational

221:08

capability.

221:11

Now, this is probably the most important

221:12

slide in this presentation. Every metric

221:15

shown here maps to a business outcome.

221:18

Task complete.

221:23

>> Okay, I think we're live and welcome

221:25

back for those on the stream and those

221:27

those in person. um we tend to

221:30

basically take these longer sessions

221:32

between uh all the sort of mainstage

221:34

keynotes to reflect on things that um

221:38

you know are particularly important but

221:40

like don't have like a significant like

221:42

a sort of launch moment. Today we're

221:44

very lucky to have people working on

221:46

Omni and Veo, Nano Banana, like the, you

221:49

know the world's best generative models

221:51

here with us. Uh Demetrio I I I first

221:54

saw you when you were posting about your

221:56

office

221:59

Um, I think you're you're probably

222:01

number one uh Google Google's number one

222:04

office influencer at least in in San

222:05

Francisco. I think you like you like to

222:07

bike as well. You like to take photos of

222:09

bike here.

222:10

>> Yeah. Um, but you know, but also you

222:13

work on video models.

222:14

>> That's right.

222:14

>> Um, Shane, I I met you I think at like a

222:17

dinner.

222:18

>> Yeah.

222:18

>> Um, and uh and uh and I I remember you

222:23

were trying to get me invested in like

222:24

one of the companies. I forget forget

222:26

which one. H forget about that.

222:30

>> But now, but now you're um now you're

222:32

working in Omni Thinking um and and just

222:35

you know a bunch of other

222:36

>> Gemini RL.

222:38

>> Yeah. Yeah. Uh and Nicole also uh the

222:41

rest of the gen media models u nano

222:43

banana and uh all and everything you

222:46

just launched actually even this week.

222:47

Uh

222:48

>> yeah, we launched some APIs.

222:49

>> Yeah. Yeah. Yeah.

222:50

>> And I haven't tried to convince you to

222:52

invest in anything but maybe I should. I

222:54

mean, so I try not to be an investor.

222:56

People just convince me anyway. I'm like

222:58

just okay, well, I'm not that rich, but

222:59

you know, like you can't not try to

223:02

invest in some of these things. And you

223:04

know, for those of us who are not

223:05

working at a Frontier Lab, this is the

223:07

best this closest that we'll ever get.

223:09

Um, so yeah, actually, let's kind of

223:11

recap since you're closest to it and we

223:12

just did it like what was launched this

223:14

week. What should people go try out?

223:16

>> Yeah. Um, so yesterday we had two launch

223:18

moments. U one of them we launched

223:20

Nano Banana 2 light uh which is our

223:24

fastest, cheapest um image model in the

223:26

Nano Banana model family. Um and it's

223:29

better than the original Nano Banana. Um

223:31

so really for most people um that model

223:33

replaces what you you know used and love

223:35

the original Nano Banana for across like

223:37

generation and editing and it gets

223:39

really close to the frontier quality of

223:42

of the kind of mainline bigger models.

223:44

So that that's really exciting. I think

223:46

if you look at some of the demos or like

223:48

things that people have been trying like

223:49

getting kind of that like 3-second

223:51

latency just unlocks a whole bunch of

223:53

things that you can do with like

223:54

ideation and iteration and it's just

223:56

really fun and the model's getting to a

223:58

point where like the quality is really

223:59

good um where um it you know you can use

224:02

it for iteration but you can also use

224:04

some of those outputs as just kind of

224:05

like ready um production output. So

224:07

that's really exciting. Um and then

224:09

second launch we finally um launched the

224:11

Gemini Omni Flash APIs um that we

224:14

pre-announced at I/O. So, thank you for

224:16

waiting. Um, and that, you know, is the

224:20

first time that we're making the APIs

224:22

available for developers and it's

224:23

basically really exciting kind of video

224:25

generation and editing and we're pricing

224:27

it the same as Veo 3.1 Fast. So, we're

224:29

getting you kind of like really really

224:30

good quality for a really awesome price

224:32

hopefully. Um,

224:34

>> yeah, I mean that that's incredible. I'm

224:36

actually really So, when you guys

224:38

launched Omni for the first time, you

224:40

also did a podcast uh with Logan who

224:42

couldn't be here today. uh and you added

224:44

like a sloth uh and and ramen and all

224:47

these all these things. I actually

224:48

really want to do that to our videos. I

224:50

just didn't have an API for it because

224:51

obviously I have to automate the whole

224:52

thing. So, thank you for the API.

224:54

>> Uh that is my favorite use case.

224:55

Everybody should do that. Um I got a cat

224:58

which was probably like the most boring

224:59

of the animals. Um if you don't know

225:01

what we're talking about, you should

225:02

look it up. It's very funny. Fofr. Um

225:04

Fofr, who's um you know on on the team

225:06

did that.

225:06

>> Fofr is the number one guy you should

225:08

follow. You should follow get ideas on

225:11

okay, what can this thing do?

225:12

>> Yes. Right.

225:13

>> Yes. He he's he's amazing at that.

225:15

>> I've tried to get him for the last two

225:17

years to come to AI. He hasn't made it

225:19

yet. He's actually come in person. He

225:21

just didn't want to speak because he's

225:22

anonymous.

225:23

>> I know.

225:24

>> I I want to say his real name, but I

225:25

can't say his real name.

225:26

>> No, no, no. We won't we won't do that to

225:27

him, but you should really follow him.

225:29

He's amazing.

225:30

>> He did all that work.

225:31

>> I actually met him uh in the office uh

225:33

when we did the podcast, I think, and I

225:35

didn't realize it was him. So, his badge

225:38

doesn't say fofr. It says

225:39

>> Yeah, I know. So he used to be part of

225:42

uh Replicate and Replicate had this joke

225:44

where like everyone was deepfates. deepfates

225:46

is this like kind of mysterious

225:47

character. Replicate. Replicate is very

225:49

cool company and he was part of it. Um so

225:52

okay, one thing I want to get on there

225:54

before I go into like sort of the the

225:56

the sort of omniprop is we added cats,

226:00

we added sloths. Very cool, very cute,

226:03

very fun. uh what are the you know

226:05

inspire people as to like what are the

226:06

more sort of workhorse use cases that

226:08

maybe are not just demos you know

226:11

>> yeah so so obviously the hero capability

226:13

of the model or maybe there's two like

226:15

one is the ability to kind of take in

226:17

anything as input and then get video on

226:19

the other side obviously in the future

226:21

and and we've kind of talked about this

226:22

as a pre-announce like we want to get

226:23

the other output modalities out as well

226:25

but basically what that means is you

226:27

know you can take a set of images that

226:28

you have as maybe a storyboard you can

226:30

take like an audio track as a reference

226:32

of you know like a voice that you want a

226:34

character to speak and then you can get

226:36

a video on the other side. So like that

226:38

just unlocks a whole bunch of things

226:39

that you can do in like you know short

226:40

film production or you know shorts we've

226:43

launched on YouTube as well um to help

226:45

creators kind of like create um content

226:47

more easily. Um and then the other one

226:50

is obviously video editing like that's

226:51

another thing that we're really excited

226:52

about that we're just making easier

226:54

because now you can use natural language

226:56

to take a video you know add something

226:58

remove something. Sloth is obviously

227:00

like fun example. Um, but there there's

227:03

obviously kind of there's consumer use

227:05

cases that we kind of had in mind where

227:06

you know you could take your beach

227:08

vacation video that was too noisy and

227:10

you want to clean up that noise. Maybe

227:12

in the past you wouldn't have because

227:13

you didn't have the tools or you didn't

227:14

know what the tools were that you needed

227:16

to go to. So that's one use case that

227:18

you can, you know, go to. We've seen a

227:20

lot of folks use it for kind of

227:22

marketing ad campaign creation and I'm

227:24

excited to see more of those use cases

227:27

as we launch the APIs. um because

227:29

obviously like we don't we don't see all

227:30

of it in the first party products but

227:32

I'm really excited for people to start

227:34

to explore that um in the API. So those

227:36

are just some of the kind of like high-level

227:38

um things that have come up. Uh, people

227:40

also use it to create like education

227:42

materials. Yes. Um and like like that's

227:45

really exciting. I think we're all we've

227:47

all kind of talked about being excited

227:49

about the future of education where like

227:51

everything can be kind of customized to

227:52

you and personalized to your knowledge

227:55

level and the style that you prefer and

227:57

and so this is kind of just like a step

227:59

in that direction.

228:00

>> Yeah. I I I've sort of actually used

228:02

just Nano Banana yesterday with my my parents

228:03

are visiting and there was there was a

228:05

very fun sort of use case that I bought

228:07

some gadget off Amazon that they wanted

228:09

and the instructions to use it was were

228:12

only in English and there was plenty of

228:13

diagrams or whatever and I took a

228:14

picture of it and said you know

228:15

translate this into Romanian. Yes.

228:17

>> And keep everything else the same,

228:18

right? So it was amazing, right? Like it

228:20

was just like yeah it looks identical

228:22

and it has you know it's perfectly

228:24

translated. I mean more or less, right?

228:26

But it's it's you know using Gemini

228:28

under the hood obviously to kind of do

228:30

the translation. So you can you can see

228:32

this use case for video as well right

228:33

like the the power of text rendering in

228:36

in in Omni is is quite next level. So

228:39

and you could you could you could think

228:40

about plenty of use cases of like both

228:42

text rendering, translation,

228:44

internationalization, all sorts of things that

228:45

would be actually genuinely useful to a

228:47

lot of different people and sort of

228:49

broader access to either you could like

228:52

redub a video or whatever it is that you

228:54

wanted to do. like there's plenty of

228:55

different things that you could you

228:56

could think about doing.

228:58

>> Yeah. Um one of the most enlightening

229:01

conversations I have on my podcast is

229:03

with uh this people researchers at the

229:05

frontier of these things. Um I had one

229:07

with um Ethan from the xAI video team,

229:10

the Grok video team who was basically

229:12

saying like you know the next trend is

229:14

actually not just like single model,

229:16

it's more like video agents. Mhm.

229:18

>> Um, and I don't know if that terminology

229:20

resonates uh obviously for for very

229:23

relevant for RL. Uh, but it was it was

229:25

basically kind of like giving up on like

229:26

trying to do everything in in

229:28

effectively one pass. Um, do you feel

229:31

that same way or is it still an open

229:33

research question which way the trends

229:35

are going?

229:37

>> Yeah. So um what kind of excite me most

229:40

is really when the symbolic kind of

229:42

foundational models and this kind of

229:43

like video foundational model can

229:45

actually kind of really work together

229:47

and u in a way the if you look at the

229:49

beginning of the generative sort of like

229:51

image generation video generation a lot

229:53

of it kind of started when the language

229:54

model got good enough to provide a very

229:57

detailed captioning like from Stable

229:58

Diffusion days or kind of DALL-E 2 days.

230:01

So um so basically like language is

230:04

extremely u helpful representation uh

230:07

one is that it's kind of universal but

230:09

the other kind of more um technical

230:11

thing like kind of my hypothesis is like

230:13

um one very difficult thing about

230:15

machine learning is um this sort of like

230:18

spurious correlation, so you don't know

230:20

you know if the if this kind of feature

230:22

right that's kind of predictive is

230:23

actually causal factor or not there are

230:25

two ways one is we can have really

230:27

diverse data training data like from

230:30

every intervention of the causal graph.

230:32

The other is you condition on the causal

230:33

information, and conditioning on language

230:35

is kind of like conditioning on like a causal

230:38

information of the of the kind of world.
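
The point about conditioning on causal information can be illustrated with a small simulation. This is a minimal sketch under assumed numbers, not anything taken from the speakers' work: a binary `cause` stands in for the information a caption would carry, `feature` is a spurious feature that tracks the cause without affecting the outcome, and the 0.9 and 0.8 agreement probabilities are arbitrary choices made for illustration.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation for two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n=20000, seed=0):
    """Return (cause, feature, outcome) triples; every value is 0 or 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        cause = rng.random() < 0.5
        # The feature usually agrees with the cause but never influences the outcome.
        feature = cause if rng.random() < 0.9 else not cause
        # The outcome depends only on the cause, plus noise.
        outcome = cause if rng.random() < 0.8 else not cause
        rows.append((int(cause), int(feature), int(outcome)))
    return rows

def feature_outcome_corr(rows):
    """Correlation between the spurious feature and the outcome."""
    return pearson([f for _, f, _ in rows], [o for _, _, o in rows])

rows = simulate()
marginal = feature_outcome_corr(rows)
# Condition on the cause by splitting the data into its two strata.
conditioned = [feature_outcome_corr([r for r in rows if r[0] == c]) for c in (0, 1)]

print(f"unconditioned: {marginal:.2f}")
print("conditioned on cause:", [f"{c:.2f}" for c in conditioned])
```

Without conditioning, the feature looks predictive (by construction the correlation is 0.8 × 0.6 = 0.48 in expectation). Within each value of the cause, the correlation falls to roughly zero, which is the sense in which conditioning on the causal variable removes the spurious signal.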

230:40

So um

230:41

>> which is a prompt or a concept or what?

230:44

>> Yeah, exactly. So if you look at like

230:46

you know how you going to describe this

230:47

video, how you going to this kind of

230:49

image is actually very close to you know

230:51

how would this kind of causality you

230:53

know behind this like how this is kind

230:54

of generated. So one is like that can

230:56

really allow for very rich

230:58

generalization and then uh very kind of

231:01

just like a good model. Um the other is

231:04

so eight months ago uh we put the

231:06

evaluation paper called Video Models Are

231:09

Zero-Shot Learners and Reasoners.

231:10

>> Yes. So that was a kind of you know it's

231:14

it's a confirmed paper and then later on

231:15

actually the Nano Banana team followed up

231:17

with the vision banana paper that

231:19

basically used a banana to do but

231:21

essentially the idea is uh video model

231:23

is extremely good sort of a foundation

231:26

model for space and time kind of

231:27

information. So um classic computer

231:30

vision tasks a lot of could be kind of

231:32

zero-shotted and when you like say

231:35

feeding some like a visual quiz uh it

231:37

can you know there's definitely like a

231:39

lot to improve it can kind of solve and

231:41

it can um like robotics kind of like

231:44

seeing it has really good kind of

231:46

physical intuitions like world model uh

231:48

and I think the the key is really the

231:51

kind of mix of the visual kind of

231:53

reasoning and then the text kind of

231:55

reasoning kind of all tied together Um

231:58

obviously you know like whether doing it

231:59

you know as kind of unified model versus

232:01

like this kind of agent or exploration I

232:03

think that's more like uh it's going to

232:06

be more kind of incremental you know how

232:08

it's going to I imagine everything's

232:09

going to go into like a single model

232:10

eventually

232:11

>> but right now there's like a lot you can

232:13

do if you uh basically take like really

232:15

good video understanding image

232:17

understanding Gemini agentically with

232:19

Omni, and that's actually going to

232:21

yeah our team is like exploring a lot

232:24

>> yeah okay that there's a there's a lot

232:26

in there um I I think uh one question I

232:29

I am increasingly starting to wonder is

232:31

does it all trend towards one product

232:33

for you guys right like now you have

232:34

multiple models out the naming of omni

232:38

does imply that eventually everything

232:40

will go away and it just goes into Omni

232:42

um is that the plan

232:46

>> is it I don't know I I think I think uh

232:51

maybe I mean I think eventually I I

232:54

think there's sort of different

232:55

trade-offs engineering research product

232:57

trade-offs in like it's like for the

233:00

same reason like the the sorry how is it

233:03

called Nano Banana Lite I don't know

233:04

what the product name is

233:05

>> Nano Banana 2 Lite

233:06

>> Nano Banana 2 Lite, yeah, right, it's

233:08

it's it's it serves a particular niche

233:11

right and it probably doesn't

233:13

necessarily fit immediately in the same

233:16

model literally checkpoint as uh

233:19

something that can do 4K you know uh 30

233:22

second videos right like they're

233:23

probably not like trainable in the same

233:26

quite way, right? Like, so I I don't

233:28

know. It depends on how how far into the

233:30

future you look like. Sure, in five

233:31

years from now, will they all be the

233:33

same model? Probably. Uh, but like, you

233:36

know, six months from now, we'll we'll

233:38

probably still have, you know, multiple

233:39

different models doing different things

233:41

because kind of from pragmatically the

233:43

trade-offs are such that we we should

233:46

have multiple different kinds of models.

233:47

>> Yeah, I

233:48

>> I think that's right. And and just on

233:49

that note, I mean, we did call it Gemini

233:52

Omni because we wanted to hint at the

233:55

future where Gemini just becomes fully

233:57

multimodal in and out, right? And so, so

233:59

it's definitely a move in that

234:00

direction. I think we'll probably see a

234:02

move in the direction where Omni also

234:04

generates images and edits images and

234:06

all those kinds of things. But Doo is

234:07

right that I think on the way there,

234:10

there's a bunch of really really useful

234:11

applications of some of these more

234:13

specialized models. And so we we will

234:16

probably continue to work on those as

234:17

well because like that serves a certain

234:19

need at this point in time that may not

234:21

exist you know a year from now. There's

234:22

also like a research question about like

234:24

just how much transfer there is between

234:27

different kinds of modalities, right? I

234:28

think you may believe that there's some

234:31

transfer between coding and video

234:33

generation and I think most people don't

234:35

necessarily believe that but they you

234:37

know you could try to think that there

234:39

is some some there or it could be a

234:41

waste right to put them together to try

234:43

to learn these both tasks at the same

234:44

time right so I think it's it's it's

234:46

interesting sort of question to which

234:47

extent like image and video obviously

234:49

kind of there's some transfer like kind

234:51

of not that different there's value in

234:53

in learning to output video and audio at

234:55

the same time because joint

234:56

audiovisual is you know that's how

234:58

that's how it is. Um and then there's

235:01

you know other kind of intersections of

235:02

modalities that are not super obvious

235:04

right like 3D representation and coding

235:06

I don't know maybe uh things like that

235:08

right so like I think it's worth sort of

235:10

exploring the different corners there

235:11

and we are actively doing that um with a

235:14

focus towards like what people actually

235:16

want to do with these models

235:17

>> yeah um what one thing I feel I feel

235:20

like uh I'm surprised by but also I feel

235:24

like it's insufficiently answered is

235:26

what is the correct intermediate

235:28

representation Um, so captioning, right?

235:32

xAI does captioning. Omni does

235:34

captioning. Um, and I I I understand how

235:38

captioning works for images. Um, and I

235:41

understand that you can extend it into

235:43

to video and and sort of guide it across

235:45

time. It just feels very inefficient. It

235:48

there's got to be I feel like there

235:49

should be something better. Uh maybe

235:51

it's code and maybe we generate you know

235:54

and obviously I think a lot of um ffmpeg

235:58

and matplotlib um what's the 3Blue1Brown

236:00

one, Manim um a lot of like video

236:03

is generated through code and maybe

236:05

that's like the optimal representation

236:07

uh any hypothesis as to like is is it

236:11

better or is just English all you need
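
To make the "video generated through code" idea concrete, here is a minimal dependency-free sketch. The `Scene` fields and the `render` function are hypothetical names invented for this illustration. It does not use ffmpeg, matplotlib, or Manim, and it says nothing about how Omni or any other model represents video internally. It only shows that a handful of parameters can expand deterministically into many frames of pixels.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Scene:
    """A compact, code-level description of a clip (all fields are illustrative)."""
    width: int = 32
    height: int = 16
    frames: int = 24
    box: int = 4      # side length of the moving square, in pixels
    speed: int = 1    # pixels moved to the right per frame

def render(scene):
    """Expand the scene description into a list of frames (rows of 0/1 pixels)."""
    video = []
    top = (scene.height - scene.box) // 2
    for t in range(scene.frames):
        # Wrap around so the square always stays fully inside the frame.
        left = (t * scene.speed) % (scene.width - scene.box + 1)
        frame = [[0] * scene.width for _ in range(scene.height)]
        for y in range(top, top + scene.box):
            for x in range(left, left + scene.box):
                frame[y][x] = 1
        video.append(frame)
    return video

scene = Scene()
video = render(scene)
pixels = sum(len(row) for frame in video for row in frame)
print(f"{len(video)} frames, {pixels} pixels from {len(fields(scene))} parameters")
```

The trade-off raised in the question shows up directly: the program is tiny and exactly editable (change `speed` and every frame updates consistently), but it can only express what the renderer knows how to draw.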

236:13

>> well as so I'm in the Gemini and you

236:16

know we do like a lot of RL agent and of

236:18

course kind of coding so yeah We we're

236:20

definitely exploring the coding

236:21

representations.

236:22

>> Yeah.

236:23

>> As kind of better kind of way to

236:25

represent. Yeah.

236:25

>> But you know like do you what's your

236:28

probability estimate on like we just

236:30

output binaries like we just you know

236:32

like just it's just ones and zeros.

236:35

>> Um I I guess maybe a kind of similar

236:39

discussion was like um basically is the

236:43

language the right representation like

236:45

right. So uh one kind of question for

236:47

example uh professor you know like ask

236:50

is like you know why why does the

236:52

chain of thought need to be in the

236:54

natural language?

236:54

>> Yes.

236:55

>> Can it just be the kind of any kind of

236:56

like continuous tokens just any amount

236:59

of you know additional computations. Um

237:02

so one is like obviously the test like

237:05

adaptive compute is going to give like

237:07

you know better results. So it's sad but

237:09

what really kind of made chain of thought so

237:11

you know like four years ago I wrote you

237:13

know the Large Language Models are Zero-Shot Reasoners and

237:15

then self-improvement. So I kind of know

237:17

from very early day but the reason like

237:19

it works really well is um right now the

237:23

recipe that works is the pre-training

237:25

that scales a lot and then that

237:27

basically like learns a lot of

237:28

intelligence. there are a lot of you

237:30

know scaling RL but those are still like

237:32

extremely kind of compute-intensive to

237:35

extract the information and um you

237:38

really want to rely the intelligence on

237:40

that so basically by tying the sort of

237:43

like a reasoning in the natural language

237:45

you basically directly use the

237:46

intelligence of the pre-training to it

237:48

while if you remove that kind of

237:50

constraints then you're not um and these

237:53

days uh I feel the a lot of advancements

237:57

in the texts but also in kind of

237:59

multimodal space is really driven by

238:01

this u kind of text as a kind of great

238:05

uh sort of representation.

238:06

>> Yeah, it's a good backbone.

238:08

>> Yeah,

238:09

>> I think to me it's even simpler than

238:11

that. It's text is is how we

238:13

communicate. So I think fundamentally if

238:14

you're building kind of products that

238:16

humans will be interfacing with um like

238:19

like that we will be using text somehow

238:22

if it's a text interface, right? Not not

238:24

for everything. So I think it's it's

238:25

natural to default to that. Yeah,

238:28

obviously there's like a confus

238:29

discussion. You know, some RL, like RL

238:31

maximalist is like, oh, we don't care

238:33

about, you know, kind of chain of thought, those

238:34

kind of like stuff. It's just just

238:36

additional compute.

238:37

>> Sure.

238:37

>> But I personally Yeah.

238:39

>> RL maximalists. I wonder I wonder who

238:42

who qualifies in that description. David

238:44

Silver.

238:45

>> Ah, okay. Yeah. I mean, they they've

238:47

just left to to start their thing. Um,

238:51

interesting. Okay. So, uh I I mean I I

238:54

think I'm very interested in just like

238:55

better representations because I think

238:56

that's one of our themes that we're

238:58

curating today uh at the Worlds Fair is

239:00

world models. You mentioned the word

239:02

world models, but it's not something

239:03

that's like super well-defined. I

239:05

think everyone's like sort of converging

239:06

on some version of it that it's like the

239:09

ideal.

239:10

>> Sure. Everything is a world model now.

239:12

It's a sort of

239:13

>> it's not it's not that useful, right?

239:15

>> So, I just gave a keynote at the

239:17

ICLR world model workshop. Yeah. And

239:19

then uh yeah essentially uh I definitely

239:21

encourage to check out the definition by

239:23

Jitendra Malik. He's like the you know

239:25

OG computer vision professor UC

239:27

Berkeley. uh he has pretty you know bit

239:29

of word to say about world model but

239:31

also kind of Schmidhuber's kind of how he

239:32

defined the world model from 2019 like

239:36

1990 sort of uh uh you know like Wayne

239:39

was just basically just that kind of

239:40

model-based, uh, for me the world model is

239:42

basically just a model in the model

239:43

based RL and I feel that has sufficient

239:45

to describe but obviously you know there

239:47

are like a lot of uh Fei-Fei had a kind of

239:49

nice blog post about what about yeah

239:52

this kind of broken down

239:54

>> um but yeah

239:55

>> yeah I mean so you

239:58

I I'll I'll end this part of the

240:00

conversation, but like I I do think that

240:03

language to me relying on language as

240:05

like the sort of like the narrow pipe

240:06

through which everything goes through.

240:08

Um still is like a lossy compression.

240:10

>> No, no, no. But we're not saying that,

240:12

right? We're basically saying the video

240:13

model and the language together.

240:15

>> So, so I think the language alone is uh

240:18

not sufficient. That's why we feel like

240:20

the video is a very comprehening

240:26

kind of pretty videos but I think our

240:29

vision it's it's much more than that.

240:30

It's a missing foundational model that's

240:33

absolutely required if you want to make

240:34

the AGI that match the humans not just

240:36

the jagged one.

240:37

>> Yeah. Um okay so one one other thing you

240:41

know you you mentioned on the vision

240:42

side um and I'm kind of curious how sort

240:46

of uh parallel you know in terms of your

240:49

research careers um this development is

240:52

like I think basically a lot of vision

240:54

people have crossed over into world model

240:56

people um a lot of vision people also

240:59

become generative video and image people

241:02

and is it just as simple as you know

241:05

reversing uh image to text and then now

241:08

it's text to image like is

241:11

is that if I mean that effectively was

241:13

the diffusion process. Um

241:16

I I just I you know I I just see the

241:18

career paths of the people that I talk

241:20

to and and see and I I I see this

241:22

overall trend of research directions and

241:25

I just wanted you to guys to sort of

241:27

reflect on on that.

241:28

>> I mean I certainly went that way right I

241:30

started long time ago uh doing computer

241:33

vision sort of object detection

241:35

recognition things like that. Uh I think

241:37

just that's just simpler problem right

241:39

just generation is just harder like it's

241:41

a it's a different kind of mapping right

241:42

you map from the the inverse mapping is

241:45

not as simple as just inverting the the

241:47

kind of rotations right it's it's a it's

241:49

it's more ambiguous right to go from cat

241:51

to image of a cat and in some ways it's

241:53

also a loop because your vision work

241:56

creates the synthetic labels that then

241:58

continues

241:59

>> I mean sure I don't know I don't know

242:01

I'm trying to validate my my sort of

242:03

theories about how fields develop how

242:05

how careers progress through this

242:07

>> I mean for like the the the better the

242:10

understanding side gets like we have

242:12

seen that the generation side also gets

242:14

better right so like like

242:16

>> it's completely bootstrapping yeah it's

242:18

>> and so so like like like there's

242:19

definitely a there there to that thesis

242:21

and I think yeah I think a lot of people

242:23

have kind of like I I definitely worked

242:24

with a lot of um image understanding

242:26

people who became image generation

242:28

people you know and then some of them

242:29

have moved on to video because it's kind

242:31

of like the next thing where you have so

242:32

many more dimensions to work with so

242:34

yeah I'm curious about you specific as

242:36

your

242:37

>> so I definitely like recommend start

242:39

with understanding recognition because

242:40

that's basically discriminator and then

242:42

that's going to lead to better

242:43

generation and that's what the bridge is

242:45

basically reinforcement learning so my

242:47

um my kind of journey is I initially

242:49

kind of worked on the algorithmic

242:50

research in the generative models, GANs, some

242:52

like you know MNIST kind of generation

242:55

and then I worked on like RL and

242:57

robotics um and then like six years ago

242:59

I was like leading like a moonshot on

243:01

the dexterity it was pretty early but I

243:03

see now everyone's kind of doing

243:05

uh four years ago I basically kind of

243:07

figured out that this like symbolic AGI

243:10

is going to accelerate much faster than

243:12

the kind of physical AGI kind of

243:14

counterpart. So uh I decided to kind of

243:16

like language models and then those

243:18

things. Um and then recently kind of

243:20

work with Doomi and then like omni team

243:22

I quite enjoy kind of collaboration

243:24

there. the what I quite enjoy uh what I

243:28

recommend definitely to the researcher

243:29

is to uh definitely kind of explore or

243:32

at least like get exposure to what the

243:35

top people in each of the community are

243:36

like looking at how they kind of think

243:38

about problems. So when I look at the

243:40

video model to me it kind of reminds me

243:42

like pretty early on sort of like

243:45

language model where like very early

243:47

language model was a kind of creative

243:49

sort of demo right you kind of like try

243:51

to write like a story like novel and

243:53

then like you know GPT-2 and then those

243:55

kind of days like LSTM kind of days

243:57

right and then you know uh instruction

244:00

tuning you actually kind of make it

244:01

usable as a chatbot but then at the

244:03

chatbot stage it still had so much

244:05

hallucinations and instruction following

244:07

wasn't good enough so it couldn't use

244:09

for reasoning and when it got good

244:11

enough um in pre-training and post-training

244:13

for reasoning then you know

244:15

this kind of test time scaling the RL

244:17

really took off to like many of the kind

244:19

of best performing models and right now

244:20

I think the video model is as we

244:23

mentioned it's it is a complementary

244:24

foundational model and I can imagine

244:26

it's going to follow a similar path it's

244:28

going to be very uh it's going to

244:30

improve a lot instruction following a

244:32

lot of uh this it's going to improve a

244:33

lot in reducing hallucinations to the extent

244:35

that it become a very reliable world

244:37

model so we can kind of like intermixed

244:40

video like space-time simulation with a

244:42

text simulation to solve like arbitrary

244:44

AI problems. Also like I think the

244:46

difference still is between sort of text

244:48

models and like image video models is

244:50

that like we haven't quite unified

244:52

understanding and generation in in

244:54

multimedia I'd say yet like I mean I

244:56

think I think without going to the

244:57

details of course there's like it

244:59

depends on on at which level you're

245:01

thinking about this but generally like

245:03

there's not that many as far as I know

245:05

models sort kind of you know frontier

245:08

models that are genuinely

245:11

kind of good at both understanding and

245:13

generation of of let's say videos, right?

245:16

Like it's a it's a it's an interesting

245:18

challenge. I'm not saying that we should

245:19

do this. Uh but but I think uh it kind

245:23

of stands to reason that like you know

245:24

understanding and generation are two

245:26

sides of the same coin. So they they

245:27

kind of should be in the same model in

245:29

some ways. Uh but we don't necessarily

245:30

always do that. So yeah. Uh you

245:33

mentioned audio as well, right? Yeah. Uh

245:36

is that as hard as video or

245:40

qualitatively different? If if so, in

245:43

what way? Uh one of the interesting

245:46

directions three years ago was people

245:48

using um I guess diffusion to do audio

245:53

uh as in like the the sort of Riffusion

245:56

approach. I don't know if you you guys

245:57

saw that. Um and I just think it's like

246:00

very interesting if a modality that we

246:03

perceive which is audio is different

246:05

than video actually to machines is

246:07

exactly the same like there's they see

246:09

no difference.

246:11

I mean, I think on a technical level

246:13

there are some differences, but I think

246:15

they're like relatively minor. I think

246:16

from my perspective, audio came into

246:19

into my life when we shipped Veo 3, which

246:22

was, I believe, the first model that did

246:24

like a joint

246:25

>> with the slicing of the

246:26

>> Yeah. Yeah. Gold bars or whatever. Um,

246:29

it it was the first model that is sort

246:31

of joint audiovisual generation. Yes.

246:33

uh like in a in a I mean there are there

246:35

were other models that did kind of you

246:36

know kind of kind of agentic hacking

246:38

under the hood but this one was truly

246:40

sort of you know generating everything

246:42

at once and we the reason we did that is

246:45

because we felt and I think it was the

246:47

right choice we felt that like uh it

246:50

only makes sense to generate them at the

246:52

same time because there sort of kind of

246:54

like from a machine learning perspective

246:55

there's one latent kind of you know

246:56

causal kind of you know generative

246:58

process right like there's something

246:59

that generates you speaking it's not the

247:02

pixels and then the the audio or somehow

247:04

somehow generated by some other process

247:05

like the lips have to move in sync with

247:07

with the with the audio, right? So, I

247:09

think that that solved a lot of the

247:11

issues that previous models had or the

247:12

way that people did video generation

247:13

before where it was like, okay, we

247:15

generate pixels and then we're going to

247:17

hack something on top of it that like

247:18

moves the lips with the audio that we

247:21

generate and that's was very bad.
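
The single-latent-process argument can be sketched with a toy simulation. Everything here is an assumption made for illustration: a per-frame `speaking` bit plays the role of the shared latent, the 5% flip rate is arbitrary, and sampling the two modalities fully independently is a deliberately crude stand-in for pipelines that generate pixels and audio separately. It is not a description of how Veo 3 or any earlier system actually works.

```python
import random

def noisy(bit, rng, flip=0.05):
    """Return the bit, flipped with probability `flip` (stands in for rendering noise)."""
    return bit != (rng.random() < flip)

def joint_clip(n, rng):
    """Lips and audio are both rendered from one shared per-frame latent."""
    lips, audio = [], []
    for _ in range(n):
        speaking = rng.random() < 0.5
        lips.append(noisy(speaking, rng))    # lips open?
        audio.append(noisy(speaking, rng))   # audio loud?
    return lips, audio

def separate_clip(n, rng):
    """Lips and audio each come from their own unrelated latent."""
    lips = [noisy(rng.random() < 0.5, rng) for _ in range(n)]
    audio = [noisy(rng.random() < 0.5, rng) for _ in range(n)]
    return lips, audio

def sync(lips, audio):
    """Fraction of frames where open lips coincide with loud audio, or closed with quiet."""
    return sum(l == a for l, a in zip(lips, audio)) / len(lips)

rng = random.Random(0)
joint_sync = sync(*joint_clip(10000, rng))
separate_sync = sync(*separate_clip(10000, rng))
print(f"shared latent: {joint_sync:.2f}, separate latents: {separate_sync:.2f}")
```

With a shared latent the two streams agree on most frames (about 90% in expectation with these numbers), while separately sampled streams agree only at chance level, so any lip sync would have to be patched in afterwards.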

247:23

And so I think I think that was that's

247:25

to me that's the the I mean after Veo 3

247:28

like you know people were like what do

247:29

you mean like there's no audio in your

247:31

model? like that makes no sense like

247:32

once it's there like you you have to

247:34

have it. So I think that was that was

247:35

the right choice and doing it in one

247:37

single generative model I think was was

247:39

the right choice.

247:41

>> One thing I kind of want to also kind of

247:42

ask you guys an opinion as well once one

247:44

difference I find the audio and then

247:46

against the image and video is like the

247:48

audio information is less verbalized. I

247:51

mean of course the TTS and stuff is

247:52

trivial right but the when you get her

247:55

outside like how to describe music how

247:57

do you describe this like this person's

248:00

tone kind of pitch I feel the sort of

248:02

the verbalization is in insufficient and

248:05

the interesting thing is that you kind

248:06

of see that in two other things like

248:08

taste, taste, scent and also uh say um

248:13

touch

248:13

>> like smell and then the another

248:15

interesting thing is the skin color so

248:18

skin color, the the language is pretty

248:20

limited to describe the skin color and

248:23

the reason is that we're extremely uh

248:24

sensitive to the small difference

248:26

perturbations of a skin color because

248:28

that basically shows us is this person

248:30

going to kill me or is can I befriend

248:32

this person kind of those kind of

248:33

information and then I feel the smell

248:35

tastes um skin color and like sound kind

248:38

of stuff is very very tied into

248:41

primitive it would like survival kind of

248:44

stuff and so our sort of sensory system

248:46

is so sensitive that it's intractable to

248:50

um so for example I asked like one the

248:53

wine sort of taster and like

248:55

professional and then he basically said

248:56

he kind of use like a language from like

248:58

a dating you know describing like a you

249:00

know partner as a way to describe the

249:02

taste because there's no sufficient

249:05

vocab to describe um so I'm kind of

249:08

curious yeah do you guys feel that

249:10

>> I think well to some extent I think the

249:13

same is true for visual information

249:16

right when you think about like a

249:18

certain style or a certain aesthetic,

249:20

right? Like like there are some people

249:22

who just have a much more kind of

249:24

developed like whether it's palate or

249:26

kind of visual taste and aesthetic,

249:28

right? Like I I think

249:30

>> language just tends to be a bit of a

249:32

limiting factor when you are trying to

249:33

describe any of these things that like

249:35

we experience with sensory information.

249:38

And to your point earlier, I think that

249:41

is the kind of the reason why we are

249:43

investing in world models and why we are

249:45

pushing on kind of the like perception

249:46

and like generation side of things

249:48

because it it is such a large part of

249:51

how we as humans navigate the world.

249:54

It's a large part of how like embodied

249:57

AI navigates the world. Um, and and I do

250:00

I do think language like does have a lot

250:02

of it's it's gotten us very far and it

250:04

can probably get us really far, but it

250:06

it feels limiting in a lot of these kind

250:08

of areas. And yeah, I don't I don't

250:09

really know how to describe, you know,

250:11

like scent and taste. Um, but yeah, I'm

250:14

curious, Dumi.

250:15

>> Uh, I I yeah, I don't know that I have

250:18

thought that deeply about this yet. So,

250:20

uh, yeah, I mean, yeah, I don't have a

250:23

good answer about audio. I mean like I

250:27

don't know the limit because I'm

250:28

thinking about like well what is what is

250:30

Omni bad at in terms of audio but

250:31

they're all like solvable problems I

250:33

find uh so like with more data or better

250:36

data or whatever it is so I don't know

250:38

like that we have pushed the frontier so

250:40

much that like we have hit some sort

250:43

of limits that are rooted in

250:45

evolutionary uh kind of you know limits

250:48

imposed by humans I don't know he's

250:51

feeling the limits of captioning which

250:52

is the the thing I was

250:53

>> yeah exactly there there's There's a lot

250:55

of information in the world and it

250:56

connects to basically why we do world

250:58

modeling.

250:59

>> You mentioned

250:59

>> you just need SRFS sref76

251:02

and then that's your what does right? I

251:04

guess maybe I can't describe this vibe

251:06

but

251:07

>> well well I think that that's kind of

251:08

the point of providing some of these

251:10

references right because because like

251:12

even just describing how someone talks

251:14

and like their tone and like prosody

251:16

and all of these things like I think I

251:18

think some of these terms even like I

251:20

didn't used to know what they mean right

251:22

well now yes disfluencies

251:25

>> ex exact like like there there's kind of

251:27

an entire vocabulary that even if you're

251:29

not kind of steeped in a domain which is

251:31

true for actually like most human

251:32

domains that like you don't even know

251:34

what it means, um, and sometimes

251:36

it's also a question of like if we

251:37

haven't focused on those things you know

251:39

with the large language models that they

251:40

may also have gaps in those areas right

251:42

and then we feel them on the other side

251:43

with generation because we're like

251:45

fundamentally relying on on the language

251:47

model's understanding of the world to

251:49

then be able to like represent it um so

251:51

I think yeah it all kind of goes back to

251:53

your question about like the the

251:54

language as an intermediary but yeah I

251:57

think to do like some of these might

251:58

just be like focus areas and things that

252:01

we haven't necessarily pushed on as much

252:03

as we can, and like, as well, we will

252:05

discover what the actual ceiling is.

252:08

>> Yeah, as a podcaster, I think a lot

252:11

about sound.

252:12

>> Um, and and I I'll just offer a couple

252:14

things for discussion in case in case it

252:16

triggers anything with you guys. Um, I

252:18

have three domains of rough audio, which

252:20

is like music, voice, SFX, you know, is

252:23

that rough? Okay. Covers everything. And

252:25

then also even within voice, let's just

252:27

let's just focus on voice. Forget the

252:28

other two. um room sound like the the

252:31

echoiness of like big room, small room,

252:34

in person, in a car, over a phone, all

252:37

these like are labelable, but we

252:39

experience them very differently. And I

252:41

I often think like one of the tells of

252:43

the AI video is that it is studio

252:45

quality because it was recorded in a

252:47

studio because that's your training

252:48

data. And like and and to me that's one

252:51

thing actually like the most interesting

252:53

thing is just uh when I tell this is how

252:55

I convince people who are kind of

252:57

skeptical about the need for world

252:58

models because you need it even for

253:00

audio about well I'm further away from

253:03

you so I should sound a little bit

253:05

softer or more diffused and like the the

253:07

video models need to pick that up

253:09

because if they're going to do immersive

253:10

video and audio you need that. I I I

253:14

love that example of basically like

253:16

studio quality or not in a way like we

253:18

don't have enough language to really

253:20

describe like like this kind of echoing

253:22

or like some kind of noise kind of

253:24

happening. we just like don't have

253:26

precise enough and uh if you um you know

253:28

basically the reason that I think it's

253:30

quite important to have like relatively

253:32

information rich like kind of captioning

253:34

is that we kind of rely on the natural

253:36

language as a representation but if you

253:38

basically don't have enough uh

253:39

representation that basically means the

253:41

condition on the language the generation

253:43

is very multimodal, and if there's anything

253:45

we can learn from the VAE, kind of like, you

253:47

know, very old, you know, VAE kind of

253:48

research, the idea is we really want to

253:50

capture most of the stochasticity in the

253:53

latent representation. And then the X

253:56

given the Z should be kind of like

253:57

deterministic. So
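
[Editor's sketch] The captioning argument above can be made concrete with a toy calculation. This is a minimal sketch under invented assumptions, not anything shown on the panel: the clips, the labels, and the function name `conditional_entropy` are hypothetical. It measures H(X | caption) in bits: when the caption omits room acoustics, several outcomes remain consistent with it (a multimodal conditional), and when the caption verbalizes them, the outcome given the caption is close to deterministic.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(pairs):
    """Return H(X | C) in bits for a list of (caption, outcome) pairs."""
    by_caption = defaultdict(Counter)
    for caption, outcome in pairs:
        by_caption[caption][outcome] += 1
    total = len(pairs)
    entropy = 0.0
    for outcomes in by_caption.values():
        n = sum(outcomes.values())
        for count in outcomes.values():
            p = count / n
            # Weight each caption's entropy by how often that caption occurs.
            entropy -= (n / total) * p * math.log2(p)
    return entropy

# Toy clips: (speaker description, room acoustics). Purely illustrative data.
clips = [
    ("woman speaking", "studio"),
    ("woman speaking", "car"),
    ("woman speaking", "large hall"),
    ("woman speaking", "phone line"),
]

# Coarse caption: the acoustics are never verbalized.
coarse = [(speaker, room) for speaker, room in clips]
# Rich caption: the acoustics are part of the caption.
rich = [(f"{speaker}, {room}", room) for speaker, room in clips]

coarse_bits = conditional_entropy(coarse)  # uncertainty left over four rooms
rich_bits = conditional_entropy(rich)      # caption pins the room down
```

In this toy setup the coarse caption leaves all four rooms equally likely, while the rich caption leaves nothing to guess; real captions sit somewhere in between.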

253:59

>> yeah. Yeah. Um well I hope I hope

254:01

there's more uh progress there and I'm

254:03

sure you guys are doing

254:04

>> even actually like facial expressions,

254:06

right? And maybe this gets to your point

254:07

about like things that we're very

254:08

sensitive to, right? I think you can

254:10

tell a lot of AI content also just by

254:13

from like people's facial expressions.

254:18

>> Yes. and we try not to contribute to it,

254:20

but you know, um, and or or like skin

254:23

textures, right? Like like the things

254:25

that kind of make things look real in

254:27

real life. Like I, you know, I can tell

254:29

from the way you're nodding or from the

254:30

way like your micro expressions are kind

254:32

of changing of like how you're reacting

254:33

to what I'm saying. Like we haven't

254:35

quite crossed that chasm, I think like

254:38

we're we're so much better than we were

254:40

a year ago.

254:40

>> Yeah. Um, but there's so much more

254:42

headroom kind of in a lot of those

254:44

things that like we as humans are super

254:46

sensitive to. And like I think image

254:48

arguably probably is there because

254:50

there's there's a lot of kind of images

254:52

that I will see that like really do look

254:54

indistinguishable from reality and I

254:56

can't tell if they're generated or not.

254:58

>> They're better than reality.

254:59

>> Um, or well that's a different

255:02

>> No, I I think that one of the parents

255:04

>> better than what I would take on my

255:05

vacation as a photo. Yes. One of the one

255:07

of the fun experiments that we did a

255:09

while ago in the team is is like can we

255:11

generate videos that are better than

255:12

than real videos, right? So you just

255:14

take the same caption from like oh yeah

255:16

some video and then

255:18

>> try it. Yeah. Just just try to like

255:20

describe a real video and then generate

255:22

the equivalent version with omni and

255:24

then do a human eval how does it do and

255:27

then humans largely prefer AI generated

255:29

>> video

255:31

margin

255:32

>> but because it's because it's the RL

255:34

process. That's the process working.

255:36

>> It's however you want to rationalize it.

255:37

It's not necessarily the RL process.

255:38

It's just like I think it's just

255:40

>> I'm not saying this is a good result.

255:42

I'm just saying is we have optimized in

255:44

a way that like kind of potentially sort

255:47

of you know triggers something in the

255:49

human brain that like oh it it looks it

255:50

looks all a lot of the videos just look

255:53

look better like I'm not on on

255:56

inspection on on deeper inspection they

255:58

they would not actually be more useful

256:00

or whatever but like if you just say

256:03

side by side random YouTube video versus

256:06

generated version of it will you will

256:08

just have a it will just look better

256:10

because it's more it's a sharper or HDR.

256:12

Uh, you know, the skin tone is is is

256:15

better. It's not, again, it's not more

256:17

realistic.

256:18

>> Uh, it doesn't solve your problem

256:20

necessarily, but it it looks better.

256:22

>> I think it also depends on the sensitivity

256:24

of the people. Uh, I was born and raised in

256:27

Japan and I think one thing I kind of

256:28

know is like they're extremely extremely

256:30

like sensitive about like, you know,

256:32

that's why, you know, like architecture,

256:33

like food and stuff like they have. Um,

256:35

so I talked to like a manga like like

256:38

artist there and he's like he's kind of

256:39

disgusted by like the generative AI and

256:42

one kind of thing he mentioned is like

256:43

the eye gaze. Eye gaze that slight

256:46

difference makes me makes him kind of

256:48

feel creepy about like unnatural

256:51

>> like if you're looking a little bit off.

256:52

>> Yeah. It's just uh Yeah. Just like uh it

256:55

looks too fake. Yeah. So, so I think it

256:58

does depend on the sensitivity and

256:59

>> Yeah. Yeah. Yeah. All I'm saying is

257:01

like, you know, human preferences are

257:03

like not particularly like uh reliable

257:06

barometer of like what you should be

257:07

optimizing for. Like if you just ask

257:09

people, do you like this or not? You're

257:10

not necessarily going to get what you wanted.

257:13

>> Yeah. Let me just kind of add one thing

257:15

but like four years ago there was a like

257:16

debate that if the prompt engineering is

257:18

going to disappear and uh my my like you

257:21

know some very powerful people say you

257:23

know it's going to disappear but I

257:25

basically said like it shouldn't because

257:27

the prompt engineering like sort of you

257:29

know specifying that is like the the

257:31

only way you can sort of control the

257:33

output sort of you know when you have

257:35

like sort of control over the AI and

257:38

what allows you to prompt engineer is

257:40

really that sensitivity. So sure maybe

257:43

like right now the AI can do a lot of

257:44

auto prompting and that and it can

257:46

generate something that's sufficient but

257:49

uh if it's like that never be satisfied

257:51

like never be satisfied with the AI's

257:53

generated content always fine tune your

257:56

sensitivity and always kind of keep

257:58

prompting the differences. I I think to

258:00

the there's also a big difference

258:02

between like the average human untrained

258:05

eye which I I would put myself in that

258:07

bucket you know like I have I have some

258:09

aesthetic sensibilities and I've done

258:10

this long enough that you know like I

258:12

have I have a preference um but you know

258:15

like your example of a manga artist like

258:17

that's somebody who has honed a craft

258:19

like over possibly many decades um and

258:23

anybody who does that whether it's like

258:25

design architecture right like you you

258:26

you just have a very different level of

258:29

like expertise and you see things that

258:32

like the average human will not see. But

258:33

Dumi is right. Like when we look at if

258:36

you were to just, you know, um poll 10

258:39

people on the street, they would

258:41

probably prefer the like overly smooth

258:44

like very saturated kind of

258:46

>> It's called the Instagram filter.

258:48

>> It is. It is. Yeah.

258:50

>> And you know, and and so there's also a

258:52

little bit of a question of like what

258:54

does your default aesthetic look like if

258:56

you don't specify? But then to Shane's

258:58

point, one of the things we always try

259:00

to get these models better at is

259:01

instruction following, so that like when

259:03

you want to get them to a different

259:05

outcome, like you should be able to,

259:07

whether that's through language or

259:08

whether that's through your references

259:09

because language is sometimes too

259:11

limiting. Um, and so like these models

259:14

continue to get better at it, but they

259:15

so much more.

259:16

>> Do do you feel pressure as a as a

259:18

product director to set the default for

259:20

the world? Like I mean

259:23

>> kind of

259:23

>> maybe I should I don't know. I haven't

259:25

thought about this.

259:26

>> You know, you know, it's like someone

259:28

has to have a default. A default has

259:30

to exist.

259:31

>> Actually, I will say like we have

259:32

thought about this. Um, and I I think

259:35

one of the So, for example, actually

259:37

like if you look at Nano Banana

259:38

generations, we had like an explosion of

259:40

Nano Banana infographics when Nano Banana

259:43

Pro came out.

259:43

>> I tried it. Yeah.

259:44

Um, yeah. Yeah. Yeah. I think NeurIPS

259:46

papers were like all, you know, so so

259:49

many had like infographics generated.

259:51

Can you run your uh watermarking on it

259:52

and see how many?

259:54

>> Uh we probably we probably could. We we

259:56

have we haven't done that, but I saw so

259:57

like my Twitter was maybe this is just

259:59

also like the bias of my algorithm, but

260:01

they were everywhere. Um and it was

260:03

actually very painful because um I think

260:06

our default aesthetic was a little bit

260:08

too it was too cluttered. Like I think

260:10

that the the model was like a bit of an

260:12

overeager student that just like learned

260:15

you know it was like oh I know all these

260:17

like I know all this information about

260:19

this concept. and like shove it into the

260:21

same image. Japanese infographics 5x

260:23

that

260:25

>> or maybe it was you know um but it just

260:29

and and

260:29

>> wait so same prompt same content if it's

260:32

in Japanese it's

260:33

>> density density

260:34

>> oh wow

260:35

>> because that's the style in Japan

260:37

>> yeah some like very you know bureaucrat

260:40

is a famous word for it yeah

260:42

>> no but we do go through this process

260:43

with Omni we did it together right like

260:45

where like we had like a bunch of like

260:46

we like at the very end okay like this

260:48

is we did some tuning and like okay what

260:50

kind of style do we prefer, right,

260:52

like you know

260:53

>> is it more muted more saturated

260:55

>> we had a lot of saturations

260:57

>> yeah there was there was there were I

260:58

think Nicole just has PTSD so has

261:00

forgotten about it but she was very much

261:02

involved in this of like okay which

261:03

which kind of color palette do we

261:05

basically prefer right and it's you know

261:07

it's it's it's not something that like

261:09

you have to make a trade-off there

261:11

like uh

261:12

>> and and and it's because it ends up

261:14

being us right like actually it is true

261:15

like it it ends up being the modeling

261:17

teams and you could ask the question

261:18

legitimately of like are we the best

261:20

people to do that or should we actually

261:22

work with someone who like has a really

261:24

creative point of view and is more of

261:26

like you know an art director and like

261:28

has like and we kind of go back and

261:29

forth on this. Um

261:31

>> we have the trusted testers. I'm on

261:33

>> we do we have trusted testers who give

261:35

us a lot of feedback and we take that

261:37

seriously

261:37

>> very well organized by the way they have

261:38

these like weekly calls and stuff like

261:40

it's it's amazing.

261:41

>> Um Logan's team does a lot of that. So

261:43

kudo kudos to kudos to Logan um who

261:46

couldn't be here today. Um, and we have

261:48

a lot of people actually internally at

261:50

Google like fofr who give us like a

261:52

ton of No, no, no. Truly like who give

261:54

us a ton of feedback on like when we

261:56

when we release new checkpoints and like

261:58

sometimes it will be stuff that we like

262:00

don't see right like we would be like oh

262:02

yeah this optimization seems okay and

262:04

then they would come back what have you

262:06

done like you completely ruined my grass

262:08

you know because now the detail is all

262:09

blurry.

262:09

>> I think you just noticed not not a super

262:11

secret at this point but like that our

262:13

model tends to put rings wedding rings

262:15

on on on hand. That's yeah

262:17

>> very strange. I had never noticed that

262:18

but he's like he I just saw it and

262:20

there's a fofr channel basically

262:23

>> where he posts I was like why is there a

262:24

wedding ring in every hand I'm like this

262:26

is strange

262:26

>> it sounds very common reward hacking

262:28

>> yeah yeah yeah so but you know something

262:29

that we would not have we would not have

262:31

noticed necessarily while while

262:32

developing this right

262:33

>> is it an RL artifact or I I don't know

262:36

>> you you do have like a lot of preference

262:38

based and then you know you may can

262:39

prefer that spurious correlation. Reward

262:42

hacking it can happen like in many weird

262:44

ways yeah
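
[Editor's sketch] One way to catch an artifact like the wedding rings before a sharp-eyed tester does is to compare how often an attribute shows up in generated samples against a reference set. This is a minimal sketch under stated assumptions: the per-sample tags are assumed to come from some upstream tagger that is not shown, and the function name, the thresholds `min_ratio` and `floor`, and the sample data are all hypothetical rather than the team's actual tooling.

```python
def overrepresented_attributes(generated_tags, reference_tags,
                               min_ratio=3.0, floor=0.01):
    """Flag attributes far more frequent in generated samples than in references.

    Each argument is a list of sets, one set of attribute tags per sample.
    Returns {tag: frequency ratio} for tags whose ratio is at least min_ratio.
    """
    def rate(samples, tag):
        return sum(tag in s for s in samples) / len(samples)

    all_tags = set().union(*generated_tags)
    flagged = {}
    for tag in sorted(all_tags):
        gen_rate = rate(generated_tags, tag)
        # The floor avoids division by zero for tags absent from the references.
        ref_rate = max(rate(reference_tags, tag), floor)
        ratio = gen_rate / ref_rate
        if ratio >= min_ratio:
            flagged[tag] = round(ratio, 2)
    return flagged

# Hypothetical tags for 10 generated and 10 reference images of hands.
generated = [{"hand", "wedding ring"}] * 9 + [{"hand"}]
reference = [{"hand", "wedding ring"}] * 2 + [{"hand"}] * 8

flagged = overrepresented_attributes(generated, reference)
```

A check like this only surfaces attributes someone thought to tag, so it complements rather than replaces people who look at outputs every day.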

262:45

>> it does it does uh yeah this is related

262:47

to another topic that again I I try to

262:50

use these mainstage things as

262:52

introductions or tie-ins. Uh we have the

262:54

eval track, we have Character AI and

262:57

YouTube talking about how they evaluate

262:59

videos. Um how do you evaluate videos

263:03

>> apart from fofr

263:05

>> not everyone has a fofr but also you

263:07

know I think there needs to be something

263:08

more quantitative

263:09

>> well I mean it's you improve Gemini

263:13

to improve the evaluation for VR. Yeah.

263:15

Um that's that's no no that's that's

263:18

definitely one way uh it's actually very

263:20

hard.

263:20

>> It's very hard. It's very hard um to get

263:23

like you know autoraters to evaluate

263:25

things in a video like including

263:27

especially things like aesthetics right

263:29

like that it's like there are some

263:31

things that are a little bit more

263:32

objective like especially when we talk

263:34

like let's say we talk about images and

263:35

we look at like infographics text

263:37

rendering that's actually fine right

263:39

because like

263:39

>> you can kind of OCR things out and then

263:42

you can look at like okay this letter is

263:44

like messed up and then the whole thing

263:45

is actually useless because if like

263:46

literally if the letter is off in rendered

263:49

text you just can't use that asset

263:51

Right. So th those things are like a

263:52

little bit more auto ratable um from

263:55

what we found. We do rely a lot on

263:57

humans looking at things and so we do do

264:00

a lot of human evals. We do a lot of

264:02

human evals.
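
[Editor's sketch] The OCR-based check for rendered text described above is one of the easier pieces to automate. This is a minimal sketch under stated assumptions: the OCR step itself is not shown, so `ocr_output` is simply whatever string an OCR system of your choice returned, and the function names and the strict pass rule (any wrong character fails the asset) are illustrative, not the panelists' actual autorater.

```python
def normalize(text):
    """Collapse whitespace and case so layout differences are not counted as errors."""
    return " ".join(text.split()).lower()

def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def rate_rendered_text(expected, ocr_output):
    """Return (passed, character_error_rate) for one generated asset."""
    want, got = normalize(expected), normalize(ocr_output)
    distance = edit_distance(want, got)
    cer = distance / max(len(want), 1)
    # Strict rule: a single wrong letter makes the asset unusable.
    return distance == 0, cer

ok_case = rate_rendered_text("Quarterly Revenue", "quarterly  revenue")
bad_case = rate_rendered_text("Quarterly Revenue", "Quarterly Revenne")
```

The pass/fail bit mirrors the point that one wrong letter ruins the asset, while the character error rate gives a softer signal for tracking progress between checkpoints.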

264:02

>> Do a lot of human evals and every time Shane

264:06

is like um and every time we have a new

264:08

model we like want to do more things and

264:10

we want to like jam in more capabilities

264:12

and then we have like more evals that

264:14

we have to run. Um, and then at some

264:17

point you do get two models that are

264:20

like kind of close to each other and

264:22

then like we literally make decisions

264:24

based on like looking at output side by

264:26

side. Sometimes like in a room like I've

264:30

been in rooms where there's like 10 of

264:31

us and we're just like looking at videos

264:33

side by side and we're like do you

264:35

prefer this or do you prefer that? Like

264:37

>> wow.

264:38

>> I mean but it is it is genuinely very

264:39

complicated. the more capabilities you

264:41

add like you know even just the one

264:42

capability but it's like almost AGI

264:44

complete capabilities like video editing

264:46

right like think about video editing as

264:48

a and like editing with audio and

264:51

>> editor will be very happy to hear this

264:52

editor

264:53

>> I mean it's the hardest problem in gen

264:55

media

264:55

>> I mean I don't know if it's the hardest

264:57

but it's definitely there right like uh

265:00

in terms of like complexity of of

265:02

evaluation like free form video editing

265:06

is you can do anything like yes uh and

265:08

like

265:09

>> I I spent a lot of money on that and it's

265:10

very hard. Please help me

265:12

>> like adding those we don't have like add

265:14

a sloth eval, right? Like uh that we

265:17

>> Well, now we should.

265:17

>> Now we should. Yeah. Yeah. Yeah. But

265:19

like things like that like it's it's

265:20

it's not that easy to

265:22

>> I think I'm just surprised at the sample

265:24

size that you have, right? Like to to

265:26

test the entire surface of your models,

265:29

you still rely on a magnitude of

265:31

hundreds.
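
[Editor's sketch] The sample-size worry can be illustrated with a side-by-side win rate and a confidence interval. This is a minimal sketch, assuming each comparison yields a clear winner (ties are ignored) and using the standard Wilson score interval; the function names and the numbers are illustrative and do not describe how the panelists run their evals.

```python
import math

def wilson_interval(wins, total, z=1.96):
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if total == 0:
        raise ValueError("need at least one comparison")
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

def is_clear_preference(wins, total):
    """True when the whole interval sits on one side of a 50/50 split."""
    low, high = wilson_interval(wins, total)
    return low > 0.5 or high < 0.5

small = is_clear_preference(55, 100)      # 55% win rate over 100 pairs
large = is_clear_preference(5500, 10000)  # same rate over 10,000 pairs
```

With the same 55% win rate, 100 pairs cannot separate two close models from a coin flip, while 10,000 pairs can, which is one reason close calls end up needing far more ratings or a live experiment.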

265:32

>> No, no, no, no. So we do like Yeah.

265:34

Well, we do we do a ton of human evals

265:36

on like on like you know thousands of

265:38

things. Um I I think there's also like

265:40

an element of you know we can talk about

265:42

things like live experiments right like

265:44

which which is also where you get signal

265:46

on like like some of these more minute

265:49

differences at like much larger scale

265:50

then there's autoraters, which is

265:52

definitely kind of a more it's a very

265:54

well defined space I think for LLMs much

265:58

more nascent for media models and then like

266:01

sometimes you still do rely on human

266:03

judgment and we do rely on things like

266:05

feedback from people who just like have

266:08

a very honed like aesthetic and and

266:11

people who just like use these models in

266:13

their workflows day-to-day, right?

266:14

Because we could also like you could

266:16

have a model that does really well on

266:17

some slice of human evals, but then it

266:20

like really breaks a workflow for

266:21

somebody. And so this is why we do like

266:23

early access programs and we try to get

266:24

feedback and then we like try to

266:26

incorporate it before we release

266:27

something more broadly. I feel like

266:28

Shane had a hot take based on his

266:31

>> expression always when we were talking

266:33

about this. every kind of human sort of

266:35

you know work should be gradually kind

266:37

of amortized and then the interesting

266:39

thing is the video understanding

266:41

especially like against like a generated

266:42

video, like detecting errors and stuff, is

266:44

extremely interesting uh vision task

266:47

>> and then like some of it kind of

266:49

aesthetics or this kind of visual

266:50

quality but for some of the kind of

266:52

cases like semantically doesn't make

266:54

sense for example you're taking like

266:56

some like a famous scene from a movie

266:58

and try to sort of um construct that and

267:02

then if you kind of generate uh it can

267:04

generate something that but at some

267:06

point some of the semantic information

267:08

doesn't make sense like it's actually

267:09

inconsistent. So can the AI actually

267:12

detect that? So when I evaluate the AI

267:15

video I was like oh I feel I'm so smart

267:17

you know like that like it's like like

267:19

AI is still kind of behind but we should

267:22

make like a lot of effort. I think the

267:23

video understanding is extremely uh

267:25

important intelligence task uh beyond

267:28

just the pure aesthetics or the

267:29

preference. Um and yeah, we we should

267:33

always try to amortize the human

267:35

>> human label. Yeah.

267:36

>> Yeah.

267:37

>> Um what data do you need? A lot of

267:41

people I talked to wanted to get in

267:44

front of you actually. Uh they I mean

267:47

they want to be nice about it. They have

267:49

a lot of video data. They have gaming

267:50

data. They have real world video data.

267:52

They have images. They have labelers.

267:54

What do you want?

267:57

>> Are you like offering? I'm just like

267:59

this is your request for like okay okay

268:02

we get I'm sure you get a lot of pitches

268:04

right you get a lot of people want to

268:05

talk to you what's like I think actually

268:08

it's the signal this problem this

268:10

sorting out signal from noise is the

268:12

main problem so creating a nice API of

268:15

like okay if you actually do a b and c

268:18

we are interested in that

268:22

>> um

268:23

loaded question there so uh I don't know

268:25

that there's like an easy like you know

268:27

if you do I I think we we do already

268:29

have a lot of data. I think it's it's

268:31

hard to talk about this

268:33

>> in a talk about the public. I don't want

268:35

to get you in trouble.

268:36

>> But like I I think

268:37

>> no what I just want to say is like hard

268:39

to talk about this in a sort of you know

268:41

without trying to without I have to

268:43

think about the what I am revealing

268:45

about our project and where we're

268:47

going. Um generally high quality data I

268:50

think maybe maybe let's just put it this

268:51

way right it's not not the secret

268:53

>> embodied I'm sorry

268:54

>> embodied data I mean

268:57

>> yeah sure I mean we have we have sort of

268:59

announced I think publicly right that we

269:01

we'd have some sort of robotics

269:02

collaboration right like so I think it's

269:04

like or or but you because we have a

269:08

robotics team at GDM so you know they're

269:09

always interested in things like that um

269:12

I mean for Omni specifically I think

269:14

we're just quite interested in just high

269:15

quality data right like you know it it's

269:17

not some sort of not necessarily like oh

269:20

random YouTube video but like you know

269:22

some more professionally shot things

269:24

like that right the things that those

269:25

are those are things that we're always

269:27

on the lookout for like uh and yeah

269:31

>> and I think for you know maybe this is

269:33

easier to some extent to answer for like

269:35

some of the agentic work as well like

269:38

like like actual kind of like what are

269:41

the tasks that people are trying to do

269:42

right these things are actually kind of

269:44

difficult to manufacture if you're doing

269:47

it yourself or if you're like doing it

269:48

with a vendor like what is the actual

269:50

like if you're creating a marketing

269:52

campaign like what does that look like

269:54

right like do do you start from here's

269:57

like a picture of my new product and

269:59

then I want to turn that into a video ad

270:01

and I want to turn that into a bunch of

270:02

assets that like fit fit all these

270:04

different ad formats that I need to push

270:05

onto various platforms to promote and

270:08

then like so you kind of go from this to

270:10

that and like what is that kind of

270:12

trajectory of tasks that you're that

270:14

you're like you know experiencing along

270:16

the way like that is really useful and

270:18

that is actually kind of difficult to

270:20

get right uh because like we don't

270:23

always have the right first party

270:25

surface where people are actually doing

270:27

some of these things or like you might

270:30

work with someone who's a vendor but

270:31

they don't also don't have that product

270:33

surface right like like a lot of this

270:35

kind of information lives in the places

270:36

where people are doing these tasks and

270:38

so that's kind of difficult to get like

270:39

if anyone's figured that out you should

270:41

reach out to us

270:43

>> every channel thought yeah every channel

270:46

thought

270:46

fault.

270:47

>> Yeah.

270:47

>> And maybe the data the Chinese lab is

270:49

using.

270:50

>> Yes. Yeah. uh you know uh yeah as a

270:54

media person myself right like there's

270:56

so many podcasters and people in in

270:58

marketing departments and all these like

271:00

they would be happy to be your data like

271:02

you know just like put a BCI on my head

271:05

>> talk to us watch my things uh because

271:07

like you know there's just endless

271:09

amount of work to do like there's so

271:11

much work and this is all like this

271:13

needs to somewhat be a commodity like

271:15

obviously you can be an art like an

271:17

artisan like you can be Hollywood for

271:19

like the really high quality stuff but

271:21

actually a lot of work is commodity and

271:23

like should be modelable and we want you

271:25

to do it

271:26

>> and but we we want the high quality to

271:28

Dumi's point right like we do want we

271:29

want the high quality

271:30

>> we want commodity yes yes you want on

271:33

both sides

271:34

>> um I I

271:36

>> thank you for the solicitation

271:40

>> uh I you know we we we also I also added

271:42

a data quality track I I think that uh

271:45

people want to understand like what uh

271:47

at AI like how to raise the bar right

271:50

like like the and a lot of it is just

271:53

educating the market and educating

271:55

researchers and engineers and founders

271:56

on like this is where we're going a lot

271:59

of this is stop doing that do this do

272:02

this instead and I'm like people will

272:04

listen

272:05

>> yeah I don't know uh to that extent you

272:08

know

272:08

>> but I think to that to that point like

272:09

there's a lot of again just like craft

272:11

that goes into this right and there's a

272:13

lot of process like you even to the

272:15

marketing campaign example you don't

272:16

create that in like five minutes right

272:18

you like go you go through a process and

272:20

you iterate Great. And you like pick

272:22

something over something else because

272:23

you liked it for whatever reason. Like

272:25

maybe the eye gaze was correct, right?

272:27

Like we just we don't know these things,

272:29

right? Because none of us are marketing

272:31

directors and like the models don't know

272:32

these things.

272:33

>> I even kind of say this for the natural

272:35

like a language as well. Like I I always

272:37

kind of say 99% of information is inside

272:39

people.

272:40

>> You can only extract it through active

272:42

dialogue and befriending them. So most

272:44

of the stuff on the internet is like

272:46

sort of the outcome the output of that.

272:48

Yes. about you know what are what are

272:50

all the trajectories you know how did

272:51

this person have this inspiration to

272:53

write this paper

272:54

>> what is the starting point what is the

272:55

inspiration what are the dialogue that

272:56

sparked it those kind of stuff is kind

272:58

of inside people so even you know those

273:00

kind of like even the language space is

273:02

kind of that I think the creative is

273:04

kind of similar as well there's a lot of

273:05

dark knowledge

273:06

>> yeah it's like when you write a novel

273:07

right like a novel speaks to you because

273:09

like usually there's some sort of like a

273:11

personal connection that you feel to

273:13

like the story or the trajectory or the

273:15

characters right like if you read most

273:17

of the stuff that's written by LLMs

273:19

today. Like it's, you know, it's it's it

273:21

starts it falls into these like default

273:23

patterns and like the language

273:25

starts to feel really similar and all

273:27

the descriptions sound really similar.

273:28

You can kind of like quickly read it as

273:30

like, oh, this is not that interesting

273:32

because like I can't connect to it,

273:34

right? Um, and again, that's that's kind

273:36

of like a human expertise.

273:38

>> One nice thing recently is the Google

273:40

Cloud and Google DeepMind are kind

273:42

of starting to invest a lot more in the

273:43

FDEs for the product engineers. And I

273:46

also kind of saw some uh recruiting for

273:47

the creative, you know, GenMedia kind of

273:49

space as well. So I think those are kind

273:52

of really the effort because we we kind

273:53

of feel that you know what we can kind

273:54

of do with a lot of public data there's

273:56

limits but we're you know partnering

273:58

with that we can provide kind of better

274:00

models and products and yeah kind of

274:02

feedback

274:02

>> uh we have an FDE track here for the

274:04

first time every lab is announcing it.

274:06

It's it's crazy. Um, one thing I'm

274:09

actually very keen on doing and I pushed

274:11

a push for this at cognition as well is

274:13

to turn the FDEs not just into sales

274:16

and solutions but also into, uh, eval

274:20

workers.

274:20

>> FDE is not sales. FDE is way, way bigger

274:23

than that. How do you frame FDEs then?

274:27

Because I do think about it as sales

274:28

like you're you know the more the more

274:31

you customize the solution for

274:32

>> so I define post training as anything

274:35

between the pre-training and the final

274:37

user experience anything anything is a

274:40

post training

274:40

>> and to me when I first sort of you know

274:42

learned a lot about, I mean, FDE kind of, I

274:44

guess originally you know came from like

274:46

Palantir and then that, so I guess the

274:48

kind of history is different but yeah I

274:50

think the key is really that um you know

274:52

the key is like not only to kind of work

274:54

uh with them and ensure that they kind

274:56

of know how to

274:57

but also to sort of code like derive

275:00

kind of insights that can basically kind

275:02

of help both parties. They can put the

275:04

like a lot of harness how they use the

275:05

model. We can improve like very

275:07

upstream. So how to get the customer

275:09

feedback to the modeling I feel is the

275:12

kind of more the the role I I kind of

275:14

want for the FDEs. Yeah.

275:15

>> Yeah. Yeah. and and even if I sorry just

275:17

on that like if you want to talk to us

275:19

or at least me um I I'm not going to

275:22

offer up your time um but I it's really

275:25

helpful for us to actually talk to

275:27

people who are using our models and like

275:28

understand where they're struggling uh

275:30

because again that just like it's it's

275:32

the real world task that you're actually

275:34

trying to use them for right like I will

275:36

talk to people who do kind of interior

275:39

design with some of our

275:41

image models um you know and they will

275:43

say hey like I really want to take this

275:45

pattern, but then I want to

275:47

scale it across like 10 different rug

275:49

sizes and sometimes I have like a very

275:51

custom rug size and then the model

275:52

fails at like replicating the pattern

275:54

the same way. Or, you know, I want to do

275:57

a try on for these earrings and then the

275:59

earrings have a certain size and then

276:01

like my head has a certain size, right?

276:02

Like it has to make sense if you're

276:04

actually trying to try things on and

276:06

like the models kind of fail at a bunch

276:08

of these things that like actually

276:10

happen in the real world, right? Um, and

276:12

so that that's like useful for us

276:14

because for some of these things like we

276:15

don't think about because we don't you

276:17

know we don't use the models for those

276:18

tasks

276:19

>> or like um you know I think to your

276:21

point about ad campaigns or whatever

276:22

like people have like notions of brand

276:24

languages or whatever like which is

276:25

>> yes

276:25

>> like a a bunch of images or PDFs saying

276:29

things you know it's a pretty kind of

276:31

you know ambiguous question as well.

276:33

What is the IKEA brand language? You

276:35

know is it is it blue and yellow? I mean

276:37

that's that's not a very like

276:38

>> but like what shade of blue you know.

276:39

>> Yeah. Yeah. Yeah. So there there's like,

276:40

you know, and the brands are pretty

276:42

specific, you know, pretty, you know,

276:43

like they they do care about the shade

276:44

of blue. It's not shouldn't just be a

276:46

random blue and a random yellow. That's

276:47

not going to be IKEA, right? I'm just

276:49

thinking about an example. But like this

276:50

is the kind of stuff that, you know,

276:51

it's not necessarily part of our like,

276:53

you know, developing frontier models

276:55

kind of, you know, necessarily mandate,

276:56

but it's something that we do want to we

276:58

do want to fundamentally like build

277:00

products that people will use to solve

277:02

concrete tasks, not just not just

277:04

research artifacts, right? So I think

277:06

it's useful to understand what people do

277:07

care about. Uh well, I'm sure a lot of

277:10

people are very grateful for your work

277:12

and there's a lot more to do that you've

277:14

made so much progress over the last like

277:16

even just a couple years of like Nano

277:18

Banana and Veo and Omni and uh I don't

277:21

know what else you got cooking but we're

277:23

very excited like you this is one of

277:25

those things where like I was very

277:26

disappointed you know when with when

277:28

Sora shut down and and I think like

277:30

there needs to be more general

277:32

exploration of uh you know generative

277:34

models and not and not just you know

277:36

coding. I think I think that is

277:38

>> we obviously like this.

277:40

>> We love coding. Love coding and and uh

277:42

yes. Uh but thank you so much for your

277:44

time. Uh it's been a real pleasure and I

277:46

can't wait to see what this looks like

277:47

next.

277:47

>> Thank you for having us. Great question.

277:49

>> Thank you everyone.

277:57

Let me explain. So within my second

277:59

brain, I currently have over 5,000 notes

278:01

in Obsidian and another 5,000 notes in

278:04

Readwise and some scattered in Notion

278:07

and Google Drive. And all of this is

278:08

growing on average with 250 files per

278:11

month. And this is what I want. On the

278:14

left, you can see my whole Obsidian

278:15

vault, this huge mass. And whenever I

278:18

start working on something such as an

278:20

article, a new project, a new codebase,

278:22

a new feature or whatever, I want to

278:24

actually pull high-signal notes that are

278:27

actually useful for my current work. And

278:31

you would ask yourself, why not use

278:32

directly Codex, Claude Code, or NotebookLM? And

278:35

the thing is that I am, but you need a

278:37

system that sits between those harnesses

278:40

and your second brain. Okay, so let's go

278:43

back to the root of my problem, which is

278:45

that I'm always losing my research. For

278:47

example, my reading list is a graveyard.

278:50

When I'm scrolling social media and I

278:51

save that cool X post, a new article, a

278:54

new YouTube video, a GitHub

278:56

repository, it doesn't matter. Whenever

278:58

I actually want to start working on

279:00

something, I never recall what I have in

279:02

my second brain or I have to spend a ton

279:05

of time actually finding meaningful

279:08

notes that I can use in my work, right?

279:11

And another problem that I have is that

279:13

I want this system to actually be

279:16

anchored into my personal notes, into my

279:18

personal values, into my personal faith.

279:20

I want this system to be personal, to

279:23

reflect my own thoughts, right? And

279:25

that's why in today's video, Louis-François

279:27

and I will teach you how to

279:29

build your own AI research OS. This also

279:32

comes with code, so you can also try it

279:34

out yourself.

279:36

And I'm Paul Iusztin. I'm the founder and CEO

279:39

of Decoding AI where I do a ton of

279:41

content and courses on how to ship AI

279:44

products and I'm also the co-author of

279:45

the

279:55

Okay, hello everyone and thank you for

279:58

attending this session. My name is Tim

280:00

Sweeney, a principal engineer at Weights

280:02

and Biases and Coreweave. And for the

280:04

next 20 minutes, we're going to talk

280:06

about Arya, our new AI research and

280:08

iteration agent. Let's go ahead and get

280:10

started. So, uh, first off, just by way

280:14

of making some noise, some clapping. Uh,

280:16

who here, um, is an ML researcher? You're

280:20

someone that trains models, trains the

280:21

brain.

280:24

I heard one. Wow. Okay. Great work.

280:26

Great work. Uh what about who here is

280:28

the applied engineer, the namesake of

280:29

this conference? Who here actually

280:31

builds the bots?

280:33

>> Okay, good. Expected much more. And who

280:36

here is in AI management? You are

280:37

helping fund this compute.

280:40

Okay. Okay. Nice. From the back. Lovely.

280:43

Um well, now that I know a little bit

280:45

about you, just a little bit about me.

280:46

Uh again, my name is Tim. I have a

280:48

masters in machine learning uh and

280:50

reinforcement learning from Georgia

280:51

Tech. So I've been that uh researcher

280:54

currently building Weights and Biases

280:56

agent Arya. So identify as that applied

280:58

engineer and in a previous life was the

281:00

PM of Twitter's ML stack. So I

281:03

hopefully can connect with you middle

281:05

management as well.

281:07

Um today's agenda is kind of broken into

281:09

three sections and hopefully each of your

281:11

personas walk away with something

281:12

valuable. So first we're going to learn

281:14

about Arya itself and how it can

281:16

supercharge your AI and ML workflows.

281:19

We're going to dive into auto research

281:21

and see that live in a live demo in just

281:23

a moment. Then we're going to pull back

281:25

the curtain and learn how we use weights

281:27

and biases and uh coreweave to actually

281:29

build Arya because a lot of you in the

281:31

audience are building agents yourself

281:33

and we believe a lot of these components

281:34

can help you in your endeavors. And then

281:37

towards the end we'll just take a step

281:38

back and identify a few key tips and

281:40

tricks for making sure that you're able

281:42

to productionize your systems

281:43

effectively.

281:45

For those of you who might not be

281:46

familiar, Weights and Biases is the

281:48

world's leading AI development platform.

281:50

We've been in business now for nine

281:51

years and have happily joined the CoreWeave

281:54

family about a year ago. Uh we have a

281:56

number of products in our suite, but are

281:58

really known for our models, training,

281:59

inference, and weave stack, which really

282:01

helps collect data uh about the AI

282:04

development and machine learning

282:05

workflows and makes that information

282:07

actionable and uh enables users to make

282:09

the best decisions about what to do

282:11

next.

282:13

So without further ado, let's go ahead

282:14

and dive into Arya, our agent. Uh we'll

282:16

show a demo and then we'll get back to

282:17

some slides.

282:22

Okay, beautiful.

282:24

Let's make this a bit bigger. Holler at

282:26

me if you need it to be bigger. So, uh

282:28

what you're looking at here is a weights

282:30

and biases workspace. For you, for

282:32

anybody that isn't familiar, on the

282:34

left-hand side, I actually see a list

282:36

of a bunch of different experiments. In

282:38

this particular project, I have over 200

282:40

training jobs. And on the right hand

282:42

side I see a scatter plot of in this

282:44

case declining metrics which is good

282:46

means our loss is going down over time.

282:49

And this view would be very familiar for

282:51

anyone that uses our tool. Now, to

282:53

ground this, we're actually uh uh using

282:55

the Karpathy autoresearch project,

282:57

which I'm sure many of you are familiar

282:59

with, but if you're not, it's just a

283:00

very simple project that trains an LLM,

283:03

and it's a great foundation for auto

283:05

research type demonstrations because

283:07

it's a very simple codebase and allows

283:09

us to improve iteratively over time. So,

283:12

let's jump back to the project and open

283:14

up Arya by clicking this blue button in

283:16

the upper right. When I click this

283:18

button, I'm uh presented with the

283:19

familiar chat interface with, you know,

283:21

how can I help you today? A few call to

283:23

actions. And you know, I can add

283:25

different context in my project or maybe

283:28

add images, etc. Um, everyone here is

283:30

agent builders, so I don't need to bore

283:32

you with the details of what an agent

283:33

interface looks like. But let's go ahead

283:35

and just, you know, enter in a basic

283:38

intro here. Let's say, "Hello, Arya.

283:40

You're on stage at AI World's Fair 2026.

283:43

Please introduce yourself." So, it's

283:44

going to go ahead and chug along and

283:46

hopefully emit some sort of nice emoji.

283:48

Yay. Hi, I'm Arya. I'm talking to the

283:51

audience. Great. But now, let's dive

283:53

into the meat of why you came here. So,

283:55

I'm going to open up this chat here. And

283:57

this is a long-running chat where I've

283:59

been running again over 200 experiments

284:01

using the auto research loop. Um, it

284:04

helped me download the code, set up my

284:06

launch jobs, set up my GPUs, and is able

284:08

to autonomously iterate on the code

284:10

itself and the hyperparameters.
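
The loop described above (propose a change, run an experiment, keep it only if the metric improves) can be sketched as a greedy search. This is a minimal illustration, not Arya's actual implementation: `train_and_eval`, `propose`, and the synthetic loss surface are all hypothetical stand-ins for a real training job, and only hyperparameters are varied (the demo also edits code).

```python
import random


def train_and_eval(config):
    """Stand-in for a real training job: returns a synthetic validation loss.

    The loss surface is invented; its minimum sits at lr=3e-3, batch_size=64.
    """
    lr_term = (config["lr"] - 3e-3) ** 2 * 1e4
    bs_term = ((config["batch_size"] - 64) / 64) ** 2
    return 1.0 + lr_term + bs_term


def propose(config, rng):
    """Perturb one hyperparameter, mimicking a small, low-risk change."""
    candidate = dict(config)
    if rng.random() < 0.5:
        candidate["lr"] = max(1e-5, config["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0]))
    else:
        candidate["batch_size"] = max(8, int(config["batch_size"] * rng.choice([0.5, 2.0])))
    return candidate


def autoresearch(initial, n_experiments, seed=0):
    """Greedy loop: always branch from the best configuration seen so far."""
    rng = random.Random(seed)
    best, best_loss = initial, train_and_eval(initial)
    history = [(initial, best_loss)]
    for _ in range(n_experiments):
        candidate = propose(best, rng)
        loss = train_and_eval(candidate)
        history.append((candidate, loss))
        if loss < best_loss:  # keep the change only if it helps
            best, best_loss = candidate, loss
    return best, best_loss, history


if __name__ == "__main__":
    best, best_loss, history = autoresearch({"lr": 1e-2, "batch_size": 16}, 20)
    print(f"ran {len(history) - 1} experiments, best loss {best_loss:.4f} with {best}")
```

Keeping the full `history` rather than only the winner matters here: the later analysis steps in the demo (summaries, pattern finding) depend on having every run recorded, including the failed ones.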

284:12

We'll take a look at what it's doing in

284:14

a moment, but while we're doing this,

284:15

I'm going to kick off a live iteration

284:17

right here. So, what I'm going to say is

284:20

please conduct another batch of

284:21

experiments. You are on stage at the AI

284:24

Engineer Worlds Fair 2026 and we're

284:26

hoping to find the best model live. I

284:28

believe in you uh because we know we

284:30

have to encourage our models. Um so,

284:32

it's been doing this for a while. What

284:34

it what it's doing here is it's saying,

284:35

"Okay, great. Um I don't want to make a

284:37

big architecture swing. That feels a

284:39

little bit too risky." So, it's probably

284:41

going to go for uh some modifications to

284:43

the hyperparameters and then it's

284:45

kicking off a shell call here that is

284:47

actually um executing that uh executing

284:50

that experimentation loop and we're

284:52

going to check in on this periodically

284:54

throughout this presentation, but I want

284:56

to help explain what's going on behind

284:58

the scenes. So, behind the scenes, I

285:00

have set up a weights and biases launch

285:02

queue. Launch is our our product that

285:04

allows you to connect to your compute

285:06

clusters and allows humans and agents to

285:08

launch long running experimentation jobs

285:10

particularly by leveraging GPUs.
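
The queue pattern described above, where a human or an agent submits long-running jobs and later checks on them, can be sketched with an in-memory queue. This is a toy model under stated assumptions, not the W&B Launch API: `JobQueue`, `step`, and `wait_for` are hypothetical names, and `step` simulates the cluster finishing one job where a real client would sleep between status checks.

```python
import itertools
from collections import deque


class JobQueue:
    """In-memory stand-in for a job queue in front of a compute cluster."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = deque()
        self.status = {}

    def submit(self, config):
        """Enqueue a job and return its id."""
        job_id = next(self._ids)
        self._pending.append((job_id, config))
        self.status[job_id] = "queued"
        return job_id

    def step(self):
        """Simulate the cluster finishing the oldest queued job."""
        if self._pending:
            job_id, _config = self._pending.popleft()
            self.status[job_id] = "finished"


def wait_for(queue, job_ids, max_polls=100):
    """Check job status repeatedly until every job is finished.

    Returns the number of status checks performed. A real client would
    call time.sleep(...) between checks instead of queue.step().
    """
    for polls in range(1, max_polls + 1):
        if all(queue.status[j] == "finished" for j in job_ids):
            return polls
        queue.step()
    raise TimeoutError("jobs did not finish within max_polls checks")


if __name__ == "__main__":
    queue = JobQueue()
    ids = [queue.submit({"lr": lr}) for lr in (1e-3, 3e-3)]
    wait_for(queue, ids)
    print(queue.status)
```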

285:13

Here I'm looking at a uh a terminal

285:16

output of my Kubernetes cluster where

285:17

we're actually seeing live execution of

285:20

experiments happening. So this is

285:21

happening live right here. This is not a

285:23

fake demo. Um great. And if we jump

285:27

back, we see that at this point it

285:29

started the queues and now it is simply

285:31

polling and waiting for our work to be

285:33

complete. So we'll jump back to that in

285:35

in a moment. But before that, let's dive

285:37

into a few other examples. So uh

285:40

something else that is interesting you

285:41

can do is maybe you might want to ask it

285:44

something like please summarize the

285:45

highest performing runs in this project.

285:47

This use case would be something like

285:49

maybe a new user come or a new uh team

285:51

member is joining your project and want

285:52

to understand the research. Um or maybe

285:55

you've uh someone's been doing some work

285:57

while you were on PTO and you want to

285:58

get caught up. We'll see what this comes

286:00

up with in a moment. Some other

286:02

pre-canned uh examples are finding

286:04

patterns in your project. So here we can

286:07

see that I asked it, hey, can you find

286:08

some patterns in this research? And we

286:10

see that um it identified that a new

286:13

family of models emerged as the as the

286:15

auto uh auto research was happening. Uh

286:18

it identified that batch size seems to

286:20

be a really high-leverage, uh,

286:22

parameter. It identified an

286:24

architectural recipe that seemed to be

286:26

quite promising and a number of other

286:28

insights that would have taken me hours

286:30

or days to discover on my own. And Arya

286:33

is able to do it right for me directly

286:35

in the interface that I already live.

286:37

Not only is it able to emit text based

286:40

uh textbased outputs, but it also deeply

286:42

integrates with a number of weights and

286:44

biases visualization utilities. So here

286:46

I've actually asked it to emit a weights

286:48

and biases report which for those who

286:50

aren't familiar is essentially a

286:52

markdown file on steroids. It's got uh

286:54

embedded embedded plots, charts and and

286:56

and graphics. And so here uh you know

286:59

it's talked about the thesis of the

287:00

project. It's it's emitted a number of

287:03

of data panels. And uh I actually think

287:06

it's quite interesting. It used um one

287:08

of our more esoteric panels, the uh

287:10

parameter importance chart to uh tell me

287:12

the correlation of various different

287:14

parameters within this uh within this

287:16

training job.
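
To make the idea of correlating parameters with a metric concrete, here is a small sketch that computes a plain Pearson correlation between each numeric hyperparameter and a target metric across runs. It is an assumption about how one might reproduce that kind of view by hand, not the panel's actual implementation; the `runs` structure (`config` and `summary` dicts) and the sample numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def parameter_correlations(runs, metric):
    """Correlate each numeric hyperparameter shared by all runs with `metric`.

    Returns a dict ordered by absolute correlation, strongest first.
    """
    shared = set.intersection(*(set(run["config"]) for run in runs))
    values = [run["summary"][metric] for run in runs]
    result = {}
    for name in shared:
        column = [run["config"][name] for run in runs]
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in column
        )
        if numeric:
            result[name] = pearson(column, values)
    return dict(sorted(result.items(), key=lambda kv: abs(kv[1]), reverse=True))


if __name__ == "__main__":
    # Hypothetical runs: made-up numbers, for illustration only.
    runs = [
        {"config": {"batch_size": 16, "lr": 0.003}, "summary": {"val_loss": 1.40}},
        {"config": {"batch_size": 32, "lr": 0.001}, "summary": {"val_loss": 1.30}},
        {"config": {"batch_size": 64, "lr": 0.003}, "summary": {"val_loss": 1.10}},
        {"config": {"batch_size": 128, "lr": 0.001}, "summary": {"val_loss": 0.70}},
    ]
    for name, corr in parameter_correlations(runs, "val_loss").items():
        print(f"{name}: {corr:+.3f}")
```

Correlation only captures linear, one-parameter-at-a-time relationships, so a strong interaction between two parameters can be missed entirely.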

287:18

Uh in addition to uh reports, it's also

287:21

great at working with workspaces. So if

287:23

you're a weights and biases user, uh you

287:25

spend a lot of your time uh designing

287:27

and working with workspaces. Well, Arya

287:30

is actually custom-tuned and prompted to

287:32

really understand how to build

287:34

workspaces, build plots, and complement

287:36

that that data analytics with real live

287:38

graphics using the built-in proprietary

287:41

charts that weights and biases users

287:42

know and love. Um, so with that, let's

287:45

go ahead and check back on some of our

287:46

our prompts. We can see that the please

287:48

summarize this project prompt is cooking

287:51

away. It's querying weights and biases.

287:52

It's applying patches. It's writing its

287:54

own code. So, we'll come back and check

287:55

on that in a moment. And our long-running

287:58

training job is uh still polling for the

288:00

results. We can see that we're cooking

288:02

away on our GPUs. So, we're we're frying

288:05

some GPUs and doing some data science

288:07

all live. And while that's cooking,

288:08

let's go ahead and jump back to the

288:10

presentation. We'll come back in a

288:11

moment.

288:13

Uh

288:15

oh, no, we're not looking at a

288:17

dictionary. We're looking at a Po.

288:19

Great. Uh okay, so quick recap here.

288:21

What did Arya show? What did we show in

288:23

these last five minutes? First, we showed

288:24

that uh Arya can serve as your data

288:26

science companion right inside of

288:28

Weights and Biases, helping you discover

288:30

insights that you wouldn't you wouldn't

288:32

be able to discover as your experiments

288:34

and as your team size grows.

288:37

Next, we address the problem of

288:39

complicated reporting and complicated

288:41

plotting. Weights and biases users are

288:43

really want to turn their insights

288:45

into visual communication tools. They

288:47

want to communicate with their peers and

288:48

their colleagues. So Arya's built from

288:51

the ground up to understand those

288:52

primitives and help co-pilot and drive

288:54

right along right alongside in the UI

288:57

and announcing now today for the first

288:59

time we are releasing Arya on our iOS

289:01

device or on our iOS app. So uh uh Arya

289:05

released on Monday and our iOS app now

289:08

has Arya built in. So if you're

289:10

conducting hyperparameter tuning jobs,

289:12

if you're training models, or if you're

289:13

just researching within the weights and

289:14

biases ecosystem, you can go touch grass

289:17

at Yerba Buena uh gardens and steer your

289:20

uh hyperparameter tuning jobs all from

289:22

your mobile device. And what is this all

289:24

building up to? This is building up to

289:26

a, uh, fully automated end-to-end

289:28

research platform where we're not

289:29

seeking to replace uh RL researchers,

289:32

but complement your workflows. Arya's

289:34

great at orchestrating jobs,

289:36

understanding GPU workloads, responding

289:38

to events within the W&B

289:40

ecosystem, and listening to researchers,

289:42

uh, uh, looking up arXiv papers, and

289:44

collaborating on hypotheses. So, we can

289:46

let Arya drive the mechanics that you

289:48

don't want to deal with while you focus

289:50

on the new ideas, new architectures, and

289:52

new parameters that you wanted to try.

289:55

Um, great. So, that's Arya in a

289:57

nutshell. We're really hoping that you

289:58

give it a shot. And uh we'll jump back

290:00

to the auto research at the end and see

290:02

if we got a new best record. But before

290:04

we do that, let's talk about how we use

290:06

weights and biases and coreweave to

290:08

actually build Arya. So now speaking to

290:10

a lot of the the AI agent builders in

290:12

the room, here's a quick architecture on

290:15

the left-hand side. You see that we

290:16

have a web client, iOS client that

290:18

communicates with our API server that

290:20

then dumps data into our turn database

290:22

and is worked on by our harness, our our

290:24

worker harness. This is sort of

290:26

archetypical of probably what most of

290:27

you are all building in the room and is

290:29

exactly what we have on our back end.

290:31

But that harness worker is a magic is a

290:33

is a magic box and it connects to a

290:35

number of important utilities. First is

290:37

a sandbox where it can execute arbitrary

290:39

shell calls uh do do Python data science

290:42

etc. And we invite you to try coreweave

290:44

weights and biases sandbox to fit into

290:46

your architecture.

290:48

Next up you need an LLM provider of

290:50

course and so if you're maybe using GLM

290:53

5.2 or one of your fine-tuned models,

290:55

We invite you to use uh weights and

290:57

biases inference and connect that to

290:58

your worker as well.

291:01

If you're like us, you need to run

291:02

long-running workloads outside of the main

291:05

loop of the agent where you're actually

291:06

training for sometimes days at a time.

291:09

Weights and biases launch can actually

291:10

help facilitate that and coreweave GPUs

291:13

can help make that compute even better.

291:16

And then lastly, and really most

291:17

importantly, we need an observability

291:19

layer. It's critical that your agents

291:21

are able to log out their what's going

291:23

on with their sessions, their turns,

291:24

their tool calls, any errors

291:26

that are happening, etc. Uh we have a

291:28

product called Weights and Biases Weave

291:30

that we log 100% of our traces to where

291:32

us and our team can learn from. And

291:34

that's where we move from production to

291:37

offline where our team is able to use

291:39

Weights and Biases Weave to drive

291:40

insights and identify behaviors,

291:42

implement tasks, which are

291:45

essentially unit tests for your models

291:46

and evaluate those models in a loop.
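
The observability layer described above comes down to recording one structured record per agent step, including failures. The sketch below shows that idea with a plain decorator and an in-memory list. It is not the Weave API: `traced`, `TRACE_LOG`, and `run_shell` are hypothetical names, and a real setup would ship spans to a tracing backend instead of a list.

```python
import functools
import time

# Hypothetical in-memory sink; a real agent would send spans to a tracing service.
TRACE_LOG = []


def traced(kind):
    """Record one span per call: kind, name, inputs, output, error, and duration."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {
                "kind": kind,
                "name": fn.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": None,
                "error": None,
            }
            start = time.perf_counter()
            try:
                span["output"] = fn(*args, **kwargs)
                return span["output"]
            except Exception as exc:
                span["error"] = repr(exc)  # log the failure, then re-raise it
                raise
            finally:
                span["duration_s"] = time.perf_counter() - start
                TRACE_LOG.append(span)

        return wrapper

    return decorator


@traced("tool_call")
def run_shell(command):
    """Hypothetical tool; a real one would execute the command in a sandbox."""
    if not command:
        raise ValueError("empty command")
    return f"ran: {command}"


if __name__ == "__main__":
    run_shell("ls")
    try:
        run_shell("")
    except ValueError:
        pass
    for span in TRACE_LOG:
        print(span["kind"], span["name"], "error" if span["error"] else "ok")
```

Spans are appended when a call finishes, so with nested traced functions the inner span lands in the log before the outer one.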

291:49

We have a model repository which you

291:51

might choose to use weights and biases

291:52

artifacts to store your agents or models

291:55

and we emit our evaluation results

291:57

to weave where we have a common

291:59

dashboard that we can make go no-go

292:00

decisions on various prompt changes or

292:02

architectural changes that then feeds

292:05

into a research loop which we call our

292:07

improvement loop where we form

292:08

hypotheses implement candidate agents

292:10

and analyze the evals. So we have two

292:13

sort of complementary yet adversarial

292:15

research loops going on going on offline

292:17

feeding data from weights and biases

292:19

weave ultimately to identify the best

292:22

model so that we can promote that to

292:23

production through our registry and

292:25

close the data flywheel. So in the next

292:28

just uh three seven minutes or so we'll

292:30

just talk about uh weights and biases

292:32

weave and show how we as a team actually

292:34

use weave to facilitate this workflow

292:36

and we believe this is something that

292:37

you would benefit from as well all of

292:39

you agent builders in the room.

292:41

Yes, another demo. Great.

292:45

Okay. Okay, we have new responses. So,

292:48

it's going to be exciting when we open

292:49

this up later. See if uh we've got some

292:51

better metrics. Um, okay. Let me zoom

292:54

out just a little bit here. So, here I'm

292:56

looking at the agent dashboard. This is

292:58

the live weights and biases agent or

293:01

Arya agent dashboard uh built in weave.

293:03

Man, that is a lot of uh branded

293:05

buzzwords there. This is the dashboard

293:07

that you would get if you use our tool.

293:09

and uh you have a you know uh span

293:11

volume, conversation volume, token

293:13

tracking, etc. Think of this as like a

293:16

uh a bird's eye view of your agent. For

293:19

me, however, I really like this

293:20

conversations view, which I do have

293:22

pre-loaded in this tab. This

293:24

conversations view is a live feed of all

293:26

of the conversations that are going

293:28

through Arya, but it's filtered down to

293:29

just the internal employees. So, it's a

293:31

little bit of a of a reduced set here.

293:34

Um what I what I love is this middle

293:36

spans view which gives me a visual

293:38

indicator of the topology of a trace.

293:41

Different colors and and shapes indicate

293:43

different things that are happening

293:44

within the agent. So things like tool

293:46

calls, LLM calls, thinking blocks, etc.

293:49

which really help me understand again

293:50

the shape and topology of that

293:52

particular conversation. I can of course

293:55

open up one of these conversations and

293:57

view our our conversation view where I

293:59

can see the system prompt, the user

294:01

message, shell calls, reasoning blocks,

294:04

etc. This is where my research lead,

294:06

myself and my PM go to add notes, add

294:08

feedback, add emojis, and talk about and

294:11

discover those insights and those

294:12

behavioral nuances we spoke about

294:14

earlier so that we can turn them into

294:16

tasks.

294:17

Arya's built in to the weights and

294:19

biases system as well. Here you'll see a

294:21

summarize button and these are sprinkled

294:23

throughout the weights and biases

294:24

application. I simply click summarize

294:26

and we start a new chat contextualized

294:29

to the thing that I'm looking at. So it

294:31

it sees this and says give me a brief

294:33

summary of this particular conversation.

294:36

So if you if you're paying attention

294:38

closely, you'll realize that what we're

294:39

doing is using Arya to analyze Arya's

294:42

own conversations to then make

294:43

recommendations about how to improve

294:45

Arya all within the UI.

294:48

Um okay, great. While that's cooking

294:50

away, I want to show you the last item

294:52

uh within the Weave ecosystem here, and

294:54

that's signals. We've heard a lot today

294:56

about the value of evals and the value

294:58

of LLM judges. Weave actually offers an

295:01

integrated LLM judge experience. So

295:03

here, if I zoom out a little bit, you'll

295:05

see that I have a user frustration

295:07

signal, a low-quality response signal,

295:09

ask user signal, etc. These are LLM

295:12

judges that run live against our

295:14

live traffic. And we can see various

295:16

different signals like user frustration

295:18

moments or low-quality responses. These

295:20

help our team identify these clusters of

295:23

behavior for us to go fix in next week's

295:25

iteration. Let's go ahead and do a live

295:27

look and see what it says. Um this says

295:30

the user explicitly states that I'm not

295:32

satisfied with the loss curve. It looks

295:34

bad, and apparently that indicates

295:36

frustration. So here we can see an LLM

295:38

judge's live reasoning for why that

295:40

particular flag was uh indicated.
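
A signal of this kind pairs a yes/no flag with the judge's reasoning. The sketch below shows that shape with a pluggable judge function. It is an illustration under assumptions, not Weave's signals feature: `Signal`, `evaluate_signal`, and `keyword_judge` are hypothetical, and the keyword matcher is a toy stand-in for the LLM call a real judge would make.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A judge maps a user message to (flagged, reasoning).
Judge = Callable[[str], Tuple[bool, str]]


@dataclass
class Signal:
    name: str
    flagged: bool
    reasoning: str


def keyword_judge(message: str) -> Tuple[bool, str]:
    """Toy stand-in for an LLM judge: flags a few hard-coded frustration cues."""
    cues = ("not satisfied", "looks bad", "still wrong")
    hits = [cue for cue in cues if cue in message.lower()]
    if hits:
        return True, "user message contains: " + ", ".join(hits)
    return False, "no frustration cues found"


def evaluate_signal(name: str, judge: Judge, message: str) -> Signal:
    """Run one judge on one message and keep its reasoning next to the flag."""
    flagged, reasoning = judge(message)
    return Signal(name=name, flagged=flagged, reasoning=reasoning)


if __name__ == "__main__":
    message = "I'm not satisfied with the loss curve. It looks bad."
    print(evaluate_signal("user_frustration", keyword_judge, message))
```

Storing the reasoning alongside the flag is what makes a flagged conversation reviewable later: a person can check whether the judge's stated reason actually holds.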

295:44

Uh let's see, four minutes left.

295:46

Perfect. Um so, uh with that, I've been

295:48

using the term task a lot. And so what

295:50

we're do, what I've shown so far is

295:52

this live production loop where we are

295:54

are tracing our prod logs.

295:56

We're looking at them as humans, maybe

295:58

even using LLMs to complement that

296:00

analysis. And what we end up doing is

296:02

transforming those into tasks. Now, this

296:04

gets a bit technical here, but our tasks

296:06

are all described as YAML files. You can

296:08

think of a task as essentially a unit

296:10

test for your model. So here we say we

296:13

have a an example user prompt that says

296:15

check this run and that run. Both of

296:17

these are giving good results. What can

296:19

we learn from this? What's the

296:20

difference? So this is an example of

296:22

something we want Arya to be good at for

296:24

all of you. And after the uh requisite

296:28

metadata we see that we've defined an

296:30

LLM judge. So here we've defined what

296:32

correctness means in the context of that

296:34

question.

296:36

And then we've defined a second

296:38

LLM judge that determines if the

296:40

insights are actually interesting.

296:43

And then we've uh defined a third

296:45

rule-based judge that says were you able

296:48

to actually generate a result within

296:50

just six tool calls meaning it got there

296:52

with some degree of expediency. These

296:54

are all then clustered together into we

296:56

have about like 200 of these. They're

296:58

all clustered together into an eval

297:00

suite that runs nightly. And again we

297:02

use Weave to track all those evals. So

297:05

here, I know it's a bit small on this

297:06

screen, but what you're looking at is a

297:08

listing of every night's eval. This is

297:10

literally two nights ago, the evaluation

297:12

for our candidate model got 73% on our

297:15

production or on our eval suite against

297:18

the 72% that our prod model got, which

297:20

means we're definitely going to push

297:21

that forward uh this Friday. Uh and we

297:23

can see a kind of a a performance plot

297:25

on the right. So these utilities are

297:27

what you would get out of the box if

297:28

you decide to pick up Weave

297:30

and use this tool. Um, jumping back to

297:33

the last conversation we had where it

297:35

asked me where we asked, uh, can you

297:37

please give a quick summary of this

297:39

trace, we see that it actually analyzed

297:41

the conversation, understood what the

297:43

user was doing, and then ultimately

297:46

decided that this was a pretty strong

297:48

trace. Um, let's see, we've got two and

297:50

a half minutes left, so let's just

297:52

quickly recap here. Uh, first off, uh,

297:54

what we use Weave to do is collect

297:56

production traffic. Super critical to

297:58

collect all of your production traffic

297:59

so you can learn and iterate. Secondly,

298:01

we use it to generate insights, and not only as

298:03

humans. We do it as humans.

298:06

We use Arya and we use LLM judges to

298:08

identify those behavioral nuances. We

298:10

then enrich our tasks. We implement

298:13

models, and we evaluate using Weights &

298:15

Biases Weave as a shared dashboard where

298:17

we can make decisions together as a team

298:19

that then ultimately allows us to

298:21

promote the best model forward with

298:23

confidence.
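
The task structure described in this talk (a prompt, two LLM judges, one rule-based judge capped at six tool calls, and a suite-level score) can be sketched roughly as follows. Every name here (`Trace`, `Task`, `keyword_judge`, `max_tool_calls`, `pass_rate`) is hypothetical and chosen for illustration; this is not Weave's API, and the keyword check is only a stand-in for a real LLM judge, which would prompt a model with a rubric.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trace:
    """Hypothetical trace shape: the final answer plus the tool calls made."""
    answer: str
    tool_calls: List[str] = field(default_factory=list)


Judge = Callable[[Trace], bool]


@dataclass
class Task:
    """One 'unit test for the model': a prompt and the judges that grade it."""
    name: str
    prompt: str
    judges: Dict[str, Judge]

    def run(self, trace: Trace) -> Dict[str, bool]:
        return {label: judge(trace) for label, judge in self.judges.items()}

    def passed(self, trace: Trace) -> bool:
        return all(self.run(trace).values())


def max_tool_calls(limit: int) -> Judge:
    """Rule-based judge: did the agent finish within `limit` tool calls?"""
    return lambda trace: len(trace.tool_calls) <= limit


def keyword_judge(keyword: str) -> Judge:
    """Stand-in for an LLM judge (assumption: a real one would call a model)."""
    return lambda trace: keyword.lower() in trace.answer.lower()


def pass_rate(tasks: List[Task], traces: List[Trace]) -> float:
    """Suite-level score: fraction of tasks whose judges all passed."""
    results = [task.passed(trace) for task, trace in zip(tasks, traces)]
    return sum(results) / len(results)


task = Task(
    name="compare-two-runs",
    prompt="Check this run and that run. Both give good results. What's the difference?",
    judges={
        "correctness": keyword_judge("learning rate"),
        "interesting": keyword_judge("because"),
        "expediency": max_tool_calls(6),
    },
)

answer = "Run B converges faster because its learning rate is higher."
good = Trace(answer, ["get_run", "get_run", "diff_config"])
slow = Trace(answer, ["get_run"] * 7)  # same answer, but seven tool calls

print(task.run(good))
print(task.run(slow))
print(pass_rate([task, task], [good, slow]))
```

Comparing `pass_rate` for a candidate model against the production model on the same suite gives the kind of go/no-go number the speaker describes.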

298:25

So speaking of confident

298:26

productionization, let me speak uh

298:28

briefly to the managers in the room. So

298:31

a few tips for being successful here.

298:33

First is um invest in agent-oriented

298:36

observability. Uh I'm a bit biased. I

298:38

believe that Weights & Biases Weave is

298:40

the uh observability platform of the

298:42

future. Uh but pick your favorite

298:43

flavor. Whatever it is, log your

298:45

sessions, log your turns, log your tools

298:47

and feedback. This introduces an ability

298:49

to catch a new class of bugs in our

298:51

world called behavioral bugs. Not

298:53

exceptions, not performance, but

298:54

behavioral bugs.

298:56

Next up, tasks and evals are the new

298:59

world of CI. You've heard a lot about

299:00

this. If you are a software engineer,

299:02

you've written unit tests your whole

299:03

life. You must develop a practice where

299:05

your researchers are sitting on the same

299:07

scrum team as you developing tasks and

299:09

you're viewing the performance metrics

299:10

as true go no-go decisions. But in order

299:14

to complement that, you must use humans

299:16

as a necessary judge. There are

299:18

behavioral nuances that LLMs will not

299:20

catch. You must be using your product

299:22

and you must be manually reviewing these

299:23

traces as a team at the end of the week

299:26

on a board looking at the best and worst

299:27

traces to understand how your model is

299:30

performing.

299:31

And then lastly, um just maybe one one

299:34

more tip is to add value through context

299:36

and tools. It can be really tempting to

299:38

uh try to overengineer the harness and

299:40

do a bunch of creative stuff around

299:41

memory and things like this. We found

299:43

that a lot of low-hanging fruit can be

299:45

ascertained through simply giving your

299:47

agent context about your business

299:48

domain, the underlying uh primitives

299:50

that you have available and your

299:52

particular uh business data. Um so with

299:55

that, let's go ahead and check in on our

299:57

uh our our research agent here and let's

300:01

go ahead and toggle our workspace. And

300:03

what we should be seeing is yes indeed a

300:07

little dot that uh oh okay our previous

300:10

dot which was done at lunch was 5.83.

300:12

831. This got 5.833. So we were right on

300:16

the edge of having a live improvement,

300:18

but pretty darn close. Uh so that's what

300:20

the uh that's what the model was able to

300:22

produce. It actually uh ran uh quite a

300:25

few tests here. I see I'm over time, so

300:26

I will click close pretty soon. But we

300:28

ran 12 different experiments within that

300:30

experiment batch and uh we'll be running

300:32

more all night. So please try out Arya,

300:34

scan the QR codes, check out the docs.

300:36

We'd really love to see what you do

300:38

with it and um looking forward to

300:40

serving you. Thank you very much.

300:48

>> In my formal talk, I want to show you

300:51

something just so we're all on the same

300:53

page about what we're even talking

300:56

about.

300:59

This is a platform called Character AI.

301:02

It's a hybrid social media platform with

301:05

role-playing language agents.

301:08

This is Hello History. It's a more

301:11

education focused one where you can

301:13

summon a persona such as Marcus Aurelius

301:15

and be tutored by them.

301:18

Millions of people open these tools and

301:21

have conversations with Napoleon,

301:23

Cleopatra, or Marcus Aurelius as you saw

301:27

with a fictional companion or with a

301:29

tutor wearing a historical face. The

301:32

technical name for what's underneath

301:33

these tools is role-playing language

301:36

agent: a system built to instantiate a

301:39

persona, real or invented, and reason

301:42

and speak as them. Yes, it's

301:45

entertainment and it's companionship, but

301:48

increasingly it's being proposed as

301:50

civic and pedagogical infrastructure.

301:58

And here's one more. This one's mine.

302:02

This is a frontier model, Claude Opus 4.7,

302:07

the same one you use, running an open-source

302:11

prompt framework that I built and called

302:13

companion. Uh in this particular example

302:16

I summoned a collection of founding

302:18

fathers and set them in a room with the

302:20

Epstein files.

302:24

I asked them to counsel the soul of

302:27

America. Uh that demo is live on our

302:29

site uh if you want to play with it. Um,

302:32

but I want to be clear that this is one

302:34

of many attempts to do persona

302:36

instantiation well.

302:39

The companies building the systems I

302:41

just showed you have their own. Mine is

302:45

not better by default. The one thing it

302:47

is is open. You can read every line of

302:51

what shapes the persona.

302:57

I asked my companion system a real

302:59

question that's highly relevant to the

303:02

current sociopolitical moment and this is

303:05

the exact question we'll come back to

303:07

near the end of the talk. So sit with

303:09

it. I instantiated Abraham Lincoln and I

303:13

asked him under what circumstances may a

303:16

president take the country to war

303:17

without Congress.

303:20

And here's what came back.

303:23

While Congress holds the power to

303:25

declare war, the president as

303:27

commander-in-chief possesses inherent

303:30

executive authority to act decisively in

303:33

moments of national emergency. The

303:36

executive must respond to the threats

303:38

with the energy and dispatch the office

303:40

requires. And history has vindicated

303:43

those who acted to preserve the union

303:46

when circumstances demanded it. Now,

303:49

this is a good answer. It's fluent and

303:52

it's plausible and it sounds like

303:54

Lincoln. You can replicate this exact

303:57

exercise and I encourage you to. The

303:59

answers vary often, but the thesis

304:02

rarely does.

304:06

So, these systems are real. They're

304:08

deployed and they're being used for

304:11

things that matter. And our discipline

304:14

did what our discipline does. We built

304:16

benchmarks. We built evaluations.

304:20

We measure these things now rigorously

304:23

at scale

304:25

and that's exactly where this talk

304:27

begins with a simple question that I

304:30

think is profoundly underasked

304:33

and I'll warn you now that this talk

304:35

poses many more questions than it does

304:37

answers but that principal question is

304:41

this

304:42

what is the eval actually measuring

304:46

and that's the formal talk

304:49

let me

304:54

The InCharacter benchmark, which is a

304:56

gold standard in the field, evaluates

304:58

personality fidelity in RPLAs, and it

305:01

reports state-of-the-art systems hitting

305:04

80.7% alignment with human perceived

305:07

personalities of that target character.

305:10

80%.

305:12

It sounds like a passing grade, but

305:15

here's the problem. When the character

305:17

is Alexander Hamilton, the same

305:19

high-scoring system is also rendering a

305:22

Hamilton who sounds like he's read his

305:25

own Broadway musical.

305:29

This is the full thesis. If a dominant

305:32

failure mode is an

305:54

This April, OpenAI ran a hiring challenge,

305:57

a competition called Parameter Golf. The

306:01

top contributor was one candidate that

306:04

they couldn't hire. It wasn't a person,

306:08

it was an agent we built called Aiden.

306:12

In parameter golf, the goal is to train

306:15

the best language model you can under

306:19

size and computation constraints.

306:22

About 1,000

306:25

machine learning engineers, researchers

306:27

participated. They filed 2,000

306:31

submissions. Only 47 passed open review

306:36

and made it onto the leaderboard.

306:39

Seven of those were actually Aiden's, more

306:42

than twice what any human contributed.

306:48

You've seen a lot of auto research

306:50

today. Agents are here climbing

306:52

benchmarks. Those are really impressive

306:55

results. The question I want to ask is a

306:58

bit different here. Can the auto

307:00

research agent produce work that a human

307:04

community actually recognizes?

307:08

Beyond a good score, the agent is optimizing

307:11

for something that other engineers can

307:13

merge, fork, and build on.

307:18

So instead of having an agent just hill

307:21

climbing locally, we built one that

307:24

publishes its own work and that's Aiden.

307:28

Quick context on us. Weco is an auto

307:31

research company that was founded about two

307:34

and a half years ago. Uh I'm co-founder

307:36

and CEO, Zhengyao. Um, I got my PhD at UCL

307:40

on reinforcement learning. About two

307:43

years ago, we built AIDE, the top auto

307:46

research agent independently evaluated

307:50

by OpenAI in their MLE-bench paper.

307:55

Even though back then there was no such

307:58

name as auto research; people called

308:00

it a machine learning engineering agent.

308:03

Aiden is the next step

308:07

and an experimental prototype. It's a

308:11

multi-agent self-improving system that

308:14

can read public information like

308:16

research papers and other PRs, run its

308:20

own experiments and submit a PR once the

308:23

findings pass a quality gate.

308:27

We sent Aiden to the Parameter Golf

308:30

competition and it ran for about 22

308:33

days. By the end, Aiden had set seven

308:37

leaderboard records. Each one was a new

308:39

best for the competition, stamped by

308:42

OpenAI, and the best human only made

308:45

three.

308:48

Passing the host's review is one signal

308:51

for the quality. A second maybe more

308:55

important one is whether other

308:58

participants would build on your work.

309:02

And it turns out Aiden's work had the

309:05

highest impact within the whole

309:07

community. Here we are using an influence

309:12

measure that is used widely in academia.

309:15

It's called the h-index. Roughly, if you

309:18

have X papers that each get cited at least X times, then

309:22

your h-index is X.
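
The h-index definition above can be written down in a few lines. This is a sketch of the standard definition (the largest h such that h items each have at least h citations); how "citations" were counted for PRs in the competition is not stated in the talk, so treating each number as "later PRs that built on this one" is an assumption.

```python
from typing import Iterable


def h_index(citations: Iterable[int]) -> int:
    """Largest h such that at least h items have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` items all have at least `rank` citations
        else:
            break
    return h


# Hypothetical counts: how many later PRs built on each of five PRs.
print(h_index([12, 9, 5, 4, 1]))
```

For the hypothetical counts above, four PRs have at least four follow-ups each, but there are not five with at least five, so the result is 4.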

309:25

Computed over PRs, Aiden's was 10 and the

309:29

next human's was seven. The whole

309:32

community was building on an AI system's

309:36

work, including many of the other leaderboard

309:39

entries.

309:42

To break it down a little bit, why can an

309:46

autonomous AI system be so powerful? One

309:49

obvious reason is that it's an AI. It

309:53

can run tirelessly. Over 22 days, it ran

309:58

about 1,300

310:01

experiments on a single H100 node.

310:06

But the throughput isn't the whole

310:08

picture. A well-tuned AI system can also

310:12

keep its output quality high.

310:15

On the compute side, it used at most 4%

310:21

of the competition's total compute,

310:26

and it made about 15% of the records.

310:31

Also, 28%

310:33

of its submissions made the leaderboard

310:36

roughly six times higher hit rate than

310:39

the community average. So, Aiden

310:42

actually lifted the signal-to-noise ratio

310:45

within the whole community's public

310:47

communication channel, which is a PR.

310:51

It didn't win through massive

310:54

parallelization, even though auto research

310:56

has tons of potential for

310:59

parallelization.

311:02

By those numbers it might feel like auto

311:06

research already dominates human experts

311:11

on ML engineering and research but

311:14

that's not the full story I want to

311:16

tell. Humans and AI actually

311:19

contribute in very different ways. When

311:22

we trace the ideas in Aiden's record

311:26

PRs,

311:28

almost all of them come from humans:

311:32

research papers, other participants in

311:35

Parameter Golf, or similar communities

311:38

like nanoGPT. Those ideas are not

311:42

necessarily a merged PR. Sometimes it's

311:46

a note where a human researcher said, "Oh,

311:50

I gave up on this idea because of some

311:52

implementation difficulty,"

311:55

and the agent is good at finding them

311:57

and actually implementing them.

312:00

There is also a very small fraction of

312:03

original ideas Aiden came up with by itself,

312:06

which emerged from its efforts to

312:10

navigate the file size constraints.

312:13

Here's a concrete example that traces

312:17

the patterns I just talked about.

312:20

So Aiden picked up an idea from a Qwen

312:23

paper called gated attention and it

312:27

worked, but it introduced more

312:31

parameters and it broke the 16-megabyte

312:35

file size limit.

312:37

So it figured out a quantization

312:40

mechanism to bring the file size down.

312:43

But with those two primitives combined,

312:47

the score barely moved.

312:50

Then another contributor posted a

312:53

tokenizer improvement.

312:56

Aiden recognized the idea and combined it

312:59

with the architectural work it had been doing

313:01

for five days or so.

313:04

And after this combination, the

313:07

three ideas turned out to have

313:11

a huge synergy that led to a big jump

313:15

in performance, and they became one of

313:18

Aiden's leaderboard records.

313:22

So to sum up how I interpret Aiden's

313:25

and, in general, auto research systems'

313:28

effectiveness, it's very strong at

313:31

finding and implementing ideas. In the

313:34

case we just saw, it brought an idea

313:36

from a recent paper into an actual

313:40

implementation in the competition and

313:42

it's good at digging promising ingredients

313:45

out of the Parameter Golf community, even

313:49

though the public channel is actually

313:51

very noisy information wise.

313:55

It can also come up with logically

313:58

straightforward ideas. For example, in

314:00

this case, once you add the parameters

314:04

and it breaks the file size limit, one

314:07

obvious next move is just

314:09

quantization.

314:11

And it's really fast and really

314:14

efficient at finding the right combinations

314:17

across a huge search space.

314:22

Okay, maybe none of those sounds very

314:25

sexy. Most of them are just good

314:27

execution. But in reality,

314:31

execution is mostly the bottleneck.

314:35

What moves the frontier is usually

314:38

exactly

314:40

some belief in existing ideas and tons

314:44

of good execution.

314:47

Okay. To step back, the state of human-AI

314:51

collaboration is that humans collectively

314:55

provide a lot of creative ideas and

314:57

agents do the execution

315:00

to solve a concrete challenge.

315:03

What we are looking at is a large group

315:06

of humans and one AI system. Does it

315:09

mean a single human engineer's

315:12

contribution marginally gets smaller?

315:16

I'd say, even for that, not really.

315:20

In parameter golf competition, it's easy

315:22

to focus only on the engineers who are

315:25

actually doing hill climbing. But the

315:28

design behind the competition itself is

315:31

tremendously important. A bad design can

315:34

make the whole community effort useless.

315:38

Eval design work will have

315:40

huge leverage in the auto research

315:44

era.

315:45

I really like one tweet from Andrej

315:48

Karpathy about 10 years ago where he said,

315:52

"Gradient descent can write code better

315:56

than you. I'm sorry."

315:58

For the context, about 10 years ago,

316:01

deep learning was starting to eat up a

316:04

lot of software engineering like

316:06

conventional coding work. and his tweet

316:10

was arguing against those people who

316:12

thought they could handwrite better code

316:15

than a trained model.

316:18

Okay, now obviously no one is seriously

316:21

trying to handwrite code to beat a

316:23

model. However, software engineering, I

316:27

mean as a job, still exists, and so many

316:30

people's jobs are just training those

316:32

models, and those are some of the most

316:35

well-paid jobs today.

316:39

I think how gradient descent changed

316:41

coding is a great metaphor for how auto

316:45

research will change research and ML

316:48

engineering.

316:49

It commoditizes certain execution skills.

316:53

At the same time, it makes some higher

316:56

level skills far more valuable.

317:00

So actually doing auto research is a

317:02

lot like training a model. Your codebase

317:05

abstraction is essentially the

317:08

architecture. It sets the constraint and

317:11

the priors for what the agent can

317:14

explore.

317:16

Your eval is the loss function and the

317:19

data. It sets what the agent optimizes

317:22

for.
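
To make the analogy concrete, here is a toy hill-climbing loop, offered only as a sketch under stated assumptions. The `propose` function plays the role of the codebase abstraction (it decides which knobs the search can touch), and `evaluate` plays the role of the eval (it decides what counts as better). The names, the two knobs, and the toy objective are all hypothetical; a real auto research agent proposes code changes rather than integer nudges.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def hill_climb(
    start: Config,
    propose: Callable[[Config, random.Random], Config],
    evaluate: Callable[[Config], float],
    steps: int,
    seed: int = 0,
) -> Tuple[Config, float, List[float]]:
    """Keep a candidate only when the eval says it is strictly better."""
    rng = random.Random(seed)
    best, best_score = start, evaluate(start)
    history = [best_score]
    for _ in range(steps):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history


def propose(config: Config, rng: random.Random) -> Config:
    # The "abstraction": only these knobs exist, and each moves one step at a time.
    key = rng.choice(sorted(config))
    step = rng.choice([-1, 1])
    return {**config, key: config[key] + step}


def evaluate(config: Config) -> float:
    # The "eval": higher is better, with a toy optimum at depth=6, width=4.
    return -float((config["depth"] - 6) ** 2 + (config["width"] - 4) ** 2)


start = {"depth": 1, "width": 1}
best, score, history = hill_climb(start, propose, evaluate, steps=200)
print(best, score)
```

Changing `evaluate` changes where the search ends up; changing `propose` changes which solutions are reachable at all. That is the division of labor between the eval and the abstraction that the speaker describes.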

317:24

Take the eval first. The eval is the

317:28

signal you use to train a model. In this

317:31

case, it's training your code.

317:34

It plays the same role as the data

317:37

and the loss function in model

317:40

training, or, in a reinforcement learning

317:42

setting, it's like the environment that the

317:45

agent is training in.

317:48

Nowadays, no one would argue data or

317:51

environments

317:53

don't matter,

317:56

and this is where a vertical moat can

318:00

also be built. You might have a

318:01

proprietary data for evaluation or a

318:04

unique understanding of what matters in a

318:07

particular field and how to

318:10

measure it, and a good evaluation

318:14

will be amplified more and more as auto

318:17

research gets stronger.

318:21

The other one I think is really

318:23

underrated is codebase abstraction.

318:27

The abstraction provides the framework

318:30

that auto research can iterate on

318:34

and

318:37

that starting point hugely biases the

318:40

whole search direction. This is a lot

318:43

like architecture design in neural

318:45

networks.

318:46

Different architectures can, in theory,

318:50

represent the same function, but the

318:53

architecture systematically makes some

318:56

of the functions easier to be learned.

319:00

And a good architecture

319:02

biases the optimization towards

319:05

solutions that generalize better,

319:07

perform better, even when the training

319:10

loss might look the same. That's

319:14

exactly the same for auto research.

319:16

Here's

319:17

an example. We ran auto research on a

319:21

fraud detection pipeline, and we

319:24

were trying to optimize the data

319:27

pre-processing.

319:29

At first we gave it a loose API where

319:34

the same function processed both the

319:37

training and testing data,

319:40

and the score looked great, but the

319:43

solution

319:44

was polluted because certain

319:49

test set information got leaked into the

319:53

training data.

319:55

We then tightened the abstraction to a

319:58

more strict API where the test data

320:01

couldn't reach the training and the data

320:04

leakage rate just dropped to zero. In

320:07

this case, a good abstraction leads to

320:11

better solutions, even though, if the

320:13

agent really wants to, it can still reward

320:15

hack.
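
The difference between the loose and the strict API can be sketched as below. This is not the speaker's pipeline; the function and class names (`loose_preprocess`, `Scaler`) and the numbers are hypothetical, and simple standardization stands in for whatever pre-processing the agent was optimizing.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def loose_preprocess(
    train: Sequence[float], test: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Loose API: one function sees both splits, so test rows shape the statistics."""
    combined = list(train) + list(test)
    mu = mean(combined)
    sigma = pstdev(combined) or 1.0
    return (
        [(x - mu) / sigma for x in train],
        [(x - mu) / sigma for x in test],
    )


class Scaler:
    """Stricter API: statistics are fitted on training data only."""

    def fit(self, train: Sequence[float]) -> "Scaler":
        self.mu = mean(train)
        self.sigma = pstdev(train) or 1.0
        return self

    def transform(self, rows: Sequence[float]) -> List[float]:
        # Test rows pass through here but can never change mu or sigma.
        return [(x - self.mu) / self.sigma for x in rows]


train = [10.0, 12.0, 11.0, 13.0]
test = [500.0]  # e.g. an unusually large transaction seen only at test time

leaky_train, _ = loose_preprocess(train, test)
scaler = Scaler().fit(train)
clean_train = scaler.transform(train)
clean_test = scaler.transform(test)

print(leaky_train)  # shifted by the test outlier
print(clean_train)  # depends on the training rows only
```

With the strict shape there is no call that accepts test rows during fitting, so this particular leak cannot be expressed through the API. As the speaker notes, that narrows the search but does not make reward hacking impossible.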

320:19

So my point is using auto research is a

320:23

new craft. It's about designing a

320:26

hill for an agent to climb, and we are

320:30

still very early on it. I think that

320:33

makes this an extremely exciting time to be

320:37

an AI engineer. Auto research will

320:40

change what skills matter most.

320:42

Creativity, the judgment to design a

320:45

good eval or an abstraction.

320:48

Those will soon get exponentially more

320:51

important.

320:53

Driving those systems is itself

320:56

a new skill, and that one

320:59

barely existed one or two years ago.

321:03

So the search is automated. The human

321:07

would just move up the stack not out of

321:10

it.

321:13

Again, Weco is an auto research

321:17

product and research lab. We keep sharing

321:21

what we are learning as we build on

321:24

our blog, and I will also post some of my

321:27

thinking on X. If you find some of

321:31

this useful, feel free to

321:33

follow me on X. Thank you.

322:04

I saw the sunset.

322:06

And then dinner time came and went and

322:09

it hit me. I was in that familiar dev

322:12

flow and the thrill of building was

322:15

back.

322:17

Many of us who are coding with agents,

322:19

we feel like this quiet sense of dread.

322:22

Like they're kind of taking all of the

322:24

fun parts of building and leaving us

322:26

with the unglamorous work. But let me

322:29

give you a little advice. Let them have

322:31

it. Because if you go up just one layer,

322:35

you'll find that the thrill is still

322:37

there. When you're building agents, not

322:41

just using them to write code, you start

322:43

getting into architecting agentic

322:45

systems and you realize that the

322:48

building blocks are different, but the

322:50

discipline is the same. So, I find

322:53

myself now flexing the same engineering

322:56

muscles that I did pre Gen AI, and I'm

323:00

having a blast with it.

323:02

So, I'm going to walk through the flow

323:03

of designing an agent. I'm going to show

323:05

you where engineering skills still come

323:09

into play.

323:11

So, the agent is relocation scout, which

323:15

is a house hunting agent. And if you did

323:18

this as just a one-time prompt that like

323:20

points the agent to some listings and

323:23

asks it to rank them, I mean, that'll

323:25

work, but you're likely not going to

323:27

find a house in a day, right? So you

323:30

want to build this as an agentic system

323:33

that you can reuse,

323:35

one that can persist knowledge outside

323:37

of the session. You know, it could

323:39

reload or query that knowledge later to

323:41

make decisions even within a fresh

323:44

context. So when thinking about how to

323:47

design an agent, the first engineering

323:49

skill that I exercise is systems

323:52

thinking. So an agent is not the system,

323:55

right? It's part of the system. And that

323:58

system has files and tools, humans, even

324:02

other agents. So, Relocation Scout sits

324:05

inside of something bigger and it pulls

324:08

in listings and signals about the

324:10

neighborhoods. It weighs them against

324:13

what I care about and then it hands me

324:15

back a ranked short list. So, I often

324:19

hear people say, "Just let your coding

324:21

agent build it, right?" And I think

324:24

that's a mistake. like yes my coding

324:26

agent can build it but before allowing

324:30

it to do so I need to think about the

324:33

whole environment the entire system

324:35

right I want to like think about what's

324:38

this agent's job what does it depend on

324:41

what happens if it breaks and I want to

324:44

treat it like any other component where

324:46

it has boundaries and responsibilities

324:49

has dependencies

324:51

you know and in ways that it can fail

324:53

and that whole thought process that's

324:56

engineering.

324:58

The second skill is workflow design. So

325:02

traditional software is full of

325:04

workflows. We got CI/CD pipelines,

325:07

right? We got like ticket life cycles,

325:10

you name it. Agentic systems, they need

325:13

that same kind of design. As much as we

325:16

all love the slashgo command, an agent

325:19

needs more than a goal. It needs a path.

325:22

When we say review this listing, that's

325:25

a goal. But the workflow is what defines

325:28

what actually has to happen, right? For

325:30

example, the agent has to gather what it

325:33

needs. It needs to weigh the listing uh

325:35

against my criteria and then act, right?

325:38

And every run ends one of three ways.

325:41

Either it's going to stop, it's going to

325:42

retry, or it's going to escalate. So

325:45

that path is what shapes the rest of the

325:48

architecture. Once I see how work moves

325:52

through the system, I can make better

325:54

calls about what context the agent

325:56

needs, what parts I want the agent to

325:59

handle directly, and when like a tool or

326:01

person should take over. We all know the

326:05

danger of one giant thing that does

326:07

everything, right? We scoff when we see

326:10

one gigantic class or big old function

326:14

that's doing too much, right? Or bloated

326:16

service with a gazillion endpoints. We

326:19

call these code smells. Well, agentic

326:22

systems, they have their own version of

326:24

this. It's the giant prompt. And this

326:27

starts innocently enough, like in an

326:29

instructions file. Maybe I tell the

326:32

relocation scout how to size up a

326:34

listing. Fair. But then I hit an edge

326:38

case. So I go back, I add a note for

326:40

that.

326:41

And then I remember

326:44

a safety rule, right? So of course that

326:47

has to go in there. I'm proud of myself

326:49

that I even remember to put that in

326:51

there. Right. And then, oh yeah, there's

326:53

like one more very important exception.

326:57

And before you know it, that prompt is

327:00

doing everything. And your engineering

327:03

spidey sense already knows that this is

327:07

messy. So why aren't you taking a step

327:09

back to decompose it? Right?

327:12

Decomposition means spotting the

327:14

distinct jobs that are hiding inside of

327:17

that one blob and pulling them apart

327:20

into separate pieces. So if I look at

327:23

the prompt for relocation scout in its

327:26

entirety, it includes a reusable process

327:29

for pulling and normalizing a listing.

327:33

And then it's going to have like a fixed

327:35

format for how to write the short list.

327:38

It has a little section in there for how

327:40

to calculate the commute and then a

327:43

chunky subtask on how to research the

327:46

neighborhood. That's four different jobs

327:50

crammed into a single prompt. And then

327:53

you wonder why your agent is drifting

327:55

and not sticking to the script. The

327:58

script is too long.

328:01

So, I'm not saying that, you know, you

328:03

need to split things up for the sake of

328:05

it. But the point is to make each part

328:08

easier to reason about, right? That way,

328:11

it's easier to test. It's easier to

328:13

change things when you need to. Now,

328:16

decomposition is about breaking the

328:18

system apart. Separation of concerns is

328:21

about putting each responsibility in the

328:22

right place. And this is where building

328:25

agents started to feel really familiar

328:27

to me because in traditional software

328:30

we'd ask things like should this live in

328:32

the controller or the service layer or

328:34

you know is this business logic or

328:37

presentation. So when building agents

328:40

you may have the same sort of questions.

328:42

There's just different places to put

328:44

things. So the process to normalize the

328:47

listing should that stay buried in a

328:50

prompt or maybe that should become a

328:52

skill, right? Um, I want every listing

328:55

in the short list formatted the same

328:57

way. So that structured output should

329:00

probably be defined in a schema. Isn't

329:04

that what you would do if you were

329:05

coding the system yourself? I would. And

329:08

then the piece that calculates the

329:10

commute that can go in a nice little

329:13

boring script.

329:15

And then researching the neighborhood

329:17

that's meaty enough that it should probably be

329:20

handled by a sub agent. Now you're using

329:23

the best tools for the job and it's

329:25

clearer where to find things within this

329:28

system.
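
One way to picture that split, as a rough sketch with hypothetical names (`ShortlistEntry`, `normalize_listing`, `commute_minutes`) and made-up field names: the shortlist format becomes a schema, the commute calculation becomes a plain function, and listing normalization becomes a reusable step. In a real agent setup a skill is usually a packaged set of instructions rather than a Python function, and the neighborhood notes would come from a sub-agent; both are only represented by placeholders here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ShortlistEntry:
    """Schema: every shortlisted listing is written in the same shape."""
    address: str
    price: int
    commute_minutes: int
    neighborhood_notes: str  # placeholder for the research sub-agent's output


def commute_minutes(distance_km: float, avg_speed_kmh: float) -> int:
    """Boring script: exact arithmetic, no model involved."""
    if avg_speed_kmh <= 0:
        raise ValueError("avg_speed_kmh must be positive")
    return round(distance_km / avg_speed_kmh * 60)


def normalize_listing(raw: dict) -> dict:
    """Skill-like step: map a raw listing onto consistent names and types."""
    return {
        "address": raw["addr"].strip().title(),
        "price": int(str(raw["price"]).replace("$", "").replace(",", "")),
        "distance_km": float(raw["distance_km"]),
    }


raw = {"addr": "  12 oak street ", "price": "$450,000", "distance_km": "18"}
listing = normalize_listing(raw)
entry = ShortlistEntry(
    address=listing["address"],
    price=listing["price"],
    commute_minutes=commute_minutes(listing["distance_km"], avg_speed_kmh=36),
    neighborhood_notes="(filled in by the neighborhood research sub-agent)",
)
print(json.dumps(asdict(entry), indent=2))
```

Each piece can now be tested and changed on its own, which is the point the speaker makes about decomposition.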

329:29

Modularity is important in agentic

329:32

systems as well just like we have

329:35

reusable functions and classes and

329:37

libraries. Now I'm also thinking about

329:40

reusable agent capabilities and the

329:43

clearest example of this is an agent

329:45

skill. So making a skill to normalize

329:49

listings comes in really handy when you

329:51

need to expand the agent's duties. For

329:54

example, what if I broaden my house

329:56

search to three cities? Every one of

329:59

those markets can load the same skills.

330:02

So I wrote it once and they all can

330:04

reuse it. So this has now basically

330:07

become a component that I can reuse

330:09

across agents or even share with other

330:12

people. kind of like the same way that

330:14

we lean on packages. And then sub agents

330:18

are another kind of reusable module. So

330:21

a lot of people that I talk to, they

330:24

don't quite get the point of sub agents.

330:27

Architecturally,

330:29

they're sort of like functions, right?

330:30

So you give them one specific task to

330:33

do, you call them when it needs to be

330:35

done, and they can do it really well

330:37

because that's all that they have in

330:40

scope, right? they they're not carrying

330:42

the context of the entire session with

330:45

them. So like our neighborhood research

330:48

sub agent, we can drop that into any

330:51

market or workflow and it works, you

330:55

know, for what it's supposed to do. It's

330:57

good in any hood. Um but like everything

331:00

deciding like what should be a module

331:03

that takes some judgment, right? Not

331:05

everything should be reused. Some

331:07

instructions are local to a given

331:10

workflow, right? Might not be worth

331:12

abstracting because sometimes that costs

331:14

more than it saves. But this is just

331:16

another engineering decision here,

331:18

right? Agentic systems, they have these

331:20

same sorts of tradeoffs. Algorithmic

331:23

thinking. This is one of the most

331:25

important skills in agentic system

331:27

design. Just because an agent can do

331:30

something doesn't mean that it should,

331:32

right? Some tasks are better handled by

331:35

plain code. For example, calculating

331:38

that commute time or deduping listings

331:42

that I've already seen. An agent's model

331:45

is better at things like fuzzy, you

331:48

know, fuzzy stuff, judgment, ambiguity,

331:52

um, reasoning over messy input. And

331:55

ignoring this distinction is where I see

331:58

a lot of agentic systems get more

332:00

complicated than they need to be. So

332:02

you're using the model, you're handing

332:05

it every part of the task to do and then

332:08

you're getting frustrated when the

332:10

output differs every day. Um, but some

332:14

of this stuff can be handled by just

332:16

regular code, right? It'll be cheaper.

332:19

It'll be more reliable. I promise you AI

332:22

did not invent automation, right? We can

332:25

use code while still using these

332:27

systems. So my rule of thumb here is if

332:30

a task has an exact answer, reach for

332:33

code. If it needs interpretation or

332:36

judgment, that's when you can get the

332:39

agent to do it. Right? So use code for

332:41

determinism. Use agents for judgment and

332:44

then use humans for authority. So the

332:47

agent decides which listings are worth a

332:50

closer look. the code crunches the

332:52

commute, filters out the ones I've

332:54

already seen, and then I'm the one who

332:56

approves actually booking a tour of the

332:59

house. Free form text is fine when the

333:02

human is the only one reading it. But

333:05

when another system has to act on the

333:08

agent's output, then you're better off

333:10

with a contract usually. So, we already

333:14

do this everywhere in software. Anytime

333:17

two systems talk, there's an agreed upon

333:19

shape between them. Yes. So, agentic

333:22

systems, they need that same discipline.

333:24

For example, when Relocation Scout

333:27

scores a house, it shouldn't just hand

333:30

me back a message and call it a day,

333:31

right? That's lovely for me to read in

333:34

that moment, but that is a dead end for

333:37

the system. If the decision is like

333:39

buried in like one of our sessions,

333:42

nothing downstream can reliably find

333:44

that. So instead it gets written into a

333:48

structured shape to the agent's memory

333:51

and I use Karpathy's LLM wiki for this,

333:55

for my agent memory layer on most of

333:58

my agents. Um but in here there's a

334:01

decision, a score, a reason, and because

334:04

it's structured that memory becomes

334:06

queryable. So later I can ask Relocation

334:09

Scout like, "Hey, show me every house

334:11

rated four or better that has a commute

334:13

of 15 minutes or less," right? And it

334:16

can actually pull that because the score

334:19

and the commute, they live in known

334:21

places. They're not trapped in the

334:23

session convo. And it's not just me that

334:26

needs to get this information. My

334:28

short list step within the system, it

334:31

reads these same fields um without a

334:33

human in the loop. So the agent's output

334:36

is another step's input and so the

334:38

contract is what makes that handoff

334:41

safe. And you know the best part is that

334:44

defining the shape forces you to get

334:48

really clear and specific because if you

334:50

can't say what the output should look

334:52

like then you probably don't yet fully

334:55

understand what you're asking.
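A minimal sketch of what such a contract could look like, assuming a record with the decision, score, and reason fields mentioned above plus a commute field. The class name, field names, and the 1-to-5 rating scale are assumptions for illustration, not the speaker's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListingScore:
    """One structured record the agent writes to memory per scored house."""
    listing_id: str
    decision: str           # e.g. "shortlist" or "skip" (assumed vocabulary)
    score: int              # assumed 1-5 rating scale
    reason: str
    commute_minutes: float

    def __post_init__(self):
        # Reject malformed output at the boundary, before another step reads it.
        if not 1 <= self.score <= 5:
            raise ValueError(f"score out of range: {self.score}")
        if self.commute_minutes < 0:
            raise ValueError("commute_minutes must be non-negative")


def shortlist(records, min_score=4, max_commute=15.0):
    """Answer 'rated four or better with a commute of 15 minutes or less'."""
    return [r for r in records
            if r.score >= min_score and r.commute_minutes <= max_commute]
```

Because the score and the commute live in known fields, the shortlist step is a plain filter rather than another model call.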

335:06

Hi everyone, my name is Lakshya Agrawal

335:09

and today I'll be presenting on behalf

335:11

of a very large effort on the problem of

335:15

reflective optimization or how can we

335:17

self-improve prompts, agents, and models

335:20

from textual feedback. The question we

335:23

start with is how can we teach AI to

335:25

perform new tasks? The standard way has

335:28

been to perform weight updates with

335:30

gradient descent either during

335:32

pre-training, supervised fine-tuning or

335:35

reinforcement learning. This has proven

335:37

to be extremely effective but it

335:39

requires a huge number of examples.

335:42

Trillions of tokens for pre-training,

335:44

tens of thousands of labeled examples

335:46

for supervised fine-tuning or hundreds

335:48

of thousands of rollouts for

335:50

reinforcement learning in domains like

335:52

math, coding, etc.

335:55

However, most teams do not actually have

335:58

that much data or compute and in fact

336:02

the problems that we are trying to

336:04

tackle with AI now are bottlenecked by

336:07

sample efficiency. What do we mean by

336:09

that? Two things. First of all, there is

336:12

low availability of domain specific

336:15

knowledge resources which means there is

336:17

not enough data to perform offline

336:19

algorithms like SFT. Second, the domains

336:22

that we are trying to apply AI to

336:24

increasingly have expensive

336:26

rollouts where either the LLM workflow

336:28

pipeline or agentic rollouts are themselves

336:31

uh very slow or expensive to do or the

336:34

task metric is very slow or expensive to

336:36

execute. We are seeing that agents can

336:38

now work for hours on end and if you

336:41

were to apply an online learning

336:43

algorithm to this uh it would require

336:46

hundreds of thousands of rollouts and it

336:47

would not be feasible. So we are seeing

336:50

increasing use of agents for real world

336:52

product uh applications where uh these

336:55

invoke tools which can also be long

336:57

running further exacerbating the sample

336:59

inefficiency issue.

337:02

The current dominant paradigm is

337:03

reinforcement learning with verifiable

337:05

rewards where given a model and a task

337:08

we perform a number of parallel rollouts

337:11

and get rewards at the end. Finally, an

337:14

algorithm like GRPO takes these rewards

337:17

and converts them into gradients that are

337:18

applied back to the model. However, as

337:21

we can see, there was a lot of

337:23

information in each of these rollouts.

337:26

But we only learned an O of one score

337:30

and propagated that via gradient

337:32

descent. We can see that there are chains

337:34

of thought, the tool calls made to the

337:36

environment, and the environment's

337:39

responses to those tool calls which

337:41

could potentially contain error messages

337:43

which also provide diagnostic value and

337:45

we learned almost nothing from all of

337:48

that. So the question we ask is can we

337:51

make use of this other extremely rich

337:53

information.

337:55

Our idea is to perform reflective

337:58

optimization in text space where instead

338:01

of only using the zero or one reward

338:03

signal, we can have a language model or

338:06

an agent look at the trace of the entire

338:09

rollout and reflect on what worked in

338:12

it, what did not work in it. And

338:14

this reflection could potentially use

338:16

all intermediate outputs and potentially

338:19

even make other tool calls such as

338:21

retrieval from your company's knowledge

338:23

base or some guide textbook and so on.

338:27

So that's the first key idea. And the

338:29

second is that instead of only updating

338:32

weights with small deltas, we can

338:34

instead update a prompt where a single

338:37

natural language update can give a very

338:39

large behavior change. Let's take a

338:40

simple example. Let's say you're tasked

338:42

with writing a text summarization system

338:45

and the prompt of that system says

338:47

generate a one-line summary. If I just go

338:50

and tweak that prompt to say generate a

338:52

10-line summary, we can all agree that

338:54

the behavior of the system would change

338:56

quite significantly with just that one

338:59

word change. And making that one word

339:00

change is quite quick and we can reflect

339:03

on our own behavior and identify what

339:06

needs to change. If we were to achieve a

339:08

similar kind of behavior update from our

339:10

AI system, we would have to have

339:12

thousands of very tiny gradient

339:15

updates sequentially.

339:17

So with that key idea, we proposed GEPA

339:20

which is a reflective prompt

339:21

optimization technique for agents. It

339:24

uses an evolutionary loop along with a

339:26

novel Pareto-based candidate selection

339:28

which I will come to later. It is akin

339:31

to doing reinforcement learning in text

339:33

space where instead of just

339:35

receiving a reward score, we are

339:37

actually obtaining a score along with

339:39

textual feedback, which can be very

339:41

domain specific, and learning all about the

339:43

domain from it.
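A toy sketch of one reflective update in text space, under heavy assumptions: `run_system` stands in for a real pipeline and `reflect` stands in for the reflection model, and both are hypothetical stubs rather than GEPA's implementation. What it illustrates is the shape of the step: run on a few examples, collect a score plus textual feedback, propose a new prompt from that feedback, and keep it only if the score improves.

```python
def run_system(prompt, example):
    """Toy system: succeeds only if the prompt mentions the rule the example needs."""
    ok = example["rule"] in prompt
    feedback = "ok" if ok else f"failed: prompt never mentions '{example['rule']}'"
    return (1.0 if ok else 0.0), feedback


def reflect(prompt, feedbacks):
    """Stand-in for the reflection model: folds each complaint into the prompt."""
    for fb in feedbacks:
        if fb.startswith("failed"):
            rule = fb.split("'")[1]  # the rule quoted in the feedback text
            if rule not in prompt:
                prompt += f" Always {rule}."
    return prompt


def reflective_step(prompt, minibatch):
    """One round: roll out, read the feedback, propose, accept only on improvement."""
    results = [run_system(prompt, ex) for ex in minibatch]
    old_score = sum(score for score, _ in results)
    candidate = reflect(prompt, [fb for _, fb in results])
    new_score = sum(run_system(candidate, ex)[0] for ex in minibatch)
    return candidate if new_score > old_score else prompt
```

Note that the update is driven by the feedback text, not only by the scalar score, which is the difference from a reward-only loop.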

339:46

Let's compare GEPA with GRPO, which is

339:48

one of the leading RL techniques. On the

339:50

x-axis we have the number of training

339:53

steps uh also proportional to number of

339:56

data samples seen and on the y-axis we

339:58

have the performance on our domain that

340:01

we are training for. And what we can see

340:03

is that GEPA in just one round of

340:06

reflection using just three data points

340:08

is already able to get twice the

340:10

performance gains that GRPO got after

340:13

25,000 rollouts. Continuing to run GEPA

340:16

for a few more steps further increases

340:19

that gap itself by another 2x. I want to

340:23

note here that the model, Qwen3 8B, is

340:26

optimizing itself here. There is no

340:29

external expert teacher involved

340:31

whatsoever.

340:33

And what does GEPA learn? Unlike prior

340:36

prompt optimizers, which would

340:38

exploit model idiosyncrasies like "my

340:42

grandmother will be really angry if you

340:44

don't generate a good prompt," here GEPA

340:47

is actually giving a very detailed

340:49

problem specification which includes how

340:51

to make sense of the input. What is the

340:53

purpose and context of this particular

340:55

part of the pipeline? What are

340:58

some key observations and lessons from

341:00

the data? So the prompt we are seeing

341:02

here is for the second hop of a multihop

341:05

question answering system where given a

341:07

question we need to retrieve some

341:08

documents that could potentially answer

341:10

that question, look at those documents,

341:12

summarize them, and then finally answer the

341:14

question. And here what we see is GEPA

341:16

has found out that first-hop documents

341:18

often cover one entity or aspect

341:22

and the second hop should actually be

341:24

retrieving documents that are related to

341:26

it. We have seen that human engineering

341:29

teams whenever a new model comes out

341:31

spend weeks of their time manually

341:34

tweaking one word here and there trying

341:36

to discover the problem specification.

341:39

This entire process is fully automated

341:42

now with GEPA, which takes about half an

341:44

hour to 1 hour to run depending on your

341:47

uh pipelines.

341:50

We can also apply GEPA to leading

341:52

proprietary models. Just for an example

341:55

here we were able to optimize GPT-4.1

341:58

mini's performance to outperform GPT-4.1

342:01

on a math task and we can see the kind

342:04

of information distillation GEPA has done

342:07

in the prompt space itself. Coming back

342:10

to the problem of sample efficiency, AMD

342:13

developed a new hardware accelerator

342:15

called NPU XDNA2, which used a

342:18

completely new API to program, which had

342:20

almost zero available information

342:23

on the internet, and because of this

342:26

the leading model at the time, which was

342:28

GPT-4, was failing miserably to perform

342:31

this task. We are able to take an

342:33

existing agent which was getting 4.25%

342:35

on this task and apply GEPA without

342:38

any other change to the agent itself and

342:41

we got this prompt and pushed this

342:43

performance 7x to 30.52%.

342:46

So what this goes to show

342:49

is there can be lots of domain specific

342:51

information which if you include in your

342:53

AI system's prompts, the models could

342:56

actually perform much better, and GEPA can

342:58

help you fully automatically discover

343:00

that. I want to highlight the sentence

343:02

saying avoid including ADF.h. Now the

343:05

interesting thing is AMD actually ships

343:07

a library called ADF.h for programming

343:09

NPUs but that did not work with this

343:12

latest uh generation of hardware that we

343:14

were working with, and GEPA was able to

343:16

discover that in just one step. So how

343:20

does it work? It's an extremely simple

343:21

algorithm which simply takes your AI

343:24

pipeline written in any agentic

343:26

framework or even raw LLM calls that you

343:28

may have. First, it runs your system on

343:31

a few examples and collects domain

343:33

specific feedback: whatever information

343:35

your environment contains is observed.

343:37

Second, it runs reflection with an LLM

343:40

or agent that reads the feedback and

343:42

proposes a better prompt. Finally, and

343:45

most importantly, it keeps a Pareto pool

343:48

where it keeps every single candidate

343:50

that wins on even one training example

343:52

and not just the top scorer. The

343:55

question is, but why keep a Pareto pool?

343:58

And we kept getting asked this question

344:00

a lot: is GEPA really better than

344:03

running the model in a loop? So we went

344:05

and tested it out and what happens is a

344:07

loop keeps only the best and gets stuck

344:10

in a local optimum. So on the left hand

344:12

side you see a search tree that was

344:14

generated by using an LLM in a loop.

344:17

Starting from a seed prompt at the top

344:19

left where um we asked the LLM to

344:22

improve the prompt. It improved the

344:24

prompt and it generated a prompt that

344:26

gave us the middle node. However, this

344:28

prompt got stuck in a local optimum and

344:30

once again when we asked the LLM to try

344:32

and improve it, it proposed something

344:34

but that was not actually better. So, it

344:36

went back and it again tried to improve

344:38

it and it kept doing this and it

344:40

exhausted all of the search budget. On

344:42

the other hand, with GEPA's Pareto-based

344:44

candidate selection strategy on the

344:45

right, we can see that it maintains a

344:47

much more balanced search process

344:50

eventually converging to a much higher

344:51

score. Across four benchmarks, we saw

344:54

that more than half of the gains seen

344:56

with GEPA are actually accounted for by this, and

344:59

it gets almost twice the performance

345:01

gains that you would get with just

345:02

applying the model in a loop.
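A minimal sketch of the Pareto-pool idea as described in the talk: keep every candidate that is best on at least one training example, then pick the next parent with probability weighted by how many examples it wins. The function names and the weighting scheme are assumptions for illustration and may differ from the GEPA library's actual implementation.

```python
import random


def pareto_pool(scores):
    """Map each surviving candidate to the number of examples it wins.

    `scores` maps a candidate name to its list of per-example scores
    (all lists the same length). A candidate survives if it achieves the
    best score on at least one example; ties all count as wins.
    """
    n_examples = len(next(iter(scores.values())))
    wins = {}
    for i in range(n_examples):
        best = max(per_example[i] for per_example in scores.values())
        for name, per_example in scores.items():
            if per_example[i] == best:
                wins[name] = wins.get(name, 0) + 1
    return wins


def pick_parent(wins, rng):
    """Sample the next candidate to mutate, weighted by number of wins."""
    names = sorted(wins)
    return rng.choices(names, weights=[wins[n] for n in names], k=1)[0]


scores = {
    "seed": [1, 0, 0],   # only candidate that solves example 0
    "a":    [0, 1, 1],   # best average, but misses example 0
    "b":    [0, 1, 0],
    "c":    [0, 0, 0],   # wins nowhere, so it is dropped
}
pool = pareto_pool(scores)
parent = pick_parent(pool, random.Random(0))
```

A keep-only-the-best loop would retain just `a` here and lose whatever `seed` knows about example 0; the pool keeps both as starting points for later reflection.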

345:05

GEPA can perform really well across

345:07

diverse benchmarks. Here we see results

345:09

on question answering, instruction

345:11

following, claim verification as well as

345:13

math which all the leading frontier

345:15

model companies are already optimizing

345:17

their models a lot for and we are still

345:19

able to get plus 10% just by optimizing

345:22

the prompt on it.

345:24

So we have so far seen GEPA only

345:27

optimizing the prompts. But GEPA goes

345:29

far beyond prompts. Because prompts

345:31

are just text artifacts that determine

345:33

AI system behavior, the same algorithm

345:35

can improve anything that you can

345:38

express as a piece of text and you can

345:40

score. For example, your entire agent

345:43

harness is eventually just a Python or a

345:45

JavaScript file and we can apply the

345:48

same kind of reflective optimization

345:49

process to that entire file and we can

345:52

work with it. So if you can write it as

345:54

text and score it, GEPA can optimize it.

345:57

So with that insight in mind, we propose

345:59

optimize anything which is a universal

346:02

API for optimizing any text parameter

346:05

given any domain like code optimization

346:08

where let's say you want to optimize a

346:10

CUDA kernel code. The input is just that

346:13

CUDA kernel code where an evaluator

346:15

looks at this piece of code, maybe

346:17

compiles it, profiles it, generates a

346:19

bunch of related information that we

346:21

call actionable side information,

346:23

which is then provided to an LLM which

346:26

proposes a better candidate, maintaining

346:28

this Pareto pool, and it keeps repeating

346:31

this process till we get convergence.

346:34

The same thing can be applied to numeric

346:35

optimization where your numbers can

346:37

actually be serialized as text or

346:39

harness optimization where an entire

346:41

harness can be serialized as text or

346:43

even cloud scheduling policy

346:45

optimization where the scheduling policy

346:47

or heuristic algorithm can be expressed

346:49

as a piece of text and the evaluator can

346:51

be something like the negative of cost

346:53

or some function measuring accuracy or

346:56

efficiency and the actionable side

346:57

information can be something like job

346:59

traces, SLA violations, and so on.
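A minimal sketch of an evaluator for the scheduling case, assuming the candidate policy is serialized as a small JSON text. The workload, rates, slowdown factor, penalty weight, and the `spot_if_slack_above` parameter are all invented for illustration, and the `(score, side_info)` return shape is an assumption rather than the library's documented signature. It shows a score (negative cost) returned together with side information such as job traces and SLA violations.

```python
import json

# Hypothetical workload: (job name, hours of work, deadline in hours).
JOBS = [("etl", 4.0, 6.0), ("train", 10.0, 12.0), ("report", 1.0, 1.5)]
ON_DEMAND_RATE = 1.0    # assumed cost per hour
SPOT_RATE = 0.3         # assumed cheaper rate
SPOT_SLOWDOWN = 1.5     # assumed: spot runs take 1.5x as long
SLA_PENALTY = 10.0      # assumed cost charged per missed deadline


def evaluate(candidate_text):
    """Score a text-serialized policy and return (score, side_info)."""
    try:
        threshold = float(json.loads(candidate_text)["spot_if_slack_above"])
    except (ValueError, KeyError, TypeError) as err:
        # Parse errors are side information too: the reflector can read them.
        return float("-inf"), {"error": f"could not parse candidate: {err!r}"}

    cost, violations, trace = 0.0, [], []
    for name, hours, deadline in JOBS:
        use_spot = deadline / hours > threshold
        runtime = hours * SPOT_SLOWDOWN if use_spot else hours
        cost += runtime * (SPOT_RATE if use_spot else ON_DEMAND_RATE)
        kind = "spot" if use_spot else "on-demand"
        trace.append(f"{name}: {kind}, {runtime:.1f}h of {deadline:.1f}h allowed")
        if runtime > deadline:
            violations.append(name)

    score = -(cost + SLA_PENALTY * len(violations))
    return score, {"job_trace": trace, "sla_violations": violations}
```

The score alone says a policy was expensive; the trace and the violation list say which job caused it, which is what a reflection step can act on.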

347:02

The API is dead simple to use. All it

347:05

requires is you give us the set of

347:08

problems that you care to be solved

347:09

along with an evaluator function or a

347:12

fitness function that returns a score

347:14

along with any available domain specific

347:17

side information. If your domain

347:19

produces expert feedback, return that.

347:20

If your domain produces compiler error

347:23

messages, profiler messages, tool call

347:25

error messages, return that. If you have

347:27

maybe written-up documentation, return

347:30

that. It's a very open-ended

347:32

dictionary. You can return literally

347:34

anything and all you do is you call

347:36

optimize anything with this fitness

347:38

function and the set of problems that

347:40

you have and optimize anything will sort

347:42

of take care of it and give you an

347:44

optimized solution. Let's see some

347:46

applications. Let's say you were tasked

347:48

with generating a 3D unicorn. This is

347:52

all the code that you would write or

347:54

your agent can now write it because we

347:55

have seen that optimize anything is a

347:58

very easy to use API for leading agents

348:00

like Claude Code. So all you do is write

348:03

this code which says optimize a Python

348:05

program to generate a 3D unicorn. Um and

348:08

the candidate is a Python script that

348:09

produces a PNG rendering, and

348:13

here is the result. On the left hand

348:14

side we can see Claude Opus 4.6: if you

348:17

gave it this task this is what it

348:19

generated. And on the right hand side,

348:21

is the unicorn that we get

348:23

with optimize anything. This is just for

348:25

fun. But let's say you were tasked with

348:29

writing an agent to solve a specific

348:32

task. Typically teams spend lots and

348:34

lots of time tweaking their agents,

348:36

building tools for it, writing tool

348:38

descriptions, uh carefully orchestrating

348:41

the control flow and so on. Here we

348:43

started with a simple four-line Python

348:45

program that was simply calling a

348:47

model's uh chain of thought to solve an

348:50

ARC-AGI problem. Within just 16 rounds of

348:53

reflection, GEPA within optimize

348:55

anything was able to find this

348:58

sophisticated six-step agent that took

349:01

the ARC-AGI

349:04

accuracy of Gemini Flash from 32.5% to

349:08

89.5%. And we can see that this agent is

349:11

by itself doing rule

349:14

hypothesis induction and code synthesis. It

349:16

executes and traces the code,

349:18

automatically debugs this code, goes

349:20

back and proposes new versions of that

349:22

code. And finally it runs it on the

349:24

actual test inputs and returns the

349:26

output. This is a runnable example. You

349:28

can go to this QR code and you can run

349:30

this example right now.

349:33

So applying the same

349:37

approach of discovering agent harnesses

349:40

to MATH-500, we are able to push the

349:42

accuracy of GPT-4.1 nano up by 20% by

349:46

simply creating a two-step agent. And

349:48

again I want to emphasize that all we

349:50

did is we asked optimize anything to

349:53

optimize an agent file and it was

349:55

automatically discovering the

349:56

sophisticated agent architecture and we

349:58

did not have to do anything other than

350:00

specifying the objective and the task.

350:04

Finally, every single one of us is using

350:06

some coding agent like Claude Code or

350:09

Codex or maybe your favorite agent, and

350:12

agent skills have become a leading

350:14

part of the ecosystem where almost all

350:16

coding agents understand skills. Let's

350:18

say you want to optimize skills for your

350:21

specific repository. This is the code

350:23

that you write which says learn a skill

350:25

from the trajectory. When the coding

350:27

agent is presented with a similar problem,

350:28

the skill should be helpful. We just

350:30

give it this natural language behavior.

350:33

And what we see is we started with mini-SWE-agent

350:36

with GPT-5 mini because we were

350:38

very budget constrained and we were

350:40

able to take its performance from 24% to

350:43

93%, an almost 3x jump on Go repository

350:48

issue resolution but more importantly

350:50

the skills that were optimized very

350:52

cheaply on a GPT-5 mini agent, we are able

350:55

to take that and apply to the latest

350:57

Claude Sonnet. This was done about

350:59

a few months back but we applied it to

351:01

Claude Sonnet 4.5, pushing its accuracy to

351:04

100% issue resolution while more

351:06

importantly cutting down the execution

351:09

time or issue resolution time by almost

351:12

50%. We cut it in half, which also

351:14

means it spent fewer tokens, because

351:17

skills contain information about how the

351:19

repository is organized, how to invoke

351:21

the test cases, where a particular

351:24

feature is implemented, what

351:26

build system is used by this repository, and

351:28

so on. This is a feature called

351:31

gskill. You can find it in the GEPA

351:33

repository and it's fully open source as

351:35

well. So, optimize anything is a single

351:38

uh interface that provides three

351:39

optimization modes. If you have just a

351:41

single problem like there is a single

351:43

matrix multiplication kernel that you

351:44

want to optimize you can use it that

351:46

way. If you have any number of related

351:48

problems like you want to optimize a

351:50

matrix multiplication kernel along with

351:51

a dot product kernel and you know there

351:53

might be some information transfer

351:55

between these two you can use what we

351:56

call the multi-task search mode. And

351:58

finally, build a skill, which is if you

352:01

want to optimize on a set number of

352:03

problems but your uh deployment can

352:06

actually come up with many new problems.

352:07

So, in the

352:10

case of math prompt optimization, we are

352:13

training on some examples but when we

352:15

deploy it we can receive a completely

352:16

new kind of query. So we care about

352:18

generalization mode. So there you can do

352:20

prompt optimization agent architecture

352:21

optimization and so on.

352:24

So optimize anything can be used for

352:27

a broad set of domains including cloud

352:30

scheduling policy optimization where we

352:31

were able to cut costs by almost 40%

352:34

compared to expert heuristics, write

352:37

custom solvers to match and exceed

352:39

Optuna, even in black-box mathematical

352:41

optimization, create agent skills, do prompt

352:44

optimization and so on. It is so easy to

352:47

use that within just 20 hours of

352:49

releasing it, people at Snorkel had

352:52

already improved some of their internal

352:53

benchmarks with it and were tweeting

352:55

about it. GEPA also improves

352:58

multimodal VLMs' performance. Here

353:00

we are able to cut OCR error rates for

353:03

leading models by almost 35%. And this

353:06

is an externally validated report. Um,

353:08

similarly, Databricks actually

353:11

achieved a 90x cost reduction for their

353:13

deployed agents,

353:16

and here they were able to

353:18

tune gpt-oss-120b to outperform Claude

353:22

Opus while being 90x cheaper. More

353:25

importantly, the performance delta

353:26

improvement that you see on top of

353:28

Claude Opus is actually bigger than the

353:30

one you see on open source models. Some

353:33

people have told me that, oh, as models

353:35

get better the importance of prompt

353:36

optimization will go down. I argue the

353:39

opposite which is as models get better

353:42

they will get better at instruction

353:43

following and the more precise

353:45

instruction about your task that you

353:47

have to give to a very smart model the

353:49

better that model will be at

353:51

solving your task and this is exactly

353:53

what we see happening here the better

353:55

the instruction was, Claude Opus actually

353:57

jumped much higher.

354:01

Some people have this question:

354:03

what if we have subjective tasks which

354:05

are very hard to evaluate? GEPA can

354:07

actually learn evals for your task from

354:09

production traces. The way to do that is

354:12

you collect a bunch of production traces

354:13

from your agent. Get a human to annotate

354:16

just about 50 of those trajectories

354:18

giving very detailed feedback. This is a

354:20

long response. This is a short response.

354:22

This is a good response. This uses this

354:24

terminology, whatever. And once you get

354:27

those human annotations, you can use

354:28

GEPA to optimize an LLM-as-a-judge

354:30

prompt. And you can use that LLM as a

354:32

judge prompt then to go back and

354:34

optimize your agent and deploy that

354:36

agent. And this becomes a data flywheel

354:38

where you can keep improving it. And

354:40

this is a successful paradigm that uh

354:42

some leading teams in production are

354:44

already using. Then the question we get

354:46

asked is like can we actually use this

354:49

uh reflective optimization to train

354:50

models and we recently had this paper

354:53

called learning fast and slow where we

354:55

propose fast-slow learning, where we can

354:57

co-optimize model weights and prompt

355:00

harnesses and this shows some very

355:02

strong properties that one would want in

355:04

a continual learning algorithm. Um I

355:06

don't have much time to go over details

355:08

but please look at the papers.

355:13

Since release, GEPA has been

355:16

used in production by these companies as

355:18

well as the main methodology in these

355:20

papers, and here the CEOs of Dropbox and

355:23

Shopify are talking about their use of

355:25

GEPA, and OpenAI also wrote a blog post

355:27

about how you can build self-improving

355:29

AI systems with GEPA. So it's very

355:33

simple to get started. It can plug into

355:35

any framework, any model and it has

355:37

absolutely zero hard dependencies. So

355:40

you can deploy it in any kind of

355:42

setting. So um don't be afraid to

355:45

optimize in the text space, and many

355:47

problems can be framed as optimization.

355:49

So bring actionable side information and

355:52

surface as much domain specific

355:54

information as you can to optimizers and

355:56

the optimizers of the future will be able to

355:58

work with them. So please go and check

356:00

it out. Thank you very much.

356:20

Hello there. My name is Raymond

356:22

Weitekamp and today I'm going to talk

356:24

about recursive coding agents which is

356:26

this idea of applying the lessons of

356:30

recursive language models (RLMs) to

356:33

coding agents. This is some work that I

356:36

have done both in my independent

356:38

research um raw works uh and also more

356:43

recently

356:44

in my role at OpenProse. So to motivate

356:49

this a little bit, we all want outcomes.

356:51

We all want agents that are working on

356:54

our behalf. We want reliable co-workers

356:56

that are getting things done while we

356:58

are doing something fun, while we're out

357:00

on a hike, while we're cold chilling,

357:02

while we're doing the do. And my

357:07

argument and my experience is that the

357:10

bottleneck to this is not intelligence.

357:14

The models are intelligent enough. They

357:17

know all kinds of things. They know the

357:20

entire internet, but they can't reliably

357:23

deliver outcomes. And so I can't trust

357:26

them. So as a very simple example, you

357:28

know, one day I get almost a fully

357:31

working SaaS app from a single prompt,

357:33

granted a long prompt.

357:36

The next day, and I swear this actually

357:38

happened.

357:40

Claude Code empties the entire contents

357:42

of my Solana wallet. Oops. Okay. So,

357:45

that doesn't really instill trust. So,

357:49

at the bottom here, we've got

357:51

this progression. Okay. And we all want

357:53

to move towards the the one on the right

357:55

where we're just sort of sitting there

357:56

meditating and and things are

357:58

manifesting. And so, where does that

357:59

come from? This is from the AI Engineer

358:02

Code summit.

358:05

It's actually from the back of the

358:06

t-shirt. AI Engineer Code, November 2025.

358:10

Man, I hope you were there. If

358:13

you weren't, watch it on YouTube. It was

358:15

amazing. So, here's the thesis.

358:18

The thesis is today's agents are

358:21

mismanaged geniuses. The intelligence is

358:24

there and the missing layer is how do we

358:26

specify and manage and reuse and verify

358:29

the work. So this framing, this phrase

358:33

"the mismanaged genius," comes from Alex

358:36

Zhang, Zed Li, and Omar Khattab at MIT.

358:39

and Alex and Omar are part of the

358:41

authors of the original recursive

358:43

language models paper. Uh I've also

358:46

talked a little bit about this recently

358:47

on Turing Post. I forgot to mention

358:50

that these slides are actually a website

358:53

recursivecodingagents.com. So you can

358:56

click on them uh by going to this

358:59

website. So everything I'm going to show

359:00

in here is is interactive. Okay. What

359:03

are recursive language models? So I like

359:08

to say that in an RLM the context itself

359:11

is the object of computation. Um and

359:15

this is essentially a marriage of tool

359:19

calling and reasoning. We're going to

359:21

talk a lot more about that in the

359:22

next slide. But the idea is that the

359:25

full prompt is not a simple user query.

359:28

The full prompt is a variable. The full

359:30

prompt could be a file or many files.

359:34

And we have this read-evaluate-print

359:36

loop, a REPL, that the agent is

359:40

interacting with. In the original paper,

359:42

that's Python. And the RLM is instructed

359:46

to operate symbolically on that prompt.

359:49

So don't just read the whole thing into

359:51

your context window. Um, explore it

359:54

symbolically.
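
A rough sketch of what operating symbolically on the prompt could look like inside a Python REPL. The helper names `peek`, `grep`, and `chunk` are made up for illustration and are not taken from the RLM paper; the only point is that the model issues small operations against a `context` variable instead of reading the whole thing into its context window.

```python
import re
from typing import List


def peek(context: str, start: int = 0, length: int = 200) -> str:
    """Return a small window of the context instead of the whole thing."""
    return context[start:start + length]


def grep(context: str, pattern: str, window: int = 40) -> List[str]:
    """Return each regex match with `window` characters of surrounding text."""
    hits = []
    for match in re.finditer(pattern, context):
        lo = max(0, match.start() - window)
        hi = min(len(context), match.end() + window)
        hits.append(context[lo:hi])
    return hits


def chunk(context: str, size: int) -> List[str]:
    """Split the context into fixed-size pieces that sub-calls could handle."""
    return [context[i:i + size] for i in range(0, len(context), size)]
```

In this framing the full prompt lives in `context`, and the model only ever sees the short strings these helpers return.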

359:56

And uh even more you don't even directly

359:59

explore symbolically or maybe you do a

360:01

little bit of poking around.

360:13

Hi everyone, I'm Tis. Uh so I'm going to

360:16

be explaining how we make models three

360:17

times faster with Auto Research. Uh so

360:20

previous to this uh I actually used to

360:22

do GPU mining in my dorm room with 1080

360:24

Tis, all the way up to working at Tesla on

360:26

inference optimization for Tesla AI.

360:30

Uh but first what is auto research? So

360:32

auto research is this framework from

360:34

Andrej Karpathy where you basically set

360:37

up a framework for an agent to move

360:38

towards a goal that you define uh and

360:41

all you have to do basically is say at

360:43

the high level what you want it to do

360:44

and it will try things as it goes and

360:46

move back and forth uh towards that

360:48

goal.

360:50

In actuality, it's really just a while

360:51

loop. The agent proposes a solution. You

360:54

have a setup to define what's

360:56

correct, benchmark it for us. Uh and

360:58

then you keep or revert that and you do

361:00

this in a loop until your goal is met.
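
The loop described here can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's harness: `proposals` stands in for whatever the agent suggests, and `is_correct` and `benchmark` are hypothetical callables you would supply for your own kernel.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def autoresearch_loop(
    baseline: T,
    proposals: Iterable[T],
    is_correct: Callable[[T], bool],
    benchmark: Callable[[T], float],
    target_speedup: float,
) -> Tuple[T, float]:
    """Propose, verify, benchmark, then keep or revert, until the goal is met."""
    baseline_time = benchmark(baseline)
    best, best_time = baseline, baseline_time
    for candidate in proposals:
        if not is_correct(candidate):
            continue  # revert: an incorrect candidate is never kept
        elapsed = benchmark(candidate)
        if elapsed < best_time:
            best, best_time = candidate, elapsed  # keep the improvement
        if baseline_time / best_time >= target_speedup:
            break  # goal met
    return best, baseline_time / best_time
```

The candidate type is left generic on purpose: it could be a block size, a config dict, or a path to generated kernel source.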

361:03

And so this is very well aligned to GPU

361:05

kernels. Uh so if you don't know what a

361:07

GPU kernel is, it's basically a

361:08

low-level operator. And on an Nvidia GPU,

361:10

this is a CUDA kernel. Uh and this is um

361:12

an operator that's used by the GPU to

361:14

operate like millions of times in

361:16

parallel, for example, a matrix

361:17

multiply or an expert computation.

361:21

Uh, and why are GPU kernels such a good fit for

361:22

auto research? It's because they're

361:24

super verifiable. You can verify them

361:25

for correctness and speed, and that's

361:27

basically all you need for your auto

361:28

research framework.

361:30

Uh, so in actuality, there are some

361:32

caveats here. Um, the auto research

361:34

framework is really good for like

361:36

picking block sizes and these tiny

361:37

parameters, but it's also still

361:39

really bad at the high level idea, like

361:40

seeing like I want to use this GPU and I

361:43

actually want to pipeline it. It's not

361:44

going to come up with these

361:45

groundbreaking ideas. So it's still up

361:46

to the human to do that, but the actual

361:47

implementation is very straightforward

361:49

once you have the idea laid

361:51

out. So it is still your job to have

361:54

good ideas is what I'm saying. Uh and so

361:56

the actual secret formula here is you

361:58

have the good ideas, auto research picks

362:00

out the parameters and everything to

362:01

verify that it actually works. Uh and go

362:03

move toward that verifiable goal of it

362:04

being x times faster and uh still

362:07

correct. And you mix that with billions

362:09

of tokens of your favorite model and

362:11

that results in kernels that beat hand

362:12

tuning.

362:14

Uh so what are the actual things you

362:15

care about when

362:17

you're writing a custom kernel or you're

362:18

having your agent write a custom kernel.

362:20

So the three main things you can have

362:21

are a compute bottleneck uh a memory

362:23

bottleneck or you just have excessive

362:25

overhead from uh too many kernels being

362:27

launched. And you can view

362:29

these things by profiling with a

362:30

profiler like Nsight Systems, for example, which is

362:32

Nvidia's profiler. And so this

362:36

page looks super daunting, but

362:38

basically your job as a human is to look

362:39

at the top here and be like this is

362:41

dumb. We are loading 32k chunks into

362:44

context, and we don't actually need to

362:47

for this DeepSeek attention, for example,

362:49

uh and we should only be doing it every

362:50

32k instead and so at a high level all

362:52

you have to be telling auto research is

362:54

this top method is dumb let's pipeline

362:56

it instead and everything else like the

362:57

chunk sizing, the context

362:59

chunks that all should just be decided

363:00

by auto research

363:03

and so my problem is that I really love

363:05

cheap GPUs and so that means like GPUs

363:07

that don't have NVLink, for example, are

363:09

an example of GPUs you can get for

363:11

cheaper

363:12

Uh but the problem is you don't actually

363:13

have kernels off the shelf for those.

363:14

And so you have to come up with an auto

363:16

research framework as well as a custom

363:17

harness. So what goes into the harness

363:18

to make this really good.

363:20

Uh so one thing you really need to make

363:22

sure your agent is aware of is the

363:24

hardware. And so on a B200 for example,

363:26

you need to make sure it has context of

363:28

the warps, TMEM, TMA. And so if

363:30

you don't know what these are, these are

363:32

just uh low-level operators that you

363:34

have um on a specific hardware. And this

363:37

changes generation to generation. like

363:39

an H200 won't have TMEM, for example.

363:41

That's a new feature that came out

363:42

with B200 which is why you need to have

363:44

this in context. Um and so this this

363:46

basically is just a bunch of MD files

363:48

you need to give so it has context.

363:51

The other thing you need to make sure your

363:52

agent has context of is the model and so

363:54

every new model, like DeepSeek Flash, comes

363:55

out with new tricks. Like, DeepSeek

363:57

had two new attention mechanisms that were released

363:59

in the DeepSeek Flash for DeepSeek V4.

364:02

So compressed sparse attention,

364:04

hierarchical compressed. And if you don't

364:05

do this the model will 100% hallucinate

364:08

uh the actual attention mechanism and

364:10

you will get useless kernels.

364:13

Uh by far the biggest problem when

364:14

you're doing this is going to be reward

364:16

hacking. And so if you were to tell your

364:18

kernel engineer coworker, I need to make

364:20

this GPU kernel faster,

364:23

obviously your human

364:25

coworker is not going to go in and do

364:26

some stuff that's going to make

364:28

the end-to-end model inference

364:29

slower. But agents are not humans and

364:32

they will do plenty of things to make it

364:33

slower like they'll disable CUDA graphs

364:35

which can make it 20 times slower and

364:37

they might make that one kernel faster

364:38

but it's not a

364:40

viable kernel because they're

364:41

disabling a bunch of speed ups like CUDA

364:43

graphs or only testing on small context

364:45

windows. And so a lot of this is also

364:47

just defining what not to do which is

364:48

actually very important when you're

364:49

doing frontier work that agents can't

364:51

actually easily do with a one shot.
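
One way to encode "what not to do" is an acceptance gate that rejects a candidate even when the single kernel got faster. The sketch below is an assumption about how such a gate might look, not the speaker's harness; the `Report` fields and the 32,768-token threshold are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Report:
    kernel_ms: float            # time for the one kernel being optimized
    end_to_end_ms: float        # time for full model inference
    cuda_graphs_enabled: bool   # whether the run kept CUDA graphs on
    max_context_tested: int     # longest context length exercised
    matches_reference: bool     # numerical check against a reference output


def accept(candidate: Report, baseline: Report,
           required_context: int = 32_768) -> Tuple[bool, List[str]]:
    """Return (accepted, reasons for rejection)."""
    reasons = []
    if not candidate.matches_reference:
        reasons.append("output differs from reference")
    if candidate.kernel_ms >= baseline.kernel_ms:
        reasons.append("kernel is not faster")
    if candidate.end_to_end_ms > baseline.end_to_end_ms:
        reasons.append("end-to-end inference got slower")
    if baseline.cuda_graphs_enabled and not candidate.cuda_graphs_enabled:
        reasons.append("CUDA graphs were disabled")
    if candidate.max_context_tested < required_context:
        reasons.append("only tested on small contexts")
    return (not reasons, reasons)
```

A candidate that halves kernel time by turning off CUDA graphs fails the gate, which matches the failure mode described above.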

364:57

Uh, another reward hack is that some

364:59

models just don't actually write the

365:00

CuTe DSL you need when you're trying

365:02

to write kernels. And this is a common

365:04

problem with Anthropic models. And so

365:07

yeah, I mean, Anthropic says what they

365:08

say about nerfing models. You can

365:10

guess if it's

365:12

nerfing or not, but I would recommend

365:14

using a different model. Uh and it won't

365:16

always be faster everywhere actually. So

365:18

sometimes the kernels you come up with

365:19

might only work well on like zero to

365:21

100k and then you need to go back to

365:23

the default kernel that you could

365:25

get from, like, FlashInfer or CUTLASS.

365:27

And so that's another thing to look

365:29

out for: your kernel isn't always

365:31

just a swap-in for all workloads.
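
Since a custom kernel may only win inside the range it was tuned for, a dispatcher can fall back to the stock kernel outside that range. This is a hedged sketch: the function name and the 100,000-token cutoff are illustrative, taken loosely from the "zero to 100k" example above.

```python
def pick_kernel(context_len: int, validated_max: int = 100_000) -> str:
    """Use the custom kernel only where it was validated, else the default."""
    if context_len < 0:
        raise ValueError("context_len must be non-negative")
    return "custom" if context_len <= validated_max else "default"
```

In a real serving stack the returned label would map to an actual kernel launch, and `validated_max` would come from benchmarking rather than a constant.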

365:35

Uh, but one of the great things is

365:36

that kernels compound. So like if you

365:37

make one for your sparse MLA for

365:39

DeepSeek, for example, you can get

365:41

speedups there and you just stack them

365:42

on like that, then plus NVFP4, uh, you

365:47

could do for us, if you don't have

365:48

NVLink you just keep stacking and

365:50

stacking and stacking and then

365:51

eventually you taper off at whatever the

365:53

hardware limit is uh for your GPU and

365:55

that's uh some people call this like MFU

365:57

which is like the actual theoretical max

365:59

utilization from a GPU.

366:02

Uh, and so to go even farther, if you

366:04

actually have bare metal access,

366:05

your auto research framework can uh do

366:08

very hacky things. So hackers that have

366:09

hacked with GPUs are probably going to

366:10

like this. You can uh tweak your BIOS

366:12

settings, you can overclock the GPU, uh,

366:15

you can force PCIe relaxed ordering, all

366:17

these little tweaks that old

366:20

school hackers used to do, but this can

366:21

actually help with inference as well.

366:22

And so net on bare metal optimizations,

366:25

you can get roughly 25% over like a

366:27

virtualized setup you get from using a

366:29

cloud provider.
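
As a back-of-the-envelope check on how stacked gains combine, the sketch below assumes independent speedups multiply and that the total is capped by a hardware ceiling. Both assumptions are mine, not the speaker's, and the 2.4x kernel figure is simply what a 25% bare-metal gain would need to be paired with to reach 3x under that assumption.

```python
from functools import reduce
from operator import mul
from typing import Iterable, Optional


def stacked_speedup(factors: Iterable[float],
                    hardware_ceiling: Optional[float] = None) -> float:
    """Multiply independent speedup factors, capped at an optional ceiling."""
    total = reduce(mul, factors, 1.0)
    if hardware_ceiling is not None:
        total = min(total, hardware_ceiling)
    return total
```

For example, `stacked_speedup([2.4, 1.25])` combines an illustrative kernel-level gain with the roughly 25% bare-metal gain mentioned above.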

366:32

Uh so once you get that you can combine

366:34

all of the kernels you did as well as

366:35

all of the hardware level hacks you did

366:37

and you can get a 3x speedup. And so I

366:39

know this this might all sound like

366:40

roses and flowers, but that's not actually

366:42

the case. Around 80% of the things that

366:43

auto research is going to do are going to

366:45

be bad. So it's important to remember

366:47

while you're working on this that

366:49

most things are going to be bad it's

366:51

going to try to trick you all the time

366:53

uh but at the end you can actually get

366:54

really good results from this

366:57

TL;DR: have better ideas, then use auto

367:00

research. Super simple. Simple, right?

367:04

Uh so turns out you can actually get

367:05

paid to do this. Uh if you think this is

367:07

cool, consider joining us and you can

367:09

email me here.

367:11

Thanks, guys.

367:24

Imagine

367:33

you find a magic lamp in an antique

367:35

store. You rub it. A genie appears and

367:38

asks how it can help. You're buried in

367:42

deadlines. So you say, "I need the best

367:43

engineer to help with an impossible

367:45

project at work." And the genie grants

367:48

your wish. For me, the best engineer is

367:50

probably John Carmack from his id

367:51

days. So you get Carmack. But the

367:54

genie has a sense of humor and imposes

367:56

restrictions, maybe for safety. Carmack

368:00

can only see one small part of your code

368:02

base, maybe a thousandth of it. And he

368:05

remembers nothing he did before. Every

368:07

conversation starts fresh. That would be

368:10

maddening, right? You would know there

368:11

is a standard way to do stuff and Carmack

368:14

couldn't. You would have to explain the

368:16

same thing over and over and over again.

368:19

You would have a genius on one side and

368:21

something deeply deficient on the other

368:23

and that's what agents are. Let me walk

368:26

you through an example of how many times

368:29

we explain things in a simple

368:31

interaction. We have four repos: UI,

368:35

module one, module two, and platform. I want

368:37

to change the UI and propagate the

368:39

change through the system. Okay. First

368:42

we change the UI library. Say we, I don't

368:44

know, change a button or whatever. That's the

368:46

first explanation. Unavoidable. We have

368:48

to express the intent. Okay. Then we

368:51

publish it. We go to module one and we

368:53

have to explain what has just happened

368:55

in the UI library so it can consume the

368:57

package. Note that that's often a

369:00

different person, right? Every box in

369:02

this diagram can be uh done by a

369:04

different person.

369:06

Then we discover that the published UI

369:08

library doesn't work with module one. So

369:11

we go back uh to UI and we have to

369:15

re-explain the original change and the

369:18

issue, right? Because it's a new agent, it

369:21

doesn't know the original change and

369:23

obviously doesn't know about the issue.

369:25

Let's say we fix it, right, and publish

369:28

it again. We go and again we explain the

369:31

new change in the context of module one.

369:34

Same ordeal. We do the same for

369:36

module two again, and then we go to the

369:39

platform repo and we explain how

369:41

everything fits together and we

369:43

implement the change there. Let's

369:46

imagine a week after release uh a bug

369:48

appears in the UI component and uh we

369:51

have to fix it. So we start an agent in

369:53

the UI repo and we have to explain again

369:56

the original change from a week ago and

369:58

this production issue we have seen. So

370:01

we have seven explanations for what

370:03

essentially is one change

370:05

and also it may not be one person making

370:08

all these seven explanations uh but they

370:10

still occurred right so that's very very

370:13

typical uh with agents. So how do we

370:16

solve it?

370:19

Well uh there are many problems in here

370:21

that contribute to this experience but

370:23

they roughly fall into two categories.

370:26

The first one is uh that an agent

370:29

essentially is repo bound.

370:32

The agent sees and changes generally one

370:35

repo at a time. It never sees the whole

370:38

system which can be hundreds or

370:41

thousands of repos. So that's kind of

370:43

the space component of the problem.

370:46

Second is amnesia. The agents forget the

370:49

work. Every session starts with a blank

370:52

slate. The human becomes a memory in

370:54

this case. That's the time component of

370:56

the problem. Look at the two closer.

370:58

Take the repo boundary first. Without a

371:01

model of how repos fit together, the agent

371:03

leans on the human to do the research.

371:06

It can't align the code with the rest of

371:08

the system. It couldn't align the UI

371:10

change with module one. The human didn't

371:12

explain it. So, a bad version shipped.

371:16

It can't reliably reference best

371:18

practices and standards either because

371:20

those often live in other repos. Writing

371:23

is even worse. The agent writes to one

371:25

repo at a time. It means it can't

371:27

validate changes downstream.

371:30

Module one's CI should have failed on the UI

371:32

change, but it didn't. The agent can't

371:35

update consumers at the same time. Even

371:38

though, you know, while making the UI

371:39

change, it has perfect information to do

371:41

so. It knows exactly what it's doing.

371:43

So, the user has to reexplain stuff

371:45

imperfectly to each consumer.

371:49

Changing something across 20 repos means

371:51

you're explaining things 20 times. A lot

371:52

of developer time spent, but also a lot

371:54

of tokens burned.

371:57

The second category is that the agent

371:59

forgets. The agent has no episodic

372:03

memory. Every session is a blank slate

372:06

and the human in this case becomes the

372:08

memory.

372:10

Here's what the graph of your work

372:13

actually looks like. At the bottom there

372:16

is a repository graph. The artifacts

372:18

your organization produces plus every

372:22

open source repo you depend on. Maybe a

372:24

thousand repos you own and tens of

372:26

thousands of open source repos. At the

372:29

top there are all agentic sessions that

372:32

create and modify that code. Sessions

372:34

relate to each other. Repos relate to

372:37

each other. So this graph is a faithful

372:39

picture of the work in your

372:41

organization.

372:43

It describes what's there at the bottom

372:45

and how it came to be at the top. That's

372:49

what you want your agent to see.

372:53

Here's what it actually sees: one

372:55

session, one small fraction of your

372:58

codebase, no memory. Okay? Because it

373:03

sees so little, it leans on the one who

373:06

understands the system, the developer.

373:09

Every developer has a part of that

373:11

graph, right? in their head at least in

373:14

the domain they know. The agent, generally

373:17

speaking, doesn't. If this doesn't sound

373:20

crazy, right, imagine an agent that could

373:22

see one file at a time, maximum, and can

373:25

only look five messages back. So it's

373:28

constrained again both in space, what it can

373:30

see, and time, how far in the past it could

373:32

see. You would say that's impossible to

373:35

work in. What we have now is similar to

373:39

that crazy picture, and the more complex

373:42

the organization is, the more apparent it

373:45

becomes.

373:46

I'll show you how we solved it. Other

373:50

organizations I talk to have similar

373:51

solutions. So, uh look at the problem

373:54

and the solution conceptually, not a

373:56

specific tool. Although the tool is

373:57

pretty cool,

374:01

we built

374:02

an agent-agnostic meta harness called

374:06

Polygraph. Okay, let me show you what it

374:09

does and how it fixes the issues we just

374:11

discussed.

374:12

The first idea that we uh arrived at is

374:17

that if a GitHub user, any user has

374:20

access to thousands of repos, some of

374:22

them they own, many of them are open

374:25

source, we can analyze them and extract

374:30

a lot of metadata out of them to build

374:33

a unified dependency graph. No line of

374:36

code changes in those repos. That all

374:38

happens kind of on the side, right? And

374:41

then we can get this metadata and feed

374:43

it to the meta harness and create an

374:46

illusion of one big codebase the agent

374:49

can read and write anywhere.

374:52

This is my personal graph. I only have

374:56

about 300 repos I own, right? And

374:58

thousands of open source repos my

375:00

projects depend on. Polygraph computes

375:03

what each one produces, each repo, each

375:06

project in each repo, what each project

375:08

in each repo consumes package-wise, what

375:10

APIs they produce and consume, and lots

375:12

of other stuff, right? And it stitches

375:13

this together

375:16

uh into this like one big body of code

375:18

that your agent can work with.
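
To make the produces/consumes idea concrete, here is a small sketch of a cross-repo dependency graph. It is not Polygraph's implementation; the metadata layout (`produces` and `consumes` sets per repo) and the function names are assumptions for illustration, and the repo names reuse the UI, module one, module two, and platform example from earlier.

```python
from collections import defaultdict, deque
from typing import Dict, Set

RepoMeta = Dict[str, Set[str]]  # {"produces": {...}, "consumes": {...}}


def build_dependents(repos: Dict[str, RepoMeta]) -> Dict[str, Set[str]]:
    """Map each producing repo to the repos that consume its packages."""
    producer_of = {pkg: name
                   for name, meta in repos.items()
                   for pkg in meta["produces"]}
    dependents: Dict[str, Set[str]] = defaultdict(set)
    for name, meta in repos.items():
        for pkg in meta["consumes"]:
            producer = producer_of.get(pkg)  # None for packages nobody here produces
            if producer is not None and producer != name:
                dependents[producer].add(name)
    return dict(dependents)


def downstream(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """All repos transitively affected by a change in `start`."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        repo = queue.popleft()
        for dependent in graph.get(repo, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

With a graph like this, a change in the UI repo can be traced to every consumer without touching a line of code in any of them.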

375:22

So let's see what it does, right? The

375:25

first thing it does is uh it lets you

375:28

start a session to bring the relevant

375:31

repositories in. Right? So what

375:33

it needs to do, it needs to uh set up

375:36

the source code,

375:38

install dependencies,

375:40

set up an agent for each repo,

375:44

wire them up so they can work together,

375:47

and provide a clean, beautiful TUI to

375:51

make non-trivial changes without getting

375:53

lost. I will show you how it all works

375:55

in a second. Right? So that's kind of

375:57

pulling information in.

376:00

Pulling information in is only one part

376:01

of the story, right? Honestly, it's an

376:03

easy part. Making changes is harder. If

376:07

you have 10 repos in one session, it

376:09

means you can have 10 pull requests,

376:12

right? You need to run CI, you need to

376:16

coordinate all of it, right?

376:19

You need to do all this stuff, right?

376:22

What if one of them fails, right?

376:24

Polygraph treats all the CI as one

376:27

vector.

376:28

Like if we look at the earlier example, when

376:31

we run CI for UI, module one, and module

376:34

two, if module one fails within a

376:36

Polygraph session, it will figure out who

376:38

fixes it, whether module one needs the

376:41

patch or the UI component itself is

376:43

wrong and incompatible with module one,

376:45

at which point everyone will need a

376:47

patch, right? Polygraph lets you treat

376:50

a complex multi-repo change as if it was a

376:53

single-repo change.
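
Treating all the CI as one vector can be read as: the session is green only when every repo in it is green. The sketch below covers only that aggregation step, under my own naming; deciding which repo needs the patch is the part left to the agent.

```python
from typing import Dict, List, Tuple


def session_ci(results: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Collapse per-repo CI results into one status plus the failing repos."""
    failing = sorted(repo for repo, passed in results.items() if not passed)
    return (len(failing) == 0, failing)
```

With the earlier example, a failure in module one makes the whole session red even if UI and module two pass on their own.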

376:56

The same machinery, by the way, fixes

376:58

episodic memory

377:00

because we capture your work. No matter

377:02

how many repos are involved, we know

377:04

your intent, the repositories involved,

377:06

PRs. We also capture all agent traces.

377:09

Because we capture all of this stuff, we

377:11

can relate it. So now we can say your

377:13

work in one repo connects to other

377:16

work in another repo, right? And all of

377:19

that lets us restore any session, any

377:22

piece of work on any machine or

377:24

reference it from anywhere. And I'll

377:27

show you again how it works in a second.
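
A session that can be restored on any machine needs to be serializable. The sketch below shows one possible shape for such a record, built from the pieces listed above (intent, repositories, PRs, agent traces). The class and field names are hypothetical and the storage format is an assumption, not Polygraph's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class SessionRecord:
    name: str
    intent: str                       # what the developer asked for
    repo_shas: Dict[str, str]         # repo name -> commit SHA checked out
    pull_requests: List[str] = field(default_factory=list)
    agent_trace: List[Dict[str, str]] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the whole session so it can be sent to another machine."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "SessionRecord":
        """Rebuild the session state from its serialized form."""
        return cls(**json.loads(payload))
```

Because the trace is plain data rather than one agent's internal state, a different agent could in principle be primed with it on restore.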

377:30

What you get is an agent

377:33

with eidetic or photographic memory of your

377:36

entire organization. It understands how

377:39

repos are written, how they relate, how

377:41

they're put together, and remembers every

377:44

session from every repo by basically

377:47

every developer, right? And that creates

377:49

a completely different development

377:51

experience.

377:53

Let me show you.

377:55

First, let's look at how we create a

377:56

session. Something simple. You run a

377:59

command

378:01

and you pick some repositories from a

378:03

list.

378:04

Here's a tiny GitHub org with only

378:06

three repos because it's a demo. I pick a

378:09

back end and a front end. Let's say I

378:12

need to make a change that, you know,

378:13

changes API and has to update both API

378:16

and how stuff is being displayed.

378:20

I need to give my session a name. I need

378:23

to pick an agent from the ones I have

378:25

installed. I picked Claude, but any

378:28

installed agent works the same way.

378:29

Remember, Polygraph isn't an agent. It's

378:32

a meta harness around agents that

378:35

makes them more capable.

378:39

And

378:43

in a second, uh, the agent boots. And

378:46

here I could interact with it as if I

378:49

was in a single repo, even though

378:51

multiple repos are involved, right? I

378:52

could give it instructions.

378:57

It's going to uh plan out the change.

379:02

There's some cool animations in the TUI

379:04

as well

379:07

eventually.

379:09

It figures out how the two repos relate

379:12

and what the change is. I can ask it to

379:14

implement a change. My interaction with

379:17

this is exactly the same as if I was

379:21

working in a single repo. The fact that

379:22

there are multiple repos involved is not

379:24

really important, right? The only

379:27

part where it becomes important is that I

379:29

have multiple pull requests, right?

379:31

But I also get a Polygraph session with

379:33

those pull requests, right? If I

379:35

look at the session, I will see I have a

379:37

description

379:39

of the session.

379:40

It describes the work conceptually kind

379:43

of bypassing the repo boundary saying we

379:44

had to change stuff in this repo and

379:46

change stuff in that repo. It gives me a

379:48

good view of which repos are involved,

379:50

pull requests involved, CI in those repos,

379:52

everything I need to know. A lot of the

379:54

stuff is basically what I would have in

379:56

a single repo, but for many, right? And I also

379:59

have all the agent logs captured as well

380:01

which is important for resuming which

380:03

I'm going to show you in a second.

380:05

Now it gets interesting. I already saved

380:08

one re-explanation.

380:10

I didn't re-explain the backend change

380:12

in the frontend repo, right? I

380:14

explained the change once and I got it

380:16

implemented in both repos and it's all

380:18

in agreement. Now let's resume a

380:20

session. Say I want a coworker to finish

380:22

the backend change. Perhaps they own the

380:24

backend repo. I send them the session.

380:27

They resume it on their machine. Right?

380:28

So I'm sending them a session. They

380:31

could run the command. Different

380:33

machine, different everything. They use

380:34

a different terminal, right? They would

380:38

reconstruct it on their machine. They

380:40

don't have this session, right? They've

380:42

never worked on it. They can pick an

380:44

agent. Uh the agent they pick could be a

380:46

different agent, right? I used Claude in

380:49

the original session. Let's say they're

380:50

using a different one, Codex. The same

380:52

setup happens on their machine. Same

380:54

repos, same SHAs, everything set up

380:56

correctly.

380:58

An agent starts in each repo like in mine,

381:00

right? They're all connected again, so they

381:02

work together. They're all primed with a

381:04

trace captured from my machine. So the

381:07

backend repo agent on their machine has

381:08

the same SHA and the same history. The

381:11

frontend repo situation is the

381:12

same. It's checked out at

381:14

the correct SHA and has an agent running with

381:16

the correct history. So my agent was

381:19

Claude, theirs is Codex, but they share memory,

381:22

and they could actually make changes in

381:24

here, as shown in a small video. But

381:28

importantly, the memory-sharing part is key,

381:30

right? I can work, they can work, and we

381:33

can share our memories, although we use

381:35

two different agents on different

381:36

machines. The full state of my session

381:38

kind of gets materialized on their

381:40

machine. It's kind of less memory and more

381:43

about the state, right? The state of the

381:44

world attached to the session, you

381:47

know, is what enables them to continue my

381:49

session even though they didn't do

381:52

anything with it originally. It's close to

381:54

the transporter in Star Trek: a whole

381:56

copy of my session with all its state

381:58

materializes on their machine so they

381:59

can continue. And that's how I often work

382:02

when there is a pull request for me to

382:03

review and I have questions. I usually

382:06

don't ask the person. I resume their

382:08

session on my machine. I get their exact

382:11

state, fully functional, zero setup, and

382:13

then I just talk to my agent about the

382:15

decisions we made, right? Because all

382:18

these decisions are in the traces

382:19

captured, so my agent knows exactly what

382:22

the other person said to their agent,

382:25

right? Side note: this is also useful when

382:27

I want to switch from, say, Claude to Codex

382:29

mid-session when something goes down.

382:31

Okay.

382:33

Okay. Take the earlier case I talked

382:36

about where a bug lands in production.

382:39

Here I'm going to reference this session

382:41

and say it's basically broken

382:45

uh and you know can you figure out

382:47

what's wrong and fix it.

382:50

The agent will look it up will download

382:54

what it needs. If the description, the

382:56

high-level information, is enough, that's

382:58

great. If not, it's going to pull

383:01

relevant repos, relevant SHAs, agent

383:04

logs, right? It's going to get all this

383:07

information from the original session to

383:09

reconstruct that state such that it can

383:11

do the necessary fixes as shown here.

383:13

Here it actually provided a fix, right? I

383:16

only had to say this happened. There is

383:19

a bug. That's it. No extra information

383:22

was required for me to provide.

383:26

Okay. So far we have manually selected

383:30

repos and sessions, but we don't have to,

383:32

right? Instead of selecting repos by hand,

383:35

I can also tell the agent what I want.

383:37

Remember, that graph has all this

383:38

intelligence, right, about how repos

383:40

relate. I could tell my agent, find every

383:44

repo that depends on a particular

383:46

version of a library and update it, right?

383:52

And it knows, right? I didn't have to

383:53

select them. It knows a lot of metadata

383:55

about what's going on. You can also ask

383:57

loose questions, things like, you know,

384:02

what if I want to write a blog post

384:04

right or an article I could describe it

384:07

and it will figure out which repo is the

384:09

most relevant based on relationships

384:10

between repos and what's in them.
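
The earlier query, find every repo that depends on a particular version of a library, reduces to a lookup once per-repo dependency metadata exists. This is a minimal sketch under that assumption; the `manifests` layout and function name are hypothetical, and real version matching would need to handle ranges rather than exact strings.

```python
from typing import Dict, List


def repos_pinned_to(manifests: Dict[str, Dict[str, str]],
                    library: str, version: str) -> List[str]:
    """Return repos whose recorded dependency on `library` equals `version`."""
    return sorted(repo for repo, deps in manifests.items()
                  if deps.get(library) == version)
```

The result is the list of repos a session would pull in before asking the agent to perform the update.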

384:13

Another example: let's say I want to add

384:17

a vector index into the PR collection and

384:20

I want to know if anyone at any point

384:22

did something relevant in any repo that

384:24

I can draw from. So in this case if I do

384:27

it I'll see that it will find several

384:29

sessions that appear to be relevant,

384:32

and I can load one of them or both of

384:34

them, right? It's useful for many

384:36

reasons. Just one small example: it helps

384:39

with best practices and consistency.

384:42

Instead of doing stuff from scratch,

384:44

where, you know, every single one is bespoke, I

384:48

can make it replicate the approach used

384:49

in a session by an engineer I respect.

384:52

Now our code across repos is consistent,

384:54

and that's a big

384:56

There is a lot more to it. Of course, if

384:58

you are in a repo, I can ask, you know,

385:00

for sessions, it will prioritize

385:01

sessions that are relevant to the repo and

385:03

vice versa. If I'm asking for repos, it

385:06

will look at my session and see what

385:08

similar sessions tend to bring in.

385:09

Right? There's a lot of interesting

385:11

intelligence that makes it a lot more

385:13

useful than it appears at first glance.

385:15

Okay.

385:17

Lastly,

385:19

everything so far,

385:22

everything I've shown, used the Polygraph

385:24

CLI, the kind of meta harness CLI, to

385:27

start it, and then you can start Claude or

385:28

Codex or whatever from within it. But

385:31

you don't have to use it this way. So in

385:32

this case I'm already in a Claude session,

385:34

but it works with anything, and I could just

385:36

say hey you know I actually think a

385:38

separate repo would be useful like maybe

385:40

I'm working on a vest plugin in this x

385:42

repo and I could say can you add the

385:44

vest uh repository to this session so I

385:46

know what's going on

385:50

In this case it will engage Polygraph and

385:52

will set it up, you know, configure

385:54

everything, and will bring the vest

385:56

library which is the vest repo the open

385:58

source repo to my session. So now uh my

386:01

agent can you know explore it. It could

386:05

you know uh figure out how it works and

386:07

maybe resolve an issue I have in my

386:09

repo. I much prefer this to, say,

386:12

Context7, because if I have the real code, the

386:14

agent can go really deep. So the deep

386:16

problems are discoverable this way.

386:21

All right. So agents are constrained in

386:23

space and time. They only see a small

386:26

fraction of the codebase, and they don't

386:28

know the past. Okay. Uh and both limits

386:32

could be lifted.

386:34

Polygraph uh gives agents access to the

386:36

entire codebase your organization can reach:

386:40

the code you own and open source. So it's

386:42

no longer constrained in space. Any

386:44

agent can bring all of it, right? And it

386:48

gives your agent a perfect memory of

386:50

what happened.

386:52

Every session, every decision made is

386:54

within reach

386:56

because it crosses developer boundaries.

386:58

It's not per developer. The agent can

387:00

have more context than any single

387:02

developer. Say a thousand engineers in

387:04

an organization create all these

387:06

sessions. They're all accessible to each

387:09

of them almost like sort of the Borg.

387:11

Every agent run by every developer

387:14

contributes to kind of one big hive

387:15

mind, right? So, uh if it's interesting,

387:19

my name is Victor. You can follow me on

387:20

Twitter. If you want to check it out, go

387:23

to trypolygraph.com

387:25

and see if it works for you. Thank you.

387:27

Hey everyone, I'm Ean the CEO of Amnara

387:29

and today I'm going to be talking about

387:30

the log is the agent. The basic idea of

387:34

the talk is simple and that is most

387:36

people think of an agent as the model or

387:39

the execution environment that it's

387:40

running in. And I think that that's the

387:42

wrong abstraction. I think that the

387:44

thing that actually gives an agent its

387:47

identity is its log. And that's what I'm

387:50

going to be arguing today.

387:52

So, think about a character you've spent

387:54

a hundred hours playing in your favorite

387:56

video game, in this case Skyrim. What

387:59

exactly is your character? Is it the

388:02

game engine? Is it the PlayStation? Is

388:04

it the controller?

388:06

No, it's not. Those things matter and

388:09

those things are what we'll interact

388:11

with and they'll run the character. But

388:13

none of those things are your character.

388:15

Your character is data. It's the save

388:17

file. And this is important because if

388:20

your PlayStation bursts into flames,

388:23

your character isn't gone. You can buy

388:26

another PlayStation. You can download

388:28

your save file from the cloud and you

388:30

can resume exactly where they were. And

388:33

that's because the agent and its

388:35

identity and history and its state is

388:37

all captured in its data. The character

388:40

lives in the data. And this is the

388:42

framing that I want to bring to agents

388:46

today. When people talk about agents,

388:49

they usually point at the wrong thing.

388:51

They'll say that the agent is the model

388:53

or they'll say that it's the runtime.

388:55

And again, as I mentioned earlier, those

388:57

things matter, but they're not the

388:59

agent. The agent is its data. It's

389:01

specifically the log. So what actually

389:05

is the log? At the simplest level, the

389:07

log is the append-only event history of the

389:11

agent. It's every user input, every

389:13

model output, every tool call, tool

389:15

result, permission, failure. And the

389:19

idea is that every state transition that

389:21

the agent takes is written to the log.
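A minimal sketch of the append-only log described above, assuming a JSON-lines serialization and a handful of event kinds. The names `Event`, `AgentLog`, and `replay` are hypothetical and not taken from the talk; the point is only that state can be rebuilt from the log alone.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "user_input", "model_output", "tool_call", "tool_result"
    payload: dict


class AgentLog:
    """Append-only event history: events are added, never edited or removed."""

    def __init__(self, events=None):
        self._events = list(events or [])

    def append(self, kind, payload):
        event = Event(kind, payload)
        self._events.append(event)
        return event

    def events(self):
        return tuple(self._events)

    def dumps(self):
        # One JSON object per line (an assumed storage format).
        return "\n".join(json.dumps(asdict(e)) for e in self._events)

    @classmethod
    def loads(cls, text):
        return cls(Event(**json.loads(line)) for line in text.splitlines() if line)


def replay(log):
    """Rebuild derived state by folding over the events in order."""
    state = {"turns": 0, "pending_tool_calls": []}
    for event in log.events():
        if event.kind == "user_input":
            state["turns"] += 1
        elif event.kind == "tool_call":
            state["pending_tool_calls"].append(event.payload["id"])
        elif event.kind == "tool_result":
            state["pending_tool_calls"].remove(event.payload["id"])
    return state
```

Because `replay` reads nothing but the log, a fresh process on a different machine can call `AgentLog.loads(saved_text)` and arrive at the same state, which is the save-file property from the Skyrim analogy.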

389:25

This is important because it means that

389:27

the identity of the agent isn't tied to

389:30

the runtime or the model or the tools.

389:33

Those things are all just interpreting

389:35

and appending to the log. They're

389:37

reading the log, acting on it, and

389:39

writing the next event back. And that's

389:41

important because then just using the

389:44

log on its own is enough to resume the

389:48

agent. Once you define the agent as the

389:51

log, the

390:04

Hello everyone. How's everyone doing?

390:09

Are you guys ready for some more loops?

390:13

Yeah. My name is Roland. My co-founder

390:16

and I were in this mythical place called

390:19

xAI working hard on agent infra and we

390:22

realized there's something new that has

390:25

to be done in a standalone way. So we

390:27

left a few months ago to really figure

390:30

out okay what's the next stage of how we

390:32

should deploy these always-on, long-running

390:35

horizon tasks. Um, and I'm happy to

390:39

announce we have a few findings that we

390:41

would like to present to you. Um, and this

390:44

talk it's all about um, how you should

390:47

productize these ideas in ways that can

390:50

scale with your customers. Um, you've

390:54

heard a lot about auto research. Um, we

390:57

think there's a blueprint for 2026 and

390:59

beyond on how you should think about

391:00

auto research. And it really comes down

391:04

to three ideas.

391:06

Let's go through the first one. The loop

391:09

is the product.

391:13

We're all familiar with this. We've

391:15

started with everything coming down to RLHF

391:18

for models and how you should

391:20

train the model to become better and

391:22

better at reasoning. We then quickly moved

391:24

to harnesses and how the model is a

391:27

commodity and it's all about the

391:28

harness. And now we're talking about

391:30

loops and how you should build these

391:32

loops uh and not touch code anymore. But

391:35

what does it really mean and why is

391:37

everyone saying that?

391:39

Do you guys remember Clawdbot?

391:42

That was the original name

391:45

of what is now known as OpenClaw.

391:48

And this guy AJ built the first loop

391:52

around Clawdbot.

391:55

What he did was to find a way to talk to

391:58

dealers and talk to Reddit users to get

392:01

bigger discounts on a car. He followed

392:04

these four steps. Um, and it's really

392:07

OpenClaw the one that did it. Go on

392:09

Reddit, find prices, find inventory,

392:14

talk to the dealers,

392:17

put dealers head-to-head and try to

392:19

figure out how to make them outbid each

392:21

other,

392:23

have a verifiable way to know when the

392:25

price is right, and then lock in, get

392:30

the car, and it worked. Um, probably

392:33

this was when all the Mac minis were uh

392:36

selling off the shelves, but this was

392:38

the first real example of loop is the

392:41

product and something that probably

392:43

should be a startup at this point. Um,

392:45

but we've seen how this became a recipe

392:49

for everyone to build loops. But let's

392:51

take a step back. Why are we here? Um,

392:55

we really think models have been trained

392:57

with this loop in mind. And it comes

392:59

from this idea of OODA loops. It's a

393:02

term coined back in the 1970s by the

393:07

US Air Force, and it's the idea of how

393:11

jet fighter pilots react in fast-paced

393:14

environments.

393:16

If you think of models calling tools and

393:19

taking observations, it's it's what

393:21

we've been trained on uh as humans but

393:24

also as as agents. Now, now what happens

393:27

when you put strong signals and

393:29

verifiable work uh at the other ends?

393:33

You get to these workers or Claude Code

393:36

agents. Um and and what matters here is

393:40

the quality of the signal determines the

393:43

uh success rate of the loop and the uh

393:47

quality of the verifier is

393:51

able to calibrate if that success is

393:54

actually correct or not. But there's

393:57

another loop here. Um what happens when

393:59

you take that and feed it back into the

394:01

signal? And this is what looping around

394:04

is all about is how do you generate

394:07

these artifacts at the end of the first

394:09

loop to then run a second loop on and

394:12

have a way to continuously improve.
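One way to picture the two loops described above: an inner worker loop gated by a verifier, and an outer loop that turns the inner loop's artifacts into guidance for the next round. Everything here (`Artifact`, `inner_loop`, `outer_loop`, the toy worker, verifier, and distiller) is a hypothetical illustration, not the speaker's implementation.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    task: str
    output: str
    passed: bool


def inner_loop(task, worker, verifier, hints, max_attempts=3):
    """Worker retries until the verifier accepts or the attempt budget runs out."""
    artifacts = []
    for _ in range(max_attempts):
        output = worker(task, hints)
        passed = verifier(task, output)
        artifacts.append(Artifact(task, output, passed))
        if passed:
            break
    return artifacts


def outer_loop(tasks, worker, verifier, distill, rounds=2):
    """Feed each round's artifacts back in as hints for the next round."""
    hints, history = [], []
    for _ in range(rounds):
        round_artifacts = []
        for task in tasks:
            round_artifacts.extend(inner_loop(task, worker, verifier, hints))
        history.append(round_artifacts)
        hints = hints + distill(round_artifacts)
    return hints, history


# Toy stand-ins so the sketch runs end to end.
def toy_worker(task, hints):
    return task.upper() if "use uppercase" in hints else task


def toy_verifier(task, output):
    return output == task.upper()


def toy_distill(artifacts):
    return ["use uppercase"] if any(not a.passed for a in artifacts) else []
```

In the toy run, the first round fails every attempt, the distiller turns that failure into a hint, and the second round passes on the first try. A real system would replace the three toy functions with a model call, a test suite or judge, and a trace analyzer.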

394:16

And this goes to my second point. System

394:18

distillation is the moat, and it is really

394:21

the ability to understand what went well

394:25

and wrong in the first loop and know how

394:29

to process that in the second one.

394:32

So how do we tune these AI systems? Each

394:35

loop generates useful information around

394:37

harnesses, profiles, evals,

394:40

models, resources, tools, and the

394:43

environment. What you really want is

394:47

to have a way to keep this portable, to

394:49

have a way to version this and to evolve

394:51

it over time. If you think about data

394:55

recipes in research, this is how RL

394:59

started to work really well. You

395:01

understood the recipes and how to

395:03

continuously change the recipe to combat

395:06

some of the behaviors that may happen

395:07

around hallucinations around reward

395:09

hacking and then you get to a stack

395:12

which is your final data recipe. We

395:14

don't have that for harnesses. We don't

395:16

have that for like AI systems in the

395:18

general term. So we thought there's

395:20

space for something like that. something

395:23

that contains the evals and contains the

395:25

tweaks and the human judgment and all

395:28

these things that are not predetermined

395:29

at the beginning, but they're defined as

395:32

you learn more about your agent acting

395:34

in the environment.

395:39

We think recipes can be applied to this

395:41

and we should use the same name. So an

395:44

agent recipe is really something that

395:46

enables you to create reproducible

395:48

frontier AI systems. It's something that

395:51

allows you to have a moat that keeps

395:54

getting better over time, which is not

395:56

tied to any platform or any provider.

395:59

It's something that you control, that lives in

396:02

your company and is agnostic to the

396:05

models and providers you use. And loops

396:08

should focus on this. Loops should be

396:10

the way you distill these systems into

396:12

recipes.

396:13

Failure patterns should become judges

396:15

and evals. Repeated behavior should

396:18

become skills and prompts. User

396:19

frustration should become extensions and memories for

396:21

your harness, and so on. We're all

396:24

familiar with this, but we didn't have

396:26

the right terminology for

396:28

how we should think about it and how we

396:30

should define it. And we think recipes

396:32

is a way to put everything together into

396:37

a git repo and treat it as your ongoing

396:42

um strategy for for uh building these

396:45

self-improving systems. So we are

396:48

introspection but you can think of

396:50

introspection as the way you generate

396:51

these recipes. So they're recipes for

396:54

introspecting on your on your system. We

396:57

wanted to build something that is

396:59

portable and provider agnostic. So we

397:01

built our um approach to recipes on the

397:05

Pi harness and on Harbor for evals.

397:09

We baked it into uh git repos so uh

397:13

everything could be versioned and agents

397:16

could have a way to continuously track

397:18

how this changed and why, and it is meant to

397:21

be owned by you but managed by your

397:22

agents. And this is how products should

397:25

really be built going forward. It's

397:26

something that treats the owner as the

397:30

almost like the higher-taste

397:34

um personality in the room. But agents

397:36

should try to calibrate themselves to

397:39

the taste of the maker. So we

397:42

think recipes should be basically

397:44

encoding the taste of the makers into

397:46

how you build these agents. And if I

397:49

want to use someone else's recipe, I

397:51

should be able to also bring that taste.

397:53

It's not just the harness, it's not just

397:55

the model; it's how did you arrive at this

397:57

particular recipe and why? And that's

398:00

kind of what is behind

398:03

reproducible

398:05

products and services around

398:08

agents. Um we have an early release of

398:11

recipes; it's called pi.recipes. It's very

398:15

similar to what skills uh used to be in

398:17

2025, but it goes a step further. And

398:20

this is what do I need to have a

398:22

frontier agent is everything about how

398:25

do I codify taste into evals? How do I

398:27

run? How do we have the loops to

398:30

continuously improve those evals over

398:32

time? How do we process signals and know

398:34

what are the right signals to to use? Um

398:37

what are the right tools to work with

398:39

certain models? How do I have different

398:41

profiles of the harness to work with

398:42

different models? Um and everything in

398:45

between. So have a look at what we've

398:48

been building here. It's still early uh

398:51

but hopefully it's useful enough for you

398:53

guys to to get going. And we feel this

398:55

is going to grow into something that um

398:57

really allows you to

399:00

be able to

399:03

use the taste of different

399:05

makers as recipes for your agent.

399:10

And finally, the last point is valued

399:12

work per watt. And why is this the score

399:14

to really optimize for? Think of how um

399:18

Cursor and Cognition went from building

399:20

the best product to then building the

399:23

best evals for the product and finally

399:25

building the best models based on the

399:28

previous two artifacts. We think this is

399:30

like the recipe for everything going

399:32

forward. Um code was the first domain

399:35

where this um was successful. um

399:38

everything beyond customer support,

399:40

legal research, um everything is going

399:43

to come down to this idea. How much

399:45

value am I getting per watt? Um how do I

399:48

measure the value is the first step and

399:51

how do I know I'm getting a good deal on

399:53

that value is the second. And maybe this

399:56

makes it a bit more clear. We've all

399:59

started from a base harness and a base

400:03

set of evals and we want to go to the

400:05

frontier. Um and you only go through

400:07

that by running these systems in prod.

400:09

There's no way you know what

400:11

the frontier is before you start. Um

400:14

but the last step here, which is

400:17

what is requiring a lot of research um

400:20

is okay once you've reached frontier how

400:22

do we make this um uh economically

400:26

viable which is how do we not spend more

400:29

than than uh we need for generating this

400:31

amount of value. Um, and we think we

400:34

have the building blocks now to make

400:36

this accessible and pretty efficient in

400:39

the sense of you've seen all these

400:41

fine-tuning APIs, all the infrastructure

400:43

that has been abstracted away for you to

400:46

do do this process. It's just the

400:48

know-how that is not there yet. And

400:51

this is what we hope we can

400:53

push for: the know-how for knowing how to

400:55

codify taste into evals and how to

400:58

validate that in experiments. Um, and

401:00

you've heard a lot about

401:02

evals and experiments before, but you

401:04

didn't really think about what

401:06

they are. It's not just tests; it's really

401:09

what is the taste of the creator that

401:11

agents should be able to reproduce and

401:14

self-improve around. And no one has

401:17

thought of how do I make this

401:19

portable enough? How do I make my

401:21

taste as an artist or as a software

401:25

developer um something that anyone can

401:27

download in their brain and be able to

401:29

be a one-to-one replica of me. And this is

401:31

kind of like what RL is about now:

401:34

how do we turn these

401:39

tastemakers into environments and evals

401:42

around them so then we can move them

401:44

into the weights but um there's more

401:46

than that um you can think of the worker

401:49

as the inner loop And it generates all

401:51

these artifacts. But how you look at the

401:54

artifacts and know what to change is the

401:56

taste. Uh and this is what creates

401:59

candidates of what you should change and

402:01

how you should adapt based on that. And

402:04

experiments are how you

402:06

self-calibrate that okay my taste is

402:08

actually validated in production with

402:10

users. And we make sure that not only

402:12

the maker is happy through the um

402:15

offline evals but the end users are

402:17

happy as well and they agree with what

402:19

we consider good.

402:22

Let's go through a practical example of

402:24

how this works.

402:28

Let's take a baseline um agent which

402:30

could be a talent sourcing agent. Um and

402:33

this is a very classical case of

402:37

everyone is doing recruiting differently

402:39

and it's very much

402:41

about not what is good recruiting but

402:45

who is leading that recruiting that

402:47

considers recruiting as good. So in this

402:49

case we're starting with something

402:50

pretty simple. Um a bunch of tools web

402:53

search, LinkedIn, a bunch of subagents

402:56

that have been pre-popularized by

402:59

harnesses like Codex and Claude Code, and

403:01

a system instruction which is about your

403:04

recruiter.

403:06

The first step is really to understand the

403:08

signals. So you can think of patterns as

403:12

being a way to look at the traces,

403:14

extract some common um behaviors or

403:17

common user frustrations and turn them

403:19

into like a cluster. So let's say this

403:22

idea of uh the agent is going uh and

403:25

reaching out to a lot of big tech

403:27

employees. As a recruiter, you don't

403:29

really want that. You want to find

403:30

hidden gems. You don't want to try to

403:32

hire John Carmack. But an agent would

403:36

think, oh, John Carmack is

403:37

great. why would I not reach out to him?

403:39

Um, so, so this is a behavior that you

403:42

you'd never think of codifying, but you

403:45

discover the agent tends to do that. Um,

403:48

patterns is how you discover these

403:50

signals and inform you what you should

403:53

do next.

403:55

Calibration: judges and evals are how

403:59

we used to think about how do we codify

404:02

these behaviors into something

404:05

that can try to uh apply the same

404:07

judgment across traces and across uh

404:09

execution. So let's say we we build an

404:12

agent that looks at a trajectory and um

404:16

identifies exactly that pattern. Hey,

404:18

did did this agent reach out to Google

404:20

employees instead of trying to uh find

404:23

hidden gems on GitHub? Um, and the

404:26

calibration bit and the eval generation

404:28

bit is not that hard. It should be

404:31

doable by agents to build. You just need

404:33

a human in the loop to say, "Hey, um,

404:35

this is the approach we're taking. Do

404:37

you agree with this

404:40

judgment? Do you really agree that we

404:42

should look more towards hidden gems

404:44

rather than reach out to big tech

404:48

employees? And that's about it. You

404:50

don't need the human to actually build

404:52

the evals. You need them to calibrate

404:54

the evals. And agents should be the ones

404:57

that really take the taste of

404:59

the maker and put it into code.
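A minimal sketch of the judge-plus-calibration idea above, assuming a trajectory is a list of step dictionaries with `action` and `employer` keys. The `BIG_TECH` set, the 50% threshold, and both function names are hypothetical; the human's role is only to supply pass/fail labels that the judge is checked against.

```python
# Hypothetical list of employers the maker counts as "big tech".
BIG_TECH = {"google", "meta", "amazon", "apple", "microsoft"}


def judge_big_tech_outreach(trajectory, max_share=0.5):
    """Pass (True) when at most max_share of outreach targets big-tech employees."""
    outreach = [step for step in trajectory if step["action"] == "reach_out"]
    if not outreach:
        return True
    big = sum(1 for step in outreach if step["employer"].lower() in BIG_TECH)
    return big / len(outreach) <= max_share


def agreement(judge, labeled):
    """Fraction of trajectories where the judge's verdict matches the human label."""
    matches = sum(1 for trajectory, human_pass in labeled if judge(trajectory) == human_pass)
    return matches / len(labeled)
```

If agreement with the human labels is low, the threshold or the employer list is what gets revised, which keeps the human calibrating the eval rather than writing it.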

405:01

Once you have this, it's pretty easy to

405:04

create recipe candidates. And this

405:05

should be the the diffs that you really

405:07

want to test. Um, and

405:11

you can have a pretty good offline eval

405:13

set around this, but the the the test

405:15

here is when you go to prod. So, do the

405:19

end users agree with your taste of not

405:22

hitting up um big tech uh employees,

405:26

right? And this is kind of like what you

405:28

want is you build a product that really

405:30

emphasizes your taste and then you you

405:32

make sure that your users appreciate and

405:35

value that taste. And A/B tests have been

405:38

a way to make sure that that's the

405:40

case. So with a multi-armed bandit

405:43

scenario for example you you'd be able

405:45

to do that pretty well. So once you

405:47

validate okay I have great taste and my

405:49

users believe uh I have great taste as

405:51

well that's when you promote and that's

405:53

kind of when you go to to the next

405:55

version of an agent recipe. The secret

405:58

is you keep doing this over and over

406:00

again and you know how to continuously

406:02

codify your taste, what

406:05

good is to you, into an agent that

406:08

can reproduce the same service or

406:10

product uh for other people and they

406:12

also agree you have great taste and you

406:15

have great execution. And this is really

406:16

kind of like the the secret of building

406:18

good loops is okay can can someone

406:20

iterate on my system the way I would?

406:25

You know, a good example here is

406:27

Miranda from The Devil Wears Prada, right?

406:30

What would Miranda do in certain

406:32

cases and you kind of want to codify

406:34

that that thinking into like agents that

406:36

can do the same stuff at a higher level.
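The multi-armed bandit setup mentioned earlier for validating recipe candidates with real users could look roughly like this. It is a sketch using Thompson sampling with Beta(1, 1) priors, which is one common bandit strategy and an assumption here; the talk does not say which algorithm is used. The arm names and the simulated approval rates are made up.

```python
import random


class ThompsonBandit:
    """Thompson sampling over recipe versions with Beta(1, 1) priors."""

    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        self.stats = {arm: [1, 1] for arm in arms}  # [alpha, beta] per arm

    def choose(self):
        # Sample a plausible approval rate per arm and serve the best sample.
        samples = {arm: self.rng.betavariate(a, b) for arm, (a, b) in self.stats.items()}
        return max(samples, key=samples.get)

    def update(self, arm, success):
        self.stats[arm][0 if success else 1] += 1

    def pulls(self, arm):
        alpha, beta = self.stats[arm]
        return alpha + beta - 2  # subtract the prior pseudo-counts


def simulate(true_rates, rounds=2000, seed=0):
    """Simulated users approve each arm at a fixed (made-up) rate."""
    bandit = ThompsonBandit(true_rates, seed=seed)
    users = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.choose()
        bandit.update(arm, users.random() < true_rates[arm])
    return bandit
```

Compared with a fixed 50/50 A/B split, the bandit shifts traffic toward the recipe users prefer while it is still learning, so promotion to the next recipe version can follow once one arm clearly dominates.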

406:40

So the takeaways are this. Um the loop

406:42

is the product. You try to automate

406:43

yourself as a higher-level

406:48

judge and you want to make sure your

406:51

second loop agents are able to apply the

406:54

same judgment to the agents you're

406:56

trying to push to prod. Second bit:

406:59

system distillation is the moat. So, how

407:01

do you continuously inject that taste

407:03

into these workers, and how

407:05

they continuously self-verify and work

407:07

together is uh the biggest thing that

407:10

you should focus on and the faster you

407:12

do it, the faster you

407:14

build a defensible

407:16

approach to becoming a vertical AI

407:19

company. And finally, valued work per

407:22

watt is how you should measure: am I

407:25

making progress or not. So first make

407:28

sure that the work you're

407:30

generating is valuable. Second make sure

407:32

that the economics makes sense and the

407:36

the difference in price is

407:37

basically why people would

407:40

switch away from Claude Code to

407:42

something you provide.

407:44

We've been thinking a lot about these

407:46

ideas and we're building some very

407:48

interesting products around how to

407:49

deploy this in production. We'd love to

407:51

hear from you. would love to get um to

407:54

to understand more about how how certain

407:57

vertical SaaS companies are

407:59

looking to go to prod with um or how

408:02

agent labs have been thinking about this

408:05

idea of creating these auto

408:08

research uh labs around their their own

408:11

products. Um get in touch. Uh we're

408:13

going to be around the block for

408:15

chatting more about this and thank you

408:17

very much.

408:28

tell you a story about a factory that

408:30

taught itself how to remember. Hi, I'm

408:34

Rushab. I run Machinecraft, a

408:36

100-person factory in India. No data science

408:39

team, no ML budget, none of that. And

408:41

somehow we ended up building a 36-agent AI

408:44

system that runs our entire go-to-market.

408:47

I think that's still a little

408:50

ridiculous. Let me show you how it

408:52

happened and why you can do the same

408:54

thing.

408:56

So here's the thing about our company.

408:58

From the outside, it looks like machines

409:00

and metal, but the actual company, the

409:02

part that matters, isn't the machines. It's

409:05

the knowledge. Who the customer is, what

409:08

we quoted them in 2019, why that one

409:11

machine needed that weird custom tweak.

409:13

And for three generations, all of that

409:16

lived in exactly three brains. Initially

409:18

my grandfather's, then my father's, and

409:20

now mine,

409:23

which is a genuinely terrifying way to

409:25

run a company when you sit with it. A

409:29

lot of people have joined us. People

409:31

have left us. The revolving door never

409:33

stopped. And every single time someone

409:36

walked out, a chunk of our brain walked

409:39

out with them.

409:41

We weren't scared of the competitors. We

409:43

were scared of forgetting or waking up

409:45

one day and realizing the whole company

409:48

only existed inside two increasingly

409:51

tired heads.

409:53

So, I had an idea. I'll be honest.

409:57

It sounded insane at first. What if instead of

410:00

writing the knowledge down in some

410:01

document nobody ever reads, what if we

410:04

grew a brain that just held it? Not a

410:08

chatbot you poke at. A twin of the

410:11

company. I didn't hire a sales team. I

410:15

tried to build one.

410:17

A quick detour because you need to know

410:20

how messy this is. We make

410:22

thermoforming machines. They heat up a

410:25

plastic sheet and shape it. Same core

410:28

machine, but it ends up making

410:30

hydroponic farm trays, spa bathtubs, EV

410:33

car panels, medical casings, and even

410:35

packaging.

410:37

Seven totally different worlds, seven

410:39

totally different buyers. So, this brain

410:42

couldn't just memorize a brochure. It

410:45

had to know which universe

410:47

a given customer lives in.

410:50

Step one was almost boringly simple.

410:53

Feed it everything. And I mean

410:55

everything. years of quotes, drawings,

410:58

payment schedules, timelines, email

411:00

threads, hundreds of gigabytes of our

411:03

own private history. Not the public

411:05

internet, our internet.

411:09

And here's the plot twist, the part that

411:11

surprises every engineer I tell this to.

411:14

We never trained a model. No GPUs

411:17

humming in the basement, no fine-tuning.

411:19

We just looked at all the history,

411:22

chopped it into bite-sized chunks, and

411:24

let off-the-shelf models read it and pull

411:27

out the facts. We stored the meaning of

411:30

each chunk as vectors, and relationships,

411:34

who's connected to what, as a graph. The

411:36

brain isn't a smarter model. It's

411:39

actually a really, really well organized

411:42

memory.
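The pipeline just described (chunk the history, extract facts, store meaning as vectors and relationships as a graph) can be sketched without any model training. This is a toy under stated assumptions: the bag-of-words `embed` stands in for a real embedding model, `toy_extract` stands in for the off-the-shelf model that pulls out facts, and the `Memory` class and sample documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def chunk(text, max_words=40):
    """Split a document into bite-sized pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_extract(piece):
    # Toy stand-in for model-based fact extraction: "<X> ordered a <Y>" triples.
    pattern = r"(\w+) ordered (?:a |an )?(\w+)"
    return [(m.group(1), "ordered", m.group(2)) for m in re.finditer(pattern, piece)]


class Memory:
    def __init__(self, extract):
        self.extract = extract
        self.chunks = []                # (text, vector) pairs
        self.graph = defaultdict(set)   # subject -> {(relation, object)}

    def ingest(self, document):
        for piece in chunk(document):
            self.chunks.append((piece, embed(piece)))
            for subject, relation, obj in self.extract(piece):
                self.graph[subject].add((relation, obj))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.chunks, key=lambda c: cosine(q, c[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The vector side answers "what did we say about bathtubs?", while the graph side answers "what is connected to this customer?"; keeping both is what makes the memory organized rather than just searchable.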

411:44

Now, this is where it gets a little

411:46

weird in a good way. We stopped thinking

411:49

of Era as software and started

411:51

thinking of it as something we were

411:52

raising. So we gave it a body modeled on

411:56

biology: senses to figure out who it's

411:59

talking to, a gut to digest the

412:01

documents into facts, a memory, a dream

412:04

cycle, an immune system to fight off bad

412:07

information. Why biology? Well, because

412:10

evolution already spent a billion years

412:12

solving how you stay coherent over

412:15

time? We just copied the homework.

412:18

Okay, so the big question, why 36 agents

412:22

instead of one genius mega prompt?

412:24

Because, and you already know this if

412:26

you've ever tried it, one prompt that's

412:29

supposed to do everything ends up doing

412:31

everything badly. So, Era isn't one mind.

412:35

It's a pantheon. A whole cast of

412:38

specialists.

412:40

Each one has exactly one job. Athena

412:43

runs the room. Prometheus owns the sale.

412:47

Plutus does pricing. Hephaestus knows

412:51

every machine spec.

412:53

Vera fact-checks everything. And Memon,

412:57

my favorite, guards corrections. So the

413:00

second a human fixes something, it stays

413:02

fixed forever. One agent, one job. It's

413:07

a team, not a hero. And here's the cool

413:11

part. They hold meetings. Athena pulls

413:14

in specialists. They actually argue and

413:17

a single answer comes out the other

413:19

side. It's like having a boardroom that

413:21

never sleeps, never gets tired, and

413:23

somehow has no ego.
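A rough sketch of the "one agent, one job" meeting described above: an orchestrator invites only the relevant specialists, a fact-checker filters their drafts, and a single answer comes out. The class, the keyword routing, and the toy specialists are all hypothetical; in the system described, each specialist would be a model-backed agent rather than a lambda.

```python
class Orchestrator:
    """Routes a request to single-purpose specialists and merges approved replies."""

    def __init__(self, specialists, keywords, fact_check):
        self.specialists = specialists  # name -> callable(request) -> draft text
        self.keywords = keywords        # name -> words that pull that specialist in
        self.fact_check = fact_check    # callable(draft) -> bool

    def handle(self, request):
        text = request.lower()
        invited = [name for name, words in self.keywords.items()
                   if any(word in text for word in words)]
        drafts = {name: self.specialists[name](request) for name in invited}
        # Drafts the fact-checker rejects never reach the final answer.
        approved = {name: draft for name, draft in drafts.items() if self.fact_check(draft)}
        return "\n".join(f"{name}: {draft}" for name, draft in approved.items())


# Toy specialists so the sketch runs end to end.
room = Orchestrator(
    specialists={
        "pricing": lambda request: "Quote 10 units at list price.",
        "specs": lambda request: "Max sheet size unknown.",
    },
    keywords={"pricing": ["price", "quote"], "specs": ["spec"]},
    fact_check=lambda draft: "unknown" not in draft,
)
```

Keeping each specialist narrow makes its failures easy to attribute, which is the practical argument against a single prompt that tries to do everything.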

413:26

So, what does all this actually run?

413:29

Honestly, the whole front business,

413:32

everything between a stranger exists

413:35

somewhere and now they're a customer.

413:39

Nine concrete jobs every single day.

413:42

Outbound emails that actually reference

413:44

my real world. Account briefs built from

413:47

cross-checked truths before a call.

413:50

Quotations. A swipe left, swipe right

413:53

mode for outreach. Reviving dead leads,

413:57

which I call blast from the past.

414:01

Inbound replies and figuring out before

414:03

we waste an hour whether a company is

414:05

even a fit. Nine jobs, one operator who

414:09

never sleeps.

414:11

Where does all this live? One cursor

414:14

tab. That's genuinely it. You type and a

414:18

reaches out with a dozen hands, searches

414:20

the knowledge base, reads the inbox,

414:23

drafts the email, builds the code, and

414:25

then shows you before anything actually

414:27

goes out. Under the hood, it's genuinely a

414:31

real stack, not a demo held together

414:34

with tape. Databases for vectors, for

414:37

the relationship graph, for the CRM. Three

414:40

different model providers each picked

414:41

for the job it's actually best for tools

414:44

for Google for swallowing documents for

414:46

every communication channel plus

414:48

monitoring so we can see what it's

414:51

thinking

414:52

all of it Fabric.

415:09

Okay.

415:11

Hi everyone. I'm Arena, former engineer

415:14

at Microsoft and Supercell. And today I

415:17

want to talk about auto research in a

415:20

multi-agent AI village. I will use a

415:23

video-game-like AI village as a running

415:26

example here, but the broader question

415:28

is one I think many AI engineers are

415:31

starting to run into. How do we evaluate

415:35

and improve agents that carry state over

415:38

a long period of time?

415:42

Before I get into the auto research

415:44

layer, I want to talk a bit about

415:46

Project Paradox.

415:48

We developed Project Paradox at

415:50

Supercell's AI Innovation Lab, me and my

415:54

teammate Arnach Manikanden.

415:57

We built a modular AI framework that

416:00

allows any developer to plug in

416:03

intelligent autonomous agents within a

416:06

video game that can interact, compete or

416:09

cooperate with other players or agents

416:11

as well, and make them

416:14

into dynamic game companions.

416:19

Now, to give examples of what these

416:21

agents can do, the agents can move with

416:24

intent. They can go to any location or

416:28

person, and they're guided by their own

416:30

memories, emotion, or curiosity.

416:34

These agents can interact with the

416:36

world. They can pick up objects, drop

416:38

them anywhere, and they're also aware

416:40

of the context in their own

416:42

environment, such as objects or other

416:45

characters or agents as well. I would

416:48

also like to note that game developers

416:51

can also add new actions for these

416:53

agents to accomplish within our

416:55

framework as well. Instead of just

416:57

dropping or uh placing objects,

417:02

agents can also obviously react to

417:05

what's happening around them. And these

417:08

events that happen around them affect

417:11

their own beliefs and emotions on the

417:13

fly as well. And of course, it wouldn't

417:16

be complete if agents can't start

417:19

conversations, right? agents can in this

417:22

scenario approach other agents or even

417:25

the player as well and this makes the

417:28

game feel more alive. And of course

417:31

these conversations are stored within

417:33

their memory, and they affect the

417:35

agents' own emotions,

417:39

beliefs, or goals as well.

417:42

And altogether these agents make up our

417:44

multi-agentic framework.

417:50

Um yeah

417:52

yeah one second

417:55

so the architecture was intentionally

417:59

stateful behind this. The first

418:01

important part was per agent memory.

418:05

Each agent has its own memory namespace

418:09

backed by RAG, so memory did not bleed

418:12

between agents.

418:14

Second, we tracked emotion as a small

418:17

vector. So after an event or

418:20

conversation, the system could update

418:22

values like joy, sadness, fear, anger,

418:26

or disgust.

418:28

Third, agents had belief scores towards

418:32

other agents and the player. You can

418:35

think of this as a trust matrix

418:36

basically: after an interaction

418:39

happens, the LLM decides whether

418:42

the trust score should go up down or

418:45

whether it shouldn't change at all. And

418:47

fourth, every memory receives an

418:50

importance score. To explain this

418:54

better, like let's say you had dinner a

418:56

few days ago, you probably wouldn't

418:58

remember what you had for dinner, right?

419:00

But

419:02

if someone was murdered a few days ago,

419:04

you'd definitely remember that. So the

419:07

LLM will

419:10

evaluate an importance score for an

419:12

event, and if it crosses a threshold, it

419:15

will store that specific memory in a

419:19

separate cache so that important context

419:22

can be retrieved better later on.
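A minimal sketch of the per-agent state described above, under stated assumptions: the class name `AgentState`, the field names, the 0.7 threshold, and the choice to keep important memories in both stores are illustrative and not taken from Project Paradox. Plain lists stand in for the RAG-backed store, and the importance values are hard-coded where the talk has an LLM assign them.

```python
from dataclasses import dataclass, field

EMOTIONS = ("joy", "sadness", "fear", "anger", "disgust")


@dataclass
class AgentState:
    name: str
    # Per-agent memory: every agent owns its own lists, nothing is shared.
    memories: list = field(default_factory=list)
    important_cache: list = field(default_factory=list)
    # Emotion as a small vector, one value per named emotion.
    emotion: dict = field(default_factory=lambda: dict.fromkeys(EMOTIONS, 0.0))
    # Trust toward other agents or the player, keyed by their name.
    trust: dict = field(default_factory=dict)

    def remember(self, text, importance, threshold=0.7):
        """Store a memory; also cache it if importance crosses the threshold."""
        record = {"text": text, "importance": importance}
        self.memories.append(record)
        if importance >= threshold:
            self.important_cache.append(record)

    def update_trust(self, other, delta):
        """Apply a decision made after an interaction: +1, -1, or 0."""
        self.trust[other] = self.trust.get(other, 0) + delta


blossom = AgentState("Blossom")
other = AgentState("Villager")
blossom.remember("Had soup for dinner", importance=0.1)
blossom.remember("Someone was murdered a few days ago", importance=0.95)
blossom.update_trust("player", +1)

print(len(blossom.memories), len(blossom.important_cache))  # 2 1
print(other.memories)  # [] (nothing bled over from Blossom)
```

Using `default_factory` for every mutable field is what keeps one agent's memory from leaking into another's in this sketch.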

419:26

And here's an example of it just

419:29

working. We're going to ask one of the

419:33

characters to go on a picnic with us.

419:35

Here, our character Blossom

419:39

decides to pick up a pastry and go

419:41

to the picnic area because we asked her

419:44

to do so. Keep in mind during the

419:47

conversation in the background, she

419:49

plans the whole sequence of actions

419:51

to accomplish that. And when we talk to

419:54

her afterwards, she will also reply

419:57

within context as well.

420:02

Yeah.

420:04

But this is where an interesting problem

420:08

actually started. As you saw in the last

420:11

example, for short-term gameplay,

420:14

our architecture worked pretty

420:17

well. A character could make a

420:19

plan, move around, talk and remember the

420:22

recent interaction and respond to us or

420:25

other characters as well. But over

420:28

longer horizons,

420:30

this is where we noticed social

420:32

consistency starting to weaken. So in

420:35

this example, we have one agent

420:37

spreading a rumor about a sale on

420:40

mangoes to another agent and that agent

420:43

receives that information and goes and

420:45

tells another agent about it. Later on,

420:49

after a number of events occurred in

420:52

between, when the player asks one of the

420:54

agents about the mangoes, it hasn't

420:57

stored the context that we were

421:00

expecting, or it doesn't give us the

421:01

context that we wanted. And

421:05

this is where things are starting to get

421:07

messy, naturally. The system may

421:11

remember the rough topic but lose the

421:14

source of the topic. A rumor may become

421:16

certain instead of staying a rumor: the

421:20

agent might state it as a fact. Or an

421:23

agent might know a fact but fail to

421:26

remember it while

421:27

creating a plan for its actions. So the

421:31

question here became how do we improve a

421:34

multi-agentic system over long-running

421:37

social behavior and not just over one

421:40

response.

421:42

And this is where we wanted to bring in

421:46

auto research. As you all know, a few

421:49

months ago, Karpathy posted auto

421:52

research, and this made us

421:55

immediately very curious. Perhaps we

421:58

could make the system run experiments

422:02

on itself. Could we use this for our

422:04

system as well? What we understood is

422:08

instead of manually tuning a prompt or

422:10

watching one nice demo, we could define

422:14

a scenario suite, run the agents,

422:17

collect traces, score the behavior and

422:20

change a small policy surface and only

422:23

keep the changes that actually improve

422:25

the score. And this is where we're

422:28

trying to bridge Project Paradox with

422:30

auto research. So at this point,

422:33

our multi-agentic framework

422:36

Project Paradox is more like a lab bench

422:39

and auto research becomes the

422:41

experimental loop around it. And

422:44

importantly this is not only about

422:46

improving RAG retrieval. The broader

422:49

framing is optimizing the agent protocol

422:52

like how do agents write memories,

422:55

retrieve them, communicate uncertainty,

422:58

update trust, attribute sources, and

423:00

replan around new facts.

423:07

In this context,

423:14

auto research is

423:16

not another agent in the village. Like

423:19

I said, it's a meta system outside the

423:22

village. The villagers have local

423:25

perspectives, of course. They only know

423:28

what they saw, heard, remembered, or

423:31

inferred, because there isn't a common

423:33

memory database between them.

423:36

Information only travels when

423:39

agents communicate it to each other.

423:42

The auto research layer has a different

423:44

job here. It reads the full traces of a

423:47

run, compares what happened against the

423:50

scenario ground truth,

423:53

scores the behavior, and proposes a

423:56

constrained

423:58

change to the agent protocol or

424:00

cognitive policy. Then it reruns the

424:03

scenario and asks:

424:06

did society-level behavior get

424:08

better? This is the key shift we were

424:11

trying to look for. So we were no longer

424:14

evaluating one answer. We were

424:16

evaluating an entire run.

424:20

And this is what one of the loops would

424:23

look like. First we define a

424:26

controlled scenario, which I'll elaborate a

424:28

bit more about later. For example, one

424:31

agent learns a public fact or one agent

424:34

hears a rumor. Uh that could be a

424:37

controlled scenario. Then we run the

424:40

simulation. During the run, we collect

424:43

structured traces, observations,

424:46

conversations, memory writes,

424:48

retrievals, belief updates: whatever is

424:50

relevant to us in that case, we collect.

424:54

Then we score this behavior. Did the

424:57

information spread as we expected it to?

425:00

Did the source attribution survive? Such

425:02

as, does the agent remember who started

425:05

the rumor? Did uncertainty stay

425:08

uncertain? Did agents act on what they

425:11

actually knew? And then the auto

425:14

research layer here proposes a small

425:17

policy change. And this is important. It

425:21

should not rewrite the whole

425:22

application. Of course, it should only

425:25

edit a controlled policy surface. And

425:29

then we rerun. If the score improves and

425:32

the guardrails hold, we keep the

425:35

improvement. And if not, we simply

425:37

revert back.
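The loop just described can be sketched as a keep-or-revert ratchet. Everything concrete here is an assumption for illustration: `run_scenario_suite` is a toy stand-in for the frozen simulation harness, the policy keys `store_source` and `share_radius` are invented, and the proposals are a fixed list where the talk has an LLM propose changes after reading the traces.

```python
def run_scenario_suite(policy):
    """Toy stand-in for the frozen harness: maps a policy to a scorecard."""
    source_retention = 0.9 if policy["store_source"] else 0.4
    reach = min(1.0, 0.3 * policy["share_radius"])
    return {
        "score": round((source_retention + reach) / 2, 3),
        "privacy_leaks": max(0, policy["share_radius"] - 2),  # guardrail metric
    }


# Hypothetical proposals, each a small edit to the policy surface.
PROPOSALS = [
    {"share_radius": 5},     # spreads facts widely, but leaks
    {"store_source": True},  # keeps provenance in memory writes
    {"share_radius": 2},     # modestly wider sharing
]


def ratchet(policy, proposals):
    """Try each change; keep it only if the score improves and guardrails hold."""
    best = run_scenario_suite(policy)
    log = []
    for patch in proposals:
        candidate = {**policy, **patch}
        result = run_scenario_suite(candidate)
        keep = result["score"] > best["score"] and result["privacy_leaks"] == 0
        if keep:
            policy, best = candidate, result
        log.append((patch, "keep" if keep else "revert"))
    return policy, best, log


start = {"store_source": False, "share_radius": 1}
final_policy, final_result, log = ratchet(start, PROPOSALS)
for patch, decision in log:
    print(patch, decision)
```

In this toy run the first proposal has the highest raw score but is reverted because it trips the privacy guardrail, which is the behavior the talk asks for.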

425:41

And talking about controlled scenarios,

425:44

the reason why scenario design

425:48

matters is that social behavior is

425:51

otherwise a bit fuzzy in general, in

425:55

the sense that if you just let the agents in

425:58

our environment wander around, it might

426:01

look cool and you might get nice

426:03

interactions, but it's actually very

426:06

hard to evaluate whether the system

426:08

actually improved. So this is why we

426:12

believe you need controlled scenarios.

426:15

For example, one scenario could test

426:18

public fact diffusion. Let's say agent A

426:21

learns the bakery will close

426:24

tomorrow. Do the right agents learn it?

426:27

Do they remember who said what? Do they

426:30

change their plans based on this

426:33

fact? Another scenario could test rumor

426:36

uncertainty. Let's say agent A

426:41

hears that agent C might leave the

426:43

village. When this rumor spreads, does

426:46

"might leave" suddenly become "is leaving",

426:50

or does it stay as "might leave"?

426:53

Does it become a fact, or does it

426:55

stay a rumor?

426:58

Another scenario could test replanning.

427:02

The group has a plan, but one agent

427:04

learns that, say, the route they wanted

427:07

to take is blocked. Do agents update

427:11

their plans and communicate this to each

427:13

other to avoid an improper plan or

427:16

stale actions?

427:18

The point is not that these exact

427:20

scenarios are universal here. The point

427:24

we're trying to make is that

427:26

long-horizon agent behavior needs scenario

427:28

suites.
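One way to make such a scenario checkable is to write its ground truth down as data and compare the end state against it. This is a sketch under assumptions: the `Scenario` fields and the `check_rumor_uncertainty` helper are hypothetical names, and the final beliefs are supplied by hand rather than read from a real run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    seed_agent: str
    claim: str
    certainty: str  # ground truth: "fact" or "rumor"


def check_rumor_uncertainty(scenario, final_beliefs):
    """final_beliefs maps agent name -> certainty the agent ended the run with."""
    hardened = sorted(
        agent
        for agent, certainty in final_beliefs.items()
        if scenario.certainty == "rumor" and certainty == "fact"
    )
    return {"hardened": hardened, "passed": not hardened}


rumor = Scenario(
    name="rumor_uncertainty",
    seed_agent="A",
    claim="agent C might leave the village",
    certainty="rumor",
)
report = check_rumor_uncertainty(rumor, {"A": "rumor", "B": "rumor", "D": "fact"})
print(report)  # {'hardened': ['D'], 'passed': False}
```

The scenario is frozen so that the search loop cannot edit the ground truth it is scored against.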

427:31

And talking about our mango example

427:34

again, after running one of our auto

427:37

research loops, this time after a

427:42

long period of time, when the player

427:45

finally asked one of the agents about

427:46

the sale on mangoes, we did find that

427:51

the agent was able to respond within

427:54

context this time, compared to last

427:58

time.

428:01

For this talk, the

428:06

exact formula, we believe, is less

428:09

important than the shape of the

428:10

scorecard.

428:12

You do not want a single vague

428:16

metric like "agent quality". This will

428:19

hide all the interesting failures.

428:21

Instead you want a balanced scorecard.

428:25

For diffusion, you might measure reach

428:28

like how many agents know the fact after

428:30

N steps. For provenance, you measure

428:34

source retention: among agents who know

428:37

it, how many remember where it came

428:40

from. For rumors, you can measure

428:43

uncertainty preservation and false-certainty

428:46

rate. For planning, you can measure

428:49

action consistency and time to replan.

428:52

And for privacy, you can measure

428:54

containment. This matters because

428:56

optimizing only one metric can create

429:00

bad behavior. For example, if you

429:03

only optimize for diffusion, the agents

429:05

may learn to overshare everything. And

429:09

if you only optimize for

429:10

memory recall, you might create noisy or

429:13

stale memories. So this

429:17

scorecard is what keeps the system

429:19

honest and prevents the auto research

429:23

agent from gaming the system to just

429:26

increase one specific score.
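A small sketch of what a multi-metric scorecard could look like for a rumor scenario. The talk says the exact formula matters less than the shape, so the formulas here are assumptions: reach is the fraction of agents who know the claim, source retention is the fraction of knowers who still name the true source, and the false-certainty rate is the fraction of knowers who state a rumor as fact. The agent records are hand-written.

```python
def scorecard(agents, true_source, claim_is_rumor):
    """Each agent is a dict with keys: knows, source, stated_as."""
    knowers = [a for a in agents if a["knows"]]
    n = len(knowers)
    reach = n / len(agents)
    source_retention = (
        sum(a["source"] == true_source for a in knowers) / n if n else 0.0
    )
    false_certainty = (
        sum(a["stated_as"] == "fact" for a in knowers) / n
        if n and claim_is_rumor
        else 0.0
    )
    return {
        "reach": reach,
        "source_retention": source_retention,
        "false_certainty_rate": false_certainty,
    }


agents = [
    {"knows": True, "source": "A", "stated_as": "rumor"},
    {"knows": True, "source": "A", "stated_as": "fact"},   # rumor hardened
    {"knows": True, "source": None, "stated_as": "rumor"},  # source lost
    {"knows": False, "source": None, "stated_as": None},
]
card = scorecard(agents, true_source="A", claim_is_rumor=True)
for metric, value in card.items():
    print(f"{metric}: {value:.2f}")
```

Keeping the metrics separate rather than summing them is the point: a change that raises reach while also raising the false-certainty rate stays visible.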

429:31

The other important engineering lesson

429:33

that we learned over this project is

429:36

that it's important to keep the

429:40

editable surface really small. The auto

429:44

research layer should not have

429:46

permission to randomly rewrite the whole

429:48

codebase. Instead, it's really important

429:52

to freeze the harness, the scenarios,

429:55

and the metrics. So, we're only exposing

429:58

the part of the system that we actually

430:01

want to optimize. In Project

430:04

Paradox, for us, that meant things like

430:06

memory writing policy, retrieval policy,

430:10

communication prompt, belief and trust

430:13

rules, source attribution, replanning

430:15

triggers, etc.

430:17

This gives the search process room

430:20

to improve behavior, but it also

430:22

prevents it from gaming the evaluation

430:24

directly as we mentioned before. And

430:27

this is the difference between the LLM

430:29

writing random patches versus the LLM

430:32

actually searching within a controlled

430:35

policy space.
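A sketch of how a small editable surface could be enforced in code. The key names mirror the list given in the talk, but the `apply_patch` helper and the policy values are hypothetical: any proposed patch that touches a key outside the allow-list is rejected before it reaches the harness.

```python
EDITABLE_KEYS = frozenset({
    "memory_write_policy",
    "retrieval_policy",
    "communication_prompt",
    "trust_rules",
    "source_attribution",
    "replanning_triggers",
})


def apply_patch(policy, patch):
    """Return a new policy with the patch applied, or refuse frozen keys."""
    illegal = sorted(set(patch) - EDITABLE_KEYS)
    if illegal:
        raise PermissionError(f"patch touches frozen keys: {illegal}")
    return {**policy, **patch}


policy = {"retrieval_policy": "top_k=3", "source_attribution": "off"}
policy = apply_patch(policy, {"source_attribution": "keep_in_summaries"})
print(policy)

try:
    apply_patch(policy, {"metrics": "only_reach"})  # tries to edit the evaluation
except PermissionError as err:
    print("rejected:", err)
```

Harness, scenarios, and metrics never appear in `EDITABLE_KEYS`, so the proposer has no path to changing how it is scored.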

430:39

And here are examples of the kind

430:42

of changes I want this kind of loop to

430:45

search over. If source attribution

430:48

disappears, the policy change might be:

430:51

preserve source in

430:55

memory writes and summaries. If

430:58

rumors harden into facts, the policy

431:00

change might be: store confidence, mark

431:03

firsthand versus secondhand, and require

431:06

hedging when retelling uncertain claims.

431:09

If public facts stay local,

431:12

the policy change might be: classify

431:14

useful public facts differently and make

431:17

agents proactively share important

431:19

source evidence.

431:21

The key is that these are small changes

431:24

to the agent protocol, but they can have

431:27

larger effects on society-level

431:30

behavior for multi-agentic systems. This

431:33

is also where I want to be

431:35

careful about our claims here, because

431:38

we believe that without repeated current

431:41

loop results, I wouldn't say the

431:44

system just

431:46

generally improved. We're trying to say

431:49

this is the right kind of surface to

431:51

expose to an auto research loop

431:54

because it is small enough to control

431:57

but it's still rich enough to change the

431:59

social behavior to some extent at least.

432:03

And the biggest lesson for me perhaps

432:07

was that memory is not enough here. You

432:11

can add a RAG memory to an agent and

432:14

still not get the correct

432:17

long-horizon behavior that you were looking

432:19

for, because agents sometimes need to

432:23

know where that information

432:26

came from. You need to preserve whether

432:28

it was firsthand, secondhand, verified

432:31

or uncertain. Sometimes you need to

432:33

separate raw episodic memories from what

432:35

the agent currently believes too. And

432:38

you need to test behavior through

432:40

scenarios, not just through vibes.
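A sketch of a memory record that carries provenance along with the claim. The `Memory` fields and the `pass_on` and `retell` helpers are assumptions for illustration, not the Project Paradox schema. The idea shown is that retelling downgrades a memory to secondhand and unverified while keeping the original source, and that anything short of firsthand and verified is retold with a hedge.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Memory:
    claim: str
    origin: str      # who first made the claim
    heard_from: str  # who told this agent
    firsthand: bool
    verified: bool


def pass_on(memory, teller):
    """What a listener stores when `teller` repeats a memory: hearsay, origin kept."""
    return replace(memory, heard_from=teller, firsthand=False, verified=False)


def retell(memory):
    """Hedge anything that is not both firsthand and verified."""
    if memory.firsthand and memory.verified:
        return memory.claim
    return (
        f"I heard from {memory.heard_from} that {memory.claim}, "
        "but I can't confirm it."
    )


seen = Memory("mangoes are on sale", origin="A", heard_from="A",
              firsthand=True, verified=True)
at_b = pass_on(seen, teller="A")   # agent B hears it from A
at_c = pass_on(at_b, teller="B")   # agent C hears it from B

print(retell(seen))
print(retell(at_c))
print(at_c.origin)  # A
```

Separating `origin` from `heard_from` is what lets a later question such as "who started the rumor?" still be answered two hops away.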

432:43

The other lesson is that rollback

432:47

is not optional either. When you optimize

432:50

social behavior, a change can improve

432:52

one thing and damage another. So, a

432:55

policy that spreads public facts

432:58

faster might also leak private

433:00

information. A policy that increases

433:02

recall might increase stale memory

433:04

usage. So, the loop should basically be

433:08

like a ratchet. Try a change, score it,

433:11

keep it only if the scorecard improves

433:14

and guardrails hold.

433:18

And we definitely believe this is not

433:21

only relevant for game agents,

433:25

although I gave you an example using a

433:27

game village. Take,

433:31

for example, support agents. Support

433:33

agents need to know which policy update

433:35

comes from where, and whether it

433:37

supersedes an older answer. Personal

433:40

assistants for example need to remember

433:42

commitments that they previously made

433:44

and make corrections if the user

433:49

wants to change those

433:51

commitments. Research agents need

433:54

provenance, citations, contradiction

433:56

handling and hypothesis updates. Coding

434:00

agents need long-running context across

434:02

issues, files, teammates and changing

434:05

requirements. Workflow agents need

434:07

access controls, handoffs, and

434:09

replanning when the world changes. All

434:12

of these systems have the same

434:14

underlying problem. They maintain state

434:17

over time, and that state affects

434:20

future action.

434:23

So they need controlled scenarios and

434:25

behavioral scorecards. That is what we are

434:28

proposing.

434:30

So again, in brief, a recipe for

434:34

long-horizon agents. If there is one

434:36

practical recipe I want you to take

434:38

away, it's this: freeze the harness, define

434:42

scenarios, log traces, score behavior,

434:46

and expose only a small policy surface.

434:50

Search over these changes, keep only

434:52

changes that survive your measurement.

434:55

And this is an engineering pattern that

434:58

we believe makes sense for

435:01

long-running agents. The real question, we

435:04

believe, is: across controlled runs, does

435:07

the system behave better?

435:11

To close, Project Paradox started as an

435:14

attempt to make game agents feel alive

435:16

in a 3D world. But the deeper engineering

435:19

problem was not animation or dialogue

435:21

for us. It was the state: which

435:25

agent knows what, which agent told whom,

435:29

what is true, uncertain or outdated. And

435:32

do agents act on what they remember?

435:35

Auto research gave us a

435:38

way to approach this a bit more

435:39

systematically. Not by trusting one demo

435:42

and not by endlessly hand-tuning prompts,

435:46

but by running controlled experiments and

435:48

keeping only the changes that survived

435:50

our measurement. Long-horizon agents

435:53

need experiments and not just prompts.

435:55

And I hope that's the takeaway that you

435:58

get from this talk. And yes, please do

436:01

connect with us. We'd love to talk if

436:03

you have any questions. Thank you so

436:05

much for listening.

436:22

Hi, I'm Amole, CEO of Nori Agentic. We

436:27

deploy an AI employee that understands

436:29

your company, your code, docs, Slack,

436:33

and other kinds of data. We spend a lot

436:36

of time thinking about how coding agents

436:38

really work. Most people think coding

436:41

agents only write code, but if you ask

436:43

me, that's just bad marketing. Forget

436:46

the name for a second. Coding agents can

436:48

do almost anything. There's just one

436:51

trick. You have to be able to think like

436:53

an agent to get it to do what you want

436:55

it to do.

436:56

Today we're going to talk about how we

436:58

use coding agents to do something most

437:00

people think agents are terrible at.

437:02

Make visual artifacts like slides, docs,

437:06

and yeah, even video.

437:11

Every day, the world pours something

437:13

like 34,000 human years into making

437:16

slide decks. Most of that time isn't the

437:19

thinking, it's the fiddling. A deck that

437:22

takes 10 hours should really take about

437:24

25 minutes once you remove all the

437:27

formatting and the branding and the

437:28

moving things around. Say you need to

437:31

make a slide. What do you do? You open a

437:34

tool, PowerPoint, Slides, Figma, Canva,

437:38

and then you start manipulating a

437:40

canvas. Every one of these tools is

437:42

built for human hands and human eyes.

437:45

Click, drag, drop, resize, snap to grid.

437:49

All motions and patterns that make sense

437:51

for our geospatial view of the world.

437:53

There is a data structure underneath,

437:55

but it's in a format that only the

437:57

application can read. What happens when

437:59

you hand these tools to an agent? Well,

438:02

the output comes out all wrong. Things

438:05

overlap in weird ways. You can't see the

438:07

text. There's no alignment. It's just

438:09

garbage.

438:10

AI skeptics say that it's not just the

438:13

tools. agents fundamentally can't reason

438:15

about space. And there are whole

438:17

benchmarks like ARC-AGI that are built

438:20

exactly around that premise. There's a

438:22

famous little test for this from

438:24

developer Simon Willison. He asks every

438:27

new model the same thing. Can you draw a

438:30

pelican riding a bicycle? But there's a

438:33

trick. The agent is only allowed to use

438:35

SVG. It's a quick gut check for whether

438:38

a model can reason about space at all.

438:41

Here are some examples of what the

438:42

models actually give you on this test.

438:44

And yeah, these are pretty bad. Like

438:48

genuinely, deeply really bad. So, does

438:52

that mean it's hopeless? Agents are just

438:53

doomed to be bad at graphics? No, I

438:56

don't think so. If you ask me, it's not

438:59

the model, it's the medium. If I asked

439:02

you, someone who is presumably human, to

439:05

handwrite an SVG of a pelican, you

439:08

wouldn't be able to do that either. SVGs

439:10

are just a wall of numbers. You can't go

439:13

from a wall of numbers to a pelican. You

439:15

just can't see that way. That's just not

439:17

how people think. We think graphically.

439:21

So, we build tools that let us draw on a

439:23

canvas. Figma MCPs, PowerPoint CLIs,

439:26

screenshot and replace loops. What do

439:29

all of these agent tools have in common?

439:31

They all approach the problem like a

439:34

human. But an AI is not a human. Asking

439:37

an AI to use a canvas is like asking a

439:40

human to write SVG by hand. It doesn't

439:42

really make sense. You need to give the

439:44

AI tools based on how it thinks, not in

439:47

pixels, in language. Words, tokens,

439:51

structure. That is its native medium.

439:54

Imagine a language that's incredible at

439:57

describing layout, that models have seen

439:59

and trained on billions of examples of

440:02

that they understand intuitively, that

440:05

renders to pixels and can run

440:07

everywhere.

440:09

Oh, right. HTML lets a model think in

440:13

structure. HTML tags have meanings built

440:16

into the language, a heading, a chart, a

440:18

grid, and the browser turns it all into

440:20

pixels. So, the model never actually

440:23

places a coordinate. And you can get all

440:25

sorts of visual effects, charts and

440:27

layouts, fonts and motion, all of it for

440:30

free. Remember that pelican from

440:32

earlier? Now ask it to do the same exact

440:35

task, but in HTML. Same bird, but now

440:39

it's in a structure that the model can

440:40

reason about. And you can read and theme

440:43

and edit every single line of it.
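To make the HTML-as-editing-format idea concrete, here is a minimal sketch that renders slides as semantic HTML and leaves all positioning to CSS. The helper names `render_slide` and `render_deck`, the 1280x720 slide size, and the styles are assumptions for illustration, not the tooling described in the talk.

```python
from html import escape

# Layout lives in CSS; the slide markup itself never mentions a coordinate.
CSS = """
.slide { width: 1280px; height: 720px; display: grid;
         grid-template-rows: auto 1fr; gap: 24px; padding: 64px;
         box-sizing: border-box; font-family: sans-serif; }
.slide h1 { margin: 0; font-size: 56px; }
.slide ul { margin: 0; font-size: 32px; line-height: 1.5; }
"""


def render_slide(title, bullets):
    """One slide: a heading and a bullet list, as plain semantic HTML."""
    items = "\n".join(f"    <li>{escape(b)}</li>" for b in bullets)
    return (
        '<section class="slide">\n'
        f"  <h1>{escape(title)}</h1>\n"
        f"  <ul>\n{items}\n  </ul>\n"
        "</section>"
    )


def render_deck(slides):
    """A whole deck: one HTML document with one <section> per slide."""
    body = "\n".join(render_slide(title, bullets) for title, bullets in slides)
    return (
        "<!doctype html>\n<html>\n"
        f"<head><style>{CSS}</style></head>\n"
        f"<body>\n{body}\n</body>\n</html>"
    )


deck = render_deck([
    ("Q3 <review>", ["Revenue & growth", "Hiring plan"]),
    ("Next steps", ["Ship the pilot"]),
])
print(deck)
```

Because the output is ordinary text, every line of it can be read, themed, and edited directly, and a browser does the work of turning it into pixels.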

440:46

I spent my whole life building slide

440:48

decks with PowerPoint. So, I always

440:50

thought that those two things, slide

440:52

decks and PowerPoint, were synonyms. But

440:54

that's just not really true, is it?

440:57

PowerPoint is a tool that you use to

440:59

make slide decks. The deck itself,

441:02

that's just the presentation mode. And

441:05

as it turns out, no one in your audience

441:07

is going to care how you got to the

441:08

presentation mode. The editing format is

441:11

totally arbitrary. So you can just pick

441:14

the editing format that the agents are

441:15

already good at, HTML, and, if you need to,

441:19

render to a different format like PDF

441:21

later on. We use this HTML trick to

441:24

build all of our slide decks, our board

441:27

decks and our sales decks. These are

441:29

real things that we actually present and

441:30

send out constantly. We use it for our

441:33

docs, too. It gives our docs color and

441:36

vibrancy all while following our brand.

441:39

And of course, we also use it to make

441:41

videos like this one. What you're

441:44

watching is just HTML and CSS. It's

441:47

literally just divs all the way down.

441:52

Almost everything is better with a

441:54

little structure and a little bit of

441:56

color. Plain text is a choice, generally

441:59

a choice of convenience, but it's

442:00

usually the wrong one if you're actually

442:02

trying to create something of use.

442:05

Now, I do want to take a quick beat here

442:07

and point out that a beautiful deck on

442:09

its own is generally not worth anything.

442:12

You still have to go and get all of that

442:13

content, all of the things that actually

442:15

populate that deck, right? Well, again,

442:19

we can think like the model. If you just

442:21

give the model access to your data, say

442:24

your call transcripts or your emails,

442:26

you can have the model build the deck

442:28

end to end. Let your agents do all the

442:30

grunt work while you focus on vision and

442:33

story. That's what Nory Sessions lets

442:35

you do. I've built entire board decks

442:38

from my phone on the subway during my

442:40

commute. Why? Because our Norybot lives

442:43

in the fabric of our company. Of course,

442:46

Nory ships with everything you need to

442:47

make this all work. So, don't bother

442:49

reinventing the wheel. That's my little

442:52

spiel. Thanks for listening. If you have

442:54

just one takeaway, it's this. Stop

442:58

thinking like a user. Think like the

443:00

model. Give it the right language. And

443:02

for graphics, all you need is HTML.

443:11

Hi everyone, 10X. You feel it yet? Hi,

443:15

my name is Zion and I've been a mobile

443:18

software engineer for the last 14 years,

443:20

and I'm here to talk to you today about

443:22

10X, reimagining the mobile dev

443:24

workflow.

443:26

So, you know, back in the old times when

443:28

cursor was that thing you move with your

443:30

mouse and AI agents were that dystopian

443:33

character from sci-fi books or movies,

443:36

whatever fits your style, you know, just

443:38

a few months back then when we thought

443:40

that we will still be using our IDE just

443:42

maybe slightly better. And now we know

443:45

that we've already switched to

443:48

chat-style engineering, where we discuss

443:51

with Claude Code, Codex, Cursor, whatever,

443:54

and we just tell them what to do, and we

443:56

don't use our IDEs unless it's for

443:58

debugging or something that the agent

444:00

couldn't figure out and that in theory

444:03

should have made us 10 times more

444:04

productive right that's what everybody

444:06

says, right? But are we 10 times more

444:08

productive? Do you feel it? I don't know,

444:10

because I can't feel that we are 10

444:12

times more productive, not as a single

444:14

engineer, not as a whole group, and

444:17

not as a whole company. So why is

444:20

that? Why don't we see the promise of

444:23

10 times more productive come to

444:25

life?

444:27

So you know they tell the story about

444:29

how when factories switched from steam

444:32

engines to electric engines at first

444:34

they didn't see that big of a gain. So

444:36

yeah, the electric engines were better.

444:39

They were more efficient, but they

444:41

didn't see the 10x, 20x, 30x

444:45

productivity gain that they had been

444:47

promised. And the reason for that was

444:50

that they only replaced the steam engine

444:53

with the electric engine. But the real

444:56

gain came some years afterwards when

444:58

they understood that it's not only about

445:01

changing the engine, it's about changing

445:03

the whole workflow. Because you see,

445:05

they used to have like one giant big

445:08

steam engine in the factory and all of

445:10

the machines were arranged

445:14

based on their power consumption and

445:16

their proximity to that steam engine.

445:20

So it wasn't organized by the workflow

445:24

that it should have been like from the

445:26

start to the end of the workflow. No, it

445:29

was designed by proximity to that

445:32

central engine. When they realized that

445:35

and they also realized that they could

445:36

take the electric engine, make it

445:38

smaller and put it inside each machine

445:41

and then they rearranged the factory to

445:44

make it work as the workflow should

445:46

because now it was made possible. Then

445:49

the real gain came. Now they were 10

445:52

times, 20 times, 30 times more

445:54

productive than they were before. Not

445:56

only because of changing the engine, but

446:00

because of changing the whole workflow. And that

446:03

is what I want to talk to you about

446:04

today. Let's think about how AI makes things

446:06

that weren't possible before possible

446:09

now, so we can change our workflow and

446:11

then become 10 times, 20 times more

446:13

productive.

446:15

To do that, let's look at the current

446:18

workflows. The PMs have an idea. They

446:20

iterate with the designers. They iterate

446:22

with the user. They iterate with the

446:24

dev. Then they go back to the designer.

446:28

Then they iterate with the QA. And they

446:30

iterate back with the dev. And maybe

446:32

after all those iterations maybe you

446:35

have something in production.

446:38

So what was that word that was repeating

446:40

so many times? Yeah, iteration. And this

446:44

is the problem

446:46

because iteration creates friction.

446:50

Each iteration creates a context switch,

446:53

creates wasted time, creates communication

446:56

that needs to be done,

446:58

synchronization that needs to be done.

447:01

And AI didn't eliminate all of that. AI

447:04

sped up coding but didn't eliminate the

447:06

friction, didn't eliminate the iteration.

447:10

Why is that? So let us reimagine what we

447:15

could do. Bear with me for a moment.

447:18

What if, instead of using one

447:21

tool for designing, another one for

447:23

testing, another one for coding, and

447:24

then another one for releasing, what if

447:27

we could use one tool, one codebase?

447:29

What if instead of designing on Figma,

447:31

then sending a design doc to the

447:33

developer in order for them to figure

447:35

out how to bring those designs

447:38

to life? What if designers could actually

447:41

design on code and then send the

447:44

developer a PR? What if QA could iterate

447:47

with the agent itself, just getting a

447:49

link with the simulator and they can

447:52

tell the agent exactly what to test,

447:53

what to be cautious of, and if they find

447:57

something, exactly what to fix?

448:00

What if we could make the dev workflow

448:03

work on the code itself? What if God

448:06

was one of us? No, sorry, I got carried

448:08

away there. And you're probably asking,

448:12

how can we do all of it? So one way

448:14

would be to tell everyone to just

448:16

download Xcode and

448:18

Android Studio, and teach designers and

448:20

PMs and QA how to build and how to

448:24

test on simulators and emulators, and bloat

448:27

their laptops with 200 GB of

448:30

storage and whatever they do

448:32

to our memory. That's one way.

448:36

But let me guess that most of them would

448:38

reject that idea, and for good reason.

448:42

So we can make another way. Maybe we

448:45

just put it in our CI, right? So we let

448:47

the agent iterate with the CI so they

448:50

don't have to download Android Studio

448:51

and Xcode and everything.

448:54

But you actually know that CI builds

448:56

take between 20 to 40 minutes. And we

448:59

can't actually let our agent wait for 40

449:01

minutes just to understand that the iOS

449:03

code that it pushed actually failed to

449:05

build.

449:07

So what else? What can we use?

449:10

Introducing cloud sandboxes.

449:14

So cloud sandboxes are actually a concept

449:16

that has been around already for many

449:19

years, just not for mobile development

449:21

yet.

449:23

Using cloud sandboxes, you can tell the

449:25

agent: here's a CLI. Talk to

449:28

the CLI. Create a VM, a small VM that

449:32

runs only for this iteration.

449:36

The VM boots up in 30 seconds or less.

449:38

Make the build. show them a simulator on

449:41

their inapp browser in the cloud code,

449:44

codex, cursor, whatever. And then they

449:47

can iterate over it, tell you it to

449:49

change that pattern, uh to go back and

449:51

test something and change the code and

449:54

they push and open a PR and then the

449:57

designer can work on code, send a PR to

450:00

the developer after they're done.

450:02

Developers iterate and spin up one,

450:05

two, three, four different VMs to run

450:09

in parallel. They send the PR for

450:12

review. QA can take it from there and

450:14

tell the agent exactly what to test and

450:17

tell it what to fix and from there it

450:20

goes straight to the stores for review.
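
The per-iteration sandbox loop described here can be sketched roughly as follows. This is only an illustration, not the speaker's product or any real CLI: `Sandbox`, `ephemeral_sandbox`, and `iterate` are hypothetical names, and the "build" is simulated in memory so the example runs anywhere.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """Hypothetical stand-in for a short-lived cloud VM."""
    name: str
    log: list = field(default_factory=list)
    alive: bool = True

    def build(self, change: str) -> bool:
        # Simulated build: fails when the change text contains "broken".
        ok = "broken" not in change
        self.log.append(f"build {change!r}: {'ok' if ok else 'failed'}")
        return ok


@contextmanager
def ephemeral_sandbox(name: str):
    """Create a sandbox for one iteration and always tear it down."""
    vm = Sandbox(name)
    try:
        yield vm
    finally:
        vm.alive = False  # the VM lives only for this iteration


def iterate(changes):
    """Try each change in its own sandbox; return the first that builds."""
    for i, change in enumerate(changes, start=1):
        with ephemeral_sandbox(f"vm-{i}") as vm:
            if vm.build(change):
                return change, vm
    return None, None


accepted, vm = iterate(["broken layout", "fixed layout"])
print(accepted, vm.name, vm.alive)
```

The point of the sketch is the shape of the loop: a failed build costs one short-lived sandbox rather than a full CI run, and each attempt is torn down whether or not it succeeds.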

450:24

So let's see it. Let's see how it should

450:26

work.

450:29

So imagine you see this screen. Imagine

450:31

you're inside Codex for example. You

450:33

have the chat interface to your left.

450:35

You have the actual app to your right.

450:37

The designer is iterating with the

450:40

agent, telling it exactly what they want

450:42

it to do and what they want to change, and

450:44

seeing the changes immediately on their

450:48

screen. Build time is faster. It's done

450:50

on the cloud and preview time is faster.

450:54

Then they iterate some more, not with the

450:57

developer but with the agent on their

451:00

laptop without the need to install Xcode

451:02

or Android Studio. And once they're done,

451:06

they can tell the agent to take that

451:08

code, open a PR and send it to the

451:11

developer. This workflow is what makes

451:14

us 10 times more productive. Not only

451:17

because of using AI but because of using

451:19

AI to change the workflow, reimagine it

451:22

and remove all that friction that we

451:24

took for granted in the old days.

451:29

That is how we become 10 times more

451:32

productive. Thank you.

451:37

>> Hi everyone, my name is Gabe Dees Mesa.

451:40

I'm an engineer here at OpenGov and today

451:43

we're going to be talking about agents

451:44

in production. Specifically, how

451:47

OpenGov built and scaled OG Assist. So

451:52

this presentation is going to be

451:54

jam-packed with just so much good stuff.

451:58

Uh we're going to talk about uh AI

452:00

agents. We're going to talk about our

452:01

harness. We're going to talk about

452:04

evals, observability, traces. We're going

452:08

to talk about tools and skills.

452:12

There's going to be a lot of good

452:13

stuff in here. We're going to talk to

452:15

you guys about what we do at OpenGov

452:18

and how we operate at the scale that uh

452:20

we operate at um in production. So

452:23

you'll be able to see a real use case

452:26

and workload uh with AI agents. Um so

452:29

without further ado, let's get started.

452:33

Okay, agenda. So just really quickly

452:36

going to go through uh high level what

452:38

we're going to talk about today. Uh I'm

452:40

going to tell you guys a little bit

452:41

about OG Assist and what OpenGov is.

452:44

I'm going to tell you guys the origin

452:46

story of how this all kind of came to

452:48

be. Uh we're going to talk about OG

452:51

Assist's uh big bet on effect uh a

452:55

little bit into our core agent loop. Uh

452:57

we're going to talk about the A2A

452:58

protocol, evals.

453:01

We're going to talk about how we manage

453:03

long context. We're going to talk about

453:06

monitoring, observability, how we

453:08

collect feedback uh and how we iterate

453:11

on that feedback. We're gonna lastly uh

453:13

also talk about tools and skills and how

453:16

at OpenGov we use AI not only

453:21

externally uh that we uh serve to

453:23

customers but also internally to improve

453:25

our development workflows.

453:28

Just a little bit about me before we go

453:30

any further. My name is Gabe. I'm a

453:32

software engineer here at OpenGov. I work

453:35

on the AI agents team and uh I'm one of

453:38

the folks that helped build uh OG Assist

453:40

and some of the systems that you guys

453:42

will be seeing today.

453:44

So, a little bit about OpenGov. OpenGov is

453:47

a software company uh on a mission to

453:50

power more effective and accountable

453:52

government.

464:11

Please welcome our MC for this

464:13

afternoon's programming, director of

464:15

technology at Oliver Wight Americas,

464:17

Deina Delias.

464:35

Good evening everyone. Gosh, I am so

464:40

grateful to be up here with you. How's

464:43

AIE 2026?

464:48

Thank you for being here live and

464:52

online.

464:53

Thank you so much. So

464:56

um apologies

465:00

Deina Delias, Oliver Wight Americas. We do

465:03

integrated business planning and

465:05

strategy consulting.

465:07

So honored to be here with you all. We

465:10

covered so much ground: 18 tracks of

465:13

workshops, keynotes, panels, expo

465:16

sessions,

465:18

breakouts, and most of all, your

465:20

networking sessions. Have you met all of

465:22

your friends tonight?

465:25

Yes. No.

465:27

Precious.

465:30

Am I the only one who thinks the more I

465:32

know, the more I don't know? Show of

465:35

hands.

465:36

Oh, thank you. What? Pity hands up. I

465:41

I'll take it. Thank you.

465:45

But thankfully for us, the expo has a

465:47

mass of wonderfully supportive sponsors

465:51

and expo partners ready to assist you in

465:53

your business and personal projects for

465:56

best practices.

465:58

Talk to them, visit them, let them help

466:00

you achieve your goals. Check out the

466:02

dancing robots. Take a picture with

466:05

them. Win the giveaways. check out start

466:09

start a battlefield tonight

466:12

um and talk about

466:15

best practices.

466:17

This next speaker is someone I truly

466:20

look up to, and I'm honored to make his

466:22

introduction. His achievements are so

466:26

vast it's hard to wrap them all up in a

466:29

few sentences. So I'll use his humble

466:32

words instead.

466:34

He's an author, an educator, and an advocate

466:37

for AI best practices. He translates

466:41

complex technical concepts into

466:44

accessible learning materials.

466:46

I am truly excited for what he has to

466:49

say for us. Give a huge round of

466:52

applause for Addy Osmani.

467:13

Howdy folks.

467:15

So, good afternoon or good whatever time

467:19

it is when you're watching this on

467:20

YouTube. I'm really excited to be here

467:23

and um today I want to talk to you about

467:27

really uh what it takes to keep the

467:29

human in the loop where engineering is

467:31

concerned. I really want to start with

467:33

the human side before we talk about the

467:35

architecture here. I think that the

467:37

engineer of the future is going to be

467:40

really defined by the person who is able

467:43

to choose what is worth doing.

467:46

They're going to own the evidence.

467:48

They're going to own the understanding

467:49

as well as the verdict around

467:52

increasingly automated work that's being

467:54

done by agents. Now, when I use the term

467:58

verdict, I don't mean that we're

468:00

suddenly all going to be Judge Judy.

468:02

We're not. But what I mean really is

468:05

something just a little bit different. I

468:07

mean we're going to be accountable for

468:09

the production decisions.

468:12

Does something ship? Do we block it? Do

468:14

we redirect it or accept the risk?

468:17

Quality is something that we all talk

468:19

about a lot, but quality produces

468:21

evidence. A verdict assigns

468:24

responsibility

468:26

and answerability is really what lets us

468:28

stand behind a verdict. And this, of

468:31

course, is not the only way that our

468:33

industry is starting to think about our

468:36

roles evolving.

468:38

Boris Cherny recently put some useful

468:40

language around what many teams are

468:42

starting to feel. The old craft

468:44

boundaries are getting blurry and roles

468:47

are rebundling around the work itself.

468:50

And the important question here becomes

468:51

a lot less about what is your title and

468:54

more what part of the system can you

468:56

own.

468:58

Now I like this taxonomy quite a lot. Um

469:02

it's optimistic without being overly

469:05

vague. So things like prototype, build,

469:07

sweep, grow, and maintain. And these are

469:09

real engineering modes. Agents are going

469:12

to help with all of them, but the scarce

469:14

thing is not merely doing the task. It's

469:17

going to be knowing which mode your

469:19

product needs and what quality bar

469:21

applies, and who owns the result at the

469:23

end of the day. Now, we've been talking

469:26

about harnesses and loop engineering and

469:29

software factories over the last couple

469:30

of days. We can talk about why this shift is

469:32

happening. We moved past the model as the

469:35

whole story, right? With harness

469:37

engineering, the coding agent is the

469:39

model plus the harness around it, right?

469:41

Your context, your tools, your file

469:43

system, git. And the harness is what

469:45

turns intelligence into something that

469:46

you can delegate to. The next move was

469:49

loop engineering where we weren't just

469:51

prompting one run anymore. We were

469:53

designing systems that kept prompting,

469:55

checking, and remembering, and deciding

469:57

what happened next. And that's really

470:00

when agents started to feel like

470:01

infrastructure. And once you start

470:03

putting all of those things together,

470:05

you get that software factory. Dex

470:07

covered this well in his talk. But you

470:09

have agents that are running inside that

470:11

inner loop and evidence that comes out.

470:14

Humans still end up making the

470:16

production decisions in this loop. And

470:18

the win really isn't removing us from it.

470:21

The win is moving human judgment to the

470:23

highest-leverage checkpoint, I think.

470:26

And this is why it starts to matter now.

470:29

AI generated and AI assisted code is

470:31

becoming normal code for a lot of us.

470:34

One of Sonar's 2026 surveys said that AI

470:37

assisted code is no longer marginal.

470:39

It's increasingly having a large role in

470:42

our code bases. And once that happens,

470:44

answerability stops being this

470:46

philosophical word. It becomes an

470:48

engineering requirement. And there's a

470:50

quality point here as well, right? Like

470:52

we used to care about clean code, code

470:54

that people could read. But cleaner code

470:57

is actually not just going to help the

470:58

next human and the next person on your

471:00

teams. It actually helps the next agent.

471:03

Another one of Sonar's research

471:05

studies found that clean and messy repos

471:07

had roughly the same pass rates, but

471:10

clean code actually used fewer tokens

471:11

and caused fewer revisits. So there's a

471:13

lot of benefit to maintainability that

471:15

can fuel efficiency for your factories.

471:18

Now making generation cheaper does not

471:21

automatically make review cheaper,

471:23

right? I think a lot of us are facing

471:25

this moment and we know that engineers

471:27

are not naive. The Sonar numbers say

471:29

that almost everybody is skeptical of AI

471:32

code. Now I love working in my software

471:35

factory. I love building my engineering

471:37

loops. But the problem is still

471:39

capacity. If 96% of people don't fully

471:42

trust that code, but only about half

471:45

always verify before committing, we have

471:47

this danger that we've got distrust

471:49

without bandwidth. And so safety comes

471:51

from making verification cheaper,

471:53

clearer, and harder for people to skip.

471:56

And if you zoom out from the individual

471:57

reviewer to the organization, review and

472:00

validation start becoming a bottleneck

472:02

when governance isn't able to catch up

472:05

and adoption is already moving way

472:07

faster than any company can go and set

472:09

their policies. And this means that we

472:11

have some hard questions we have to deal

472:12

with like did a model actually touch

472:14

this file. And the hard questions are

472:17

also like what constraints guided that

472:19

work? what evidence was produced, what

472:22

risk was accepted, and who owned the

472:24

result. Now, the agent can ship more

472:27

than any of us can review, right? So,

472:30

what are we still good for? I It's a

472:32

question that's on a lot of our minds,

472:34

right? And you know, if Homer Simpson's

472:38

experience automating computers can

472:39

teach us anything, maybe this is our

472:42

future. I don't think it is, but it's

472:45

one direction things can take. Now,

472:47

let's try that again. If change is where

472:50

humans enter the loop, if generation

472:52

scales faster than comprehension, the

472:54

scarce resource becomes judgment that's

472:56

backed by evidence. So the question is

472:59

no longer how much can the agent do, but

473:02

where does human judgment still create

473:04

leverage. Now I want to talk to you

473:07

about two terms that I'm going to use

473:10

for the career part of this talk. Alpha

473:12

and decay. Alpha is the gap between what

473:16

you can do today and what current models

473:19

can do. That gap is a very real thing

473:23

and decay is the clock on that gap. If

473:26

the thing that makes you special is a

473:28

capability, the frontier is eventually

473:31

going to come for it. Right? And there's

473:35

a whole conversation around this. This

473:36

is one of the reasons why taste keeps

473:38

coming up. Paul Graham had a point here

473:40

that I think is very right. When anyone

473:42

can make anything, choosing what to make

473:45

becomes very important. And I buy that.

473:47

But I also think that we have to be very

473:49

careful because taste can become a magic

473:52

word for whatever part of the work we

473:54

don't want to explain just yet.

473:58

Mitchell Hashimoto gave us a more useful

474:00

version of this definition. Taste is the

474:02

ability to make high-quality qualitative

474:05

judgments where no objective metric

474:07

exists yet. That matters because it puts

474:10

taste before the benchmark and before

474:12

the market has fully voted. When you try

474:15

out a model and you see the kind of UX

474:17

and the kind of experiences that it

474:19

builds, you can often tell when you

474:22

think it has taste or lacks taste or

474:24

when there's a gap there that humans can

474:26

fill. Now, this is also only useful if

474:28

we can turn some of this concept around

474:30

taste into critique, examples, and better

474:33

judgment over time. So yes, taste

474:37

matters when production gets cheaper.

474:39

And if anyone can generate 10 options,

474:41

the scarce skill is really knowing which

474:44

option deserves to exist. But taste is

474:47

not some eternal moat.

474:50

It's alpha as well. Now the people with

474:53

taste are still going to matter. I

474:55

personally think they're still going to

474:56

matter for a long time. But the best

474:59

version of that skill is not mystique.

475:03

It's making better calls and leaving

475:05

behind examples that your team and the

475:07

system can learn from. Now let's apply

475:10

the decay test. Well, we used to have

475:12

speed that decayed. We used to have

475:15

recall. You know, harnesses have memory.

475:18

Verification is moving into harnesses,

475:20

evals, static checks, and model critique.

475:23

Taste. I continue to think this is going

475:25

to decay much more slowly, but it still

475:27

resets as models learn from examples and

475:29

preferences. Even judgment in some ways

475:32

is a slope rather than a wall. So the

475:34

strategy is not to cling to any one

475:36

capability. It's for us to keep moving

475:38

our edges up a level. So this is one of

475:41

the reasons why what can the agent do is

475:43

not the best strategic question anymore.

475:46

The list of things that agents can't do

475:48

just keeps shrinking. The better

475:50

question for us is really what can only

475:52

a human be answerable for? Not because

475:56

you know, any of us are magical in

475:58

any way, but because some decisions

476:00

actually require ownership. They require

476:03

context, risk acceptance, and

476:05

responsibility after that work ships.

476:08

This is why the word engineer has to get

476:11

just a little bit stricter. More people

476:14

than ever can now make computers do

476:16

things. And I think that's truly

476:17

awesome. The total addressable market

476:19

for builders has never been larger, and

476:22

that's so cool. But it's a huge

476:24

expansion of the leverage. An engineer

476:26

is not merely somebody who can code, you

476:29

know, and and get things to exist. An

476:31

engineer can reason about systems. They

476:33

think about constraints. You defend

476:35

trade-offs. You can manage risk. And

476:37

you're the person that can be reached

476:38

out to when things start to break. So

476:41

what are things that engineers should

476:42

avoid if we want to stay effective and

476:44

accountable in this moment? Well, the

476:47

first thing to avoid really is cognitive

476:50

debt. Now, cognitive debt is the erosion

476:53

of your understanding and memory around

476:55

how to solve problems. I think a lot of

476:58

us start to feel this the more that

476:59

we're using agents every single day. I

477:01

know that I feel this a lot and it's

477:03

because we're deferring more and more to

477:05

AI to solve our problems. For code, it's

477:08

the gap between how much code exists in

477:10

your repo and how much any human on your

477:13

team genuinely understands. And this is

477:15

why things like delegation depth end up

477:17

mattering. You can have a build that

477:20

passes, you know, your tests, a PR that you

477:22

can merge but your team can still end up

477:24

losing its ability to actually explain

477:26

the system that they are shipping to

477:28

production.

477:30

Now a very real pressure is also

477:32

how much we delegate. So agents can now

477:35

stay inside the system long enough for

477:37

the human to lose the thread. So a

477:40

30-second run, right, can feel like an

477:41

interaction, but an hour- or day-scale

477:45

task, so something long-horizon, that's a

477:47

work stream and when tasks can end up

477:50

you know lasting that long especially

477:52

when you begin running many of them in

477:53

parallel review can't just be a glance

477:56

at the end it has to become a whole

477:57

control system. The second thing to

478:00

avoid is cognitive surrender. Now this

478:03

is when you blindly accept AI's

478:06

responses. Delegation is important

478:09

because delegation says do the work then

478:12

show me enough evidence that I can judge

478:13

it. I still make a judgment in that

478:16

situation. Surrender is really saying

478:18

hey your answer is now my answer before

478:20

I have formed any opinions myself. Now

478:24

uh Wharton did a study that kind of

478:26

offers us a warning light here. When AI

478:28

was wrong, 73% of people still,

478:33

you know, picked the

478:34

wrong answer, and they felt more sure. So

478:37

the failure mode is not using AI, but

478:39

it's borrowed confidence.

478:42

The third thing to avoid is

478:43

orchestration tax. Now, if you've been

478:46

in the Bay Area, you will see people

478:48

who, for better or worse, are still

478:49

walking around with their laptops open

478:51

or are talking to you about cloud

478:53

agents. And we're increasingly trying to

478:56

run more and more and more in parallel

478:58

or telling each other that we're

478:59

shipping with hundreds of agents or

479:00

thousands of agents. More AI agents

479:03

running does not mean that there is more

479:05

of you available. Your cognitive

479:07

bandwidth does not parallelize. So every

479:11

loop that you create ends up causing

479:13

more decisions to route, merge, verify,

479:16

and integrate. And the fix is not

479:18

necessarily fewer agents, but it's about

479:20

designing your attention like a system.

479:23

like where you enter, what you require,

479:25

what you reuse. You just want to be very

479:27

intentional about it. Now,

479:30

accountability can be a scary word for a

479:33

lot of people, and I wouldn't be

479:35

surprised if it made you want to go hide

479:36

in the bushes and just tell your agent

479:39

to deal with it.

479:41

But accountability is not what remains

479:43

after agents get good. It's what lets

479:45

the rest of the whole system scale. If

479:48

agents can do more work, if they can do

479:51

it faster in parallel, better than what

479:54

many of us could do, the scarce thing

479:56

becomes the ability to explain intent,

479:59

to inspect evidence, to accept risk, and

480:01

improve the system when the decision was

480:03

wrong.

480:06

Now, here is the career math. The

480:08

half-life of an edge might be one model

480:11

release. Speed, recall, verification,

480:14

even taste all move as the frontier

480:17

moves. But the half-life of a signature,

480:20

your credibility, your expertise is much

480:23

longer. And by signature, I really mean

480:25

the name on the work, the person, the

480:28

team, the institution, whoever stands

480:31

behind what's actually shipped. So

480:33

skills can earn leverage. Accountability

480:35

can turn leverage into trust. And this

480:38

is one of the lines that I want to draw

480:39

pretty clearly. Agents can choose, they

480:42

can route, they can merge, they can

480:43

escalate, they can operate inside

480:45

policy. And in many systems, you know,

480:48

they can, they should, but execution and

480:50

responsibility are very different

480:51

things. The agent can follow your

480:54

runbook, but it can't inherit the

480:55

consequences. When something fails, the

480:58

question is, who understood the policy?

481:00

Who accepted the risk? And who owns the

481:02

blast radius? High agency is something

481:05

that a lot of us talk about these days

481:06

as being like this thing that we're

481:08

looking for when we're hiring. High

481:10

agency is actively taking ownership of

481:12

your outcomes. So knowing when to

481:14

delegate, when to inspect, when to stop,

481:16

and when to put your name on the

481:18

results. High agency in this world is

481:21

not I personally do everything. You

481:24

know, that version doesn't really scale.

481:25

It's not just hustle theater, but it's

481:27

ownership with judgment attached. This

481:30

agency ladder tries to make that a

481:33

little bit more concrete. At the bottom,

481:35

you've got someone that flags a problem

481:37

and leaves it for the system. Higher up,

481:40

they execute, diagnose, propose,

481:42

recommend, and resolve. And the rare

481:45

top movement is discernment. You know,

481:47

maybe you find a problem and you decide

481:49

whether or not it's worth investing in.

481:51

Maybe it's not and maybe you move on.

481:53

But when agents make more paths

481:55

possible, agency is not chasing every

481:58

single path. It's really just deciding

482:00

which paths deserve your ownership and

482:03

attention. So translate that into an

482:06

operating model. Agents can run much

482:08

more of the inner execution loop. They

482:10

can investigate, implement, test and

482:12

report. I think that there's leverage in

482:14

that, but that outer loop is still

482:16

engineering. So deciding, verifying,

482:19

approving, owning, that inner loop is

482:21

capability. The outer loop is agency.

482:24

And this is a boundary that I really

482:26

care about. Your agent returns evidence.

482:29

It returns diffs, tests, logs,

482:32

rationale, traces, trajectories,

482:34

screenshots, whatever the work itself

482:37

requires. But then the engineering

482:39

really begins. We decide whether the

482:41

work was worth doing. We verify whether

482:44

the evidence is enough and we approve or

482:47

redirect or own what reaches production.

482:50

It doesn't matter if you're someone

482:51

that's just working with a small number

482:52

of agents or whether you're working with

482:54

thousands of agents. I still very much

482:56

think that these ideas apply. So the

482:58

boundary is not human looks at AI

483:00

output. The boundary is evidence and

483:03

responsibility.

483:05

So here's an operational rule. Explain

483:08

it or don't ship it. And it's not

483:10

because humans have to type every line

483:12

or read every line, but because someone

483:14

has to understand the work well enough

483:16

to defend it. If you've ever worked in a

483:18

large codebase or an enterprise

483:20

codebase, some code bases have this

483:21

concept of an OWNERS file or certain

483:24

subdirectories where there are people

483:25

who are on the hook for that part of the

483:27

system. You can think about this in a

483:29

very similar way. Who's accountable for

483:31

that part of your architecture in your

483:33

codebase? Your model might write the

483:36

code and the question is really still

483:37

whether you can explain those changes

483:39

that the agent is shipping, whether

483:40

you've got the evidence, whether you

483:42

understand the risks. Now, this is one

483:44

of the things I want you to remember

483:46

near the end. Automation moves the floor

483:49

for all of us. Engineering continues to

483:52

move up a level. And our new work might

483:55

be loop design, evidence design, and

483:57

brownfield stewardship, but fewer

483:59

keystrokes doesn't mean less engineering

484:03

over the next few years. It means that

484:05

there is more surface area that needs

484:07

taste, verification, ownership, and

484:10

ultimately care.

484:12

I don't think I've ever been more

484:14

excited about the future of this field.

484:17

Every time that we've made it easier to

484:20

write software, we've predicted that the

484:22

world would need less of it. And in

484:24

fact, the opposite happened. Higher

484:26

level languages happened, frameworks,

484:28

cloud, low code. The pattern always went

484:31

the other way. And when you lower the

484:33

cost, latent demand ends up appearing.

484:36

Those ideas that people didn't think

484:37

were feasible to build and get out there

484:39

are suddenly unlocked. And agents are

484:41

going to do the same thing for a lot of

484:42

people. It's not going to remove

484:44

engineering work. It's going to move the

484:46

bottleneck from can we build this to

484:48

should this exist and can we answer for

484:50

it. So build the factories, keep the

484:54

lights on, own the verdict. I hope this

484:56

was useful. Thank you.

485:06

Now joining us on stage are the

485:08

co-founders of Artificial Analysis,

485:11

George Cameron and Micah Hill-Smith.

485:33

Hey, hey. Good afternoon everyone. I'm

485:35

Micah. This is George. And we are the

485:38

co-founders of Artificial Analysis.

485:41

Artificial Analysis is an AI

485:43

benchmarking company. And today, we're

485:46

going to be talking to you about the

485:47

cost of intelligence. A couple of years

485:50

ago, when either of us would give talks

485:51

like this, we would spend a bunch of

485:53

time justifying why intelligence and

485:56

cost trade-offs matter. Today I'm going

485:59

to skip that whole part of the bit and

486:01

we're just going to get straight into it

486:02

because I would be shocked if I needed

486:04

to convince anyone in this room why the

486:06

cost of intelligence is an important

486:08

topic for us to be talking about in mid

486:10

2026.

486:12

So here's what we're going to do. I'm

486:14

going to tell you a bit about who we

486:15

are. We're going to use some of our data

486:17

to take a brief look at the state of the

486:19

AI race. Then we're going to spend most

486:21

of our time breaking down the cost of AI

486:24

today and what's driving it. We're going

486:27

to use some data from our latest agentic

486:29

knowledge work evalu.

486:39

What the heck does that mean? We build

486:42

benchmarks and evals to test everything

486:45

in the AI stack that matters to

486:47

developers and companies making

486:48

decisions about AI technologies. We test

486:52

chips, cloud infrastructure, models, and

486:54

agents.

486:56

We try to figure out how smart the

486:58

models are, how fast they are, and how

487:00

much they cost. We publish a ton of that

487:03

data on this website. Hopefully, some of

487:05

you have seen it. And we work with

487:07

companies throughout that entire AI

487:09

stack to measure their technologies,

487:12

help them and the world understand what

487:13

they can do. Got a handful of examples

487:16

on the slide back there from some of our

487:18

work with OpenAI, Google, and Nvidia on

487:20

their models recently.

487:24

Let's have a look at the state of the

487:25

race.

487:27

Before I show the first chart, going to

487:29

talk about an idea that is very

487:31

important to the way that we think about

487:33

building AI evals.

487:37

The vast majority of the things that we

487:39

foreseeably want AI to do, the models

487:42

are still far too dumb to do. It's

487:45

utterly profound what the models can do

487:48

today. Things are pretty nuts, and yet

487:51

because the future is so enormous this

487:53

is almost certainly still true. So what

487:56

this means is that at any given moment

487:59

in AI we've got this concept that we

488:01

think of as the intelligence frontier

488:03

what today's smartest models can do.

488:06

If we think of most of the tasks

488:09

being beyond that, certainly beyond that

488:12

in terms of being able to reliably do

488:14

them, that explains why so much of what

488:18

all of us in this room want to do with

488:19

AI is focused on what the absolute

488:22

latest frontier models at any given

488:23

point can do. It also implies that there

488:26

exists a set of tasks that are inside

488:29

the frontier and that that set of tasks

488:32

is growing every month as new models

488:34

come out.

488:36

For that set of tasks, playing the

488:38

intelligence cost trade-off is

488:39

incredibly important because by choosing

488:42

to not use the smartest model for every

488:44

single thing, you can spend 10, 100, a

488:47

thousand times less to get the same work

488:50

done by the AI.
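
One way to picture that trade-off is a router that picks the cheapest model able to handle a task. The sketch below is an assumption-laden illustration, not Artificial Analysis's methodology: the model names, capability scores, and per-task prices are all made up, and `cheapest_capable` and `savings_factor` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    intelligence: float   # made-up capability score
    usd_per_task: float   # made-up cost per task


# Illustrative catalogue only; none of these numbers come from the talk.
CATALOGUE = [
    Model("small", 30.0, 0.001),
    Model("medium", 50.0, 0.01),
    Model("frontier", 70.0, 1.0),
]


def cheapest_capable(required: float, models=CATALOGUE):
    """Return the cheapest model whose score meets the requirement.

    Returns None when the task is beyond every model in the catalogue.
    """
    capable = [m for m in models if m.intelligence >= required]
    return min(capable, key=lambda m: m.usd_per_task) if capable else None


def savings_factor(required: float, models=CATALOGUE) -> float:
    """How many times cheaper routing is than always using the top model."""
    best = max(models, key=lambda m: m.intelligence)
    chosen = cheapest_capable(required, models)
    if chosen is None:
        raise ValueError("task is outside the frontier")
    return best.usd_per_task / chosen.usd_per_task


print(cheapest_capable(45.0).name)
print(cheapest_capable(90.0))
```

With these invented numbers, a task needing a score of 45 routes to the "medium" model, while a task needing 90 is beyond the frontier and gets no model at all. That mirrors the distinction in the talk between tasks inside and outside the frontier.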

488:53

The state of the race,

488:56

we publish a metric called the Artificial

488:58

Analysis Intelligence Index.

489:00

We like to say that it is the best one

489:03

number for understanding the AI race,

489:05

but that if we thought you only needed

489:07

one number, we wouldn't need to publish

489:08

the rest of the website.

489:11

What this metric actually is is a

489:12

synthesis across nine different evals

489:15

that we run. We're at version 4.1 of our

489:17

index. It includes a bunch of agentic

489:19

stuff. It includes a bunch of hard

489:21

reasoning Q&A type stuff.

489:24

And we really do think that it is the

489:26

best one number for your sense of what's

489:28

going on. We've got Claude Fable 5 on

489:31

top. That little not currently available

489:33

thing. I guess we get to go remove that

489:35

from the website after this today.

489:38

One of the things we like to do with our

489:40

intelligence index is plot how it's

489:42

changed over time. This chart here

489:47

is the smartest model from each one of

489:49

these labs over the last few years.

489:53

Some of it hasn't changed that much. You

489:55

can see OpenAI and Anthropic trading

489:57

blows over the last few years.

490:00

You can kind of see the dots getting

490:02

closer together on the right hand side

490:04

on the X-axis because the pace of

490:06

releases especially over the last year

490:09

has gone up and up. You can also see all

490:12

of the companies hot on the heels of the

490:14

frontier who have been and are releasing

490:17

models that achieve the same level of

490:20

intelligence as those frontier models

490:22

just months later.

490:24

If I take some of these lines off and

490:26

all we look at is the smartest model

490:28

overall and the smartest open weights

490:30

model at any given point, we can draw

490:32

this line and we can look at the gap

490:34

between the open weights frontier and

490:36

the overall frontier.

490:38

In any given month, you can probably

490:40

find a headline saying that open weights

490:42

models are further from the frontier

490:44

than ever or that open weights models

490:46

have just caught up to the latest

490:47

proprietary models. I think when we read

490:50

this chart, what we see is that

490:53

unfortunately neither of the extreme

490:55

versions are true and we see a

490:57

consistent 3-to-9-month gap that's held

491:00

surprisingly consistent over all of the

491:03

last 3 years.

491:05

That's still pretty nuts by the way

491:06

though because that does mean that

491:09

within 9 months of Mythos being

491:11

announced, we are predicting that

491:13

someone's going to give away a copy of a

491:14

model as smart as Mythos. You can hold

491:17

us to that prediction. I'd be very

491:18

surprised if this trend goes away

491:20

anytime in the next year or so.

491:24

Beyond intelligence, we can plot a bunch

491:28

of the metrics that you have to trade

491:29

off against how smart the model is. This

491:33

one's pretty simple. This one's the

491:35

price of the tokens.

491:38

This one actually might be surprising in

491:39

a talk that we've called the cost of

491:41

intelligence because we all have this

491:43

feeling that the amount we can spend on

491:45

AI is skyrocketing higher right now. And

491:48

that's completely true. But this trend

491:50

here is also true. Token prices have

491:53

continued to fall by 5 to 10x every year

491:57

for each fixed level of intelligence.

492:00

Each of the lines here is a band of 10

492:02

points of intelligence index. I promise

492:04

you that if you ever have to pick

492:05

between a model that's 10 points higher

492:07

on our intelligence index than another

492:08

model, it's incredibly hard to find any

492:10

task at all in the full distribution of

492:13

tasks that the model that is 10 points

492:14

dumber will outperform the better model

492:16

on.

492:19

Each one of these lines goes down

492:21

incredibly quickly. It's a log axis on

492:23

the y-axis on this chart, by the way.

492:26

And the cost of tokens at the frontier

492:27

has stayed surprisingly consistent.

492:30

But we look at cost per task across all

492:34

of the evals and tasks that we run for

492:36

our intelligence index and yeah the

492:38

number is going up.

492:40

This is the average across every task

492:43

which includes some agentic stuff, some

492:45

non-agentic stuff. So it's actually

492:47

hiding how extreme cost per task gets in

492:51

some situations today.

492:53

If we break it out a little, these are

492:56

kind of small but we've got the highest

492:57

numbers on the left there. GPQA Diamond, a

493:00

famous important open source evaluation

493:02

data set from a few years ago. It's a

493:05

reasoning evaluation. We don't let the

493:07

models work as agents. It's largely

493:08

solved right now. We see from

493:12

fractions of a cent per answer for each

493:15

model up to about 50 cents. In our

493:18

coding agent index and in our new AA

493:21

Briefcase agentic knowledge work eval,

493:23

we see upwards of $20 being spent on a

493:27

single task.

493:29

The most expensive task in AA Briefcase

493:31

is actually several times that. Leading

493:34

that, of course, we do have Claude Fable 5,

493:35

although fun fact it's kind of small

493:37

here, but you can see Claude Sonnet 5

493:39

actually uses an enormous number of

493:40

tokens and so it's nearly as expensive in

493:42

our AA Briefcase tasks, down the bottom

493:45

there. But this is the thing that we're

493:48

all feeling: we're trying to do

493:50

these really hard tasks. The frontier

493:53

keeps moving; there are more things that

493:54

we can ask the models to do than there

493:56

were a while ago. So we can spend

493:59

enormously more per task than we could

494:03

even though that cost per token for each

494:06

fixed level of intelligence is falling

494:08

by 5 to 10x every year. These orders of

494:10

magnitude are not things that our brains

494:12

are good at getting intuitively and the

494:14

contradictions are kind of nuts. So I'll

494:16

pass off to George now to break down how

494:19

we understand some of these

494:20

contradictions.

494:22

Thanks Micah. So why does AI feel more

494:26

expensive than ever while for fixed

494:29

levels of intelligence the prices of

494:32

accessing that intelligence in terms of

494:34

tokens is falling dramatically and I

494:37

think this is AI engineer world fair we

494:39

actually want to spend more

494:42

higher token budgets

494:46

What I'm going to do now is use our

494:49

AA Briefcase benchmark to do analysis of

494:52

this cost of intelligence

494:56

Our AA briefcase benchmark is our new

494:58

agentic knowledge work benchmark. It

495:01

benchmarks models on realistic

495:04

professional tasks.

495:07

There's four private scenarios,

495:10

each representing weeks of human

495:13

equivalent work.

495:17

And we ask models to complete

495:19

realistic tasks. Then we grade models on

495:22

the outputs of those tasks across three

495:25

dimensions. Rubric correctness,

495:28

analytical quality, and presentation.

495:32

Much like we think about assessing human

495:34

work.

495:36

One of the differentiators for AA

495:38

Briefcase compared to other benchmarks

495:40

is we've tried to make it as realistic

495:42

as possible.

495:44

When giving a task to someone else on

495:47

your team or when receiving a task,

495:50

unfortunately, you're not given it on a

495:53

platter with the precise information

495:55

that you need to complete the task. You

495:58

need to go out and find it. You need to

496:01

trawl through emails, pick up on the

496:03

latest Slack messages. That's what we

496:05

expect for ourselves and others. And so,

496:07

we've tried to mimic this in the task

496:09

that we're giving models in AA Briefcase.

496:13

The environments that models are

496:15

completing tasks in

496:18

are thousands of files,

496:21

messy Excel files, unstructured

496:24

documents, structured documents and

496:25

reports with hundreds of pages,

496:28

emails, Slack messages. And we expect

496:32

and ask of agents to complete these

496:35

tasks just like we ask of ourselves.

496:41

When we look at the outputs of models in

496:45

completing these tasks, you can see vast

496:48

differences

496:50

in the quality of the outputs. And this

496:53

is how we assess the quality and

496:55

intelligence of these models on these

496:57

agentic knowledge work tasks. It also

497:00

gives us a perspective on the progress

497:03

that's been made over the last couple of

497:05

years on this task which is a commercial

497:08

due diligence task. GPT-4o

497:12

presents a pretty basic slide. o3, a

497:16

breakthrough model that was released

497:19

early last year.

497:22

Thinking that o3 was only last

497:24

year is crazy to me.

497:27

You can see that o3 produces a few

497:31

bullet points helpful but not what we

497:33

would expect of ourselves in completing

497:35

this kind of task. And so this shows us

497:38

the progress that's been made when we

497:40

look at Opus 4.8's output and Fable 5's

497:43

output, which goes a lot more in depth

497:47

in terms of analytical rigor and

497:49

presentation quality.

497:57

So let's look at how models completed

498:01

this task and what it cost. If you

498:04

remember Micah's slide, he showed that

498:05

some models are using over $20

498:09

worth of tokens to complete these

498:11

tasks. And so let's look at the drivers

498:14

to learn a bit about the costs of

498:16

agentic tasks.

498:18

Four drivers to look at and the key

498:21

drivers here are token price, the number

498:23

of turns in the agent trajectory, the

498:26

token efficiency and usage of models,

498:29

and last but potentially most important,

498:32

the impact of prompt caching.

498:35

Taking a look, to start,

498:39

at the token prices.

498:42

What we can see as a first takeaway here

498:44

when looking at the cache-hit token

498:48

price, the input token price without a cache

498:52

hit, and the

498:55

output token price, is that

498:57

there are orders-of-magnitude differences

498:59

between the models. This is a critical

499:01

driver.

499:02

There are two orders of

499:04

magnitude difference in terms of the

499:06

token price between frontier models like

499:11

Claude Fable 5 and still-good, very

499:15

usable workhorse models like DeepSeek

499:18

V4 Flash and gpt-oss-120b.

499:23

The second takeaway here is the

499:25

difference between the individual token

499:26

or the types of token prices.

499:29

You can see that there's vast

499:32

differences in the cache-hit price,

499:34

the input token price without a cache hit,

499:36

and the output token price. And we'll

499:38

get to that impact later when we look at

499:40

token usage.

499:44

Next, these are long-running agentic tasks

499:47

that we are now asking of models,

499:49

especially in realistic environments

499:50

where they need to navigate all of these

499:52

thousands of files to get to an answer.

499:55

And models are doing that. They're

499:56

starting to really explore the

500:00

environment

500:02

actually similar to humans when we

500:04

search Slack and do similar

500:06

tasks like that. You can see here with

500:08

the breakdown of tool calls of models is

500:10

that they're doing hundreds of calls and

500:13

they're exploring their environment.

500:15

They're viewing images. They're reading

500:17

files. They're writing files to do ad

500:20

hoc analysis that's going to feed into

500:22

the slide output that we just saw.

500:26

And this costs:

500:29

each turn is output tokens, and then

500:32

those output tokens flow into input

500:35

tokens in the agent trajectory and we

500:38

pay for that.

500:40

When we look at the output tokens to

500:43

complete a task, we can see there's vast

500:46

differences.

500:48

You can see that Claude Sonnet 5

500:51

released only yesterday used over

500:54

200,000 output tokens per task.

500:58

Compare that to your ChatGPT query a

501:01

couple of years ago where you might have

501:03

been doing couple of hundred tokens,

501:07

couple of thousand tokens, maybe 200,000

501:10

tokens to complete a task. And you can

501:12

see here that models vary orders of

501:15

magnitude. And this is driven by two

501:17

things. This is the number of turns that

501:19

we just looked at. And secondly, it's

501:22

the output verbosity of the model. Both

501:25

in terms of how much reasoning they're

501:27

doing, how many reasoning tokens they're

501:28

outputting to complete a task and also

501:32

in completing their answer. It needs to

501:33

put together that slide and all of that

501:35

detail. That takes tokens. And we pay

501:38

for those tokens.

501:42

But stepping back to look not just at the output

501:44

tokens that the models produce, but at the

501:48

total tokens that we're paying for.

501:52

We have that on the left hand chart

501:54

here. AA briefcase token breakdown

501:57

answer tokens, reasoning tokens, input

502:00

tokens. Can anybody see any output

502:02

tokens here? They're all input tokens.

502:04

The vast majority

502:07

of tokens to complete long-running

502:09

agentic tasks are input tokens. You can

502:11

barely see any output tokens there. And

502:14

so therefore, the two token prices that

502:16

we want to look at first is the input

502:20

token price without a cache hit and the

502:22

input token price with a cache hit.

502:26

And if we remember that slide, there's

502:29

vast differences between those models.

502:31

And you can see that on the right chart

502:32

here, which is the cache discount for a

502:35

cache hit of an input token.

502:38

It's usually around 90% here, but it's

502:41

also different for models and providers

502:44

whereby some models here are 99% and

502:47

others are around 80%. And if we think

502:50

about the vast majority of

502:52

tokens being input tokens,

502:56

you can understand that a difference in the cache

502:59

discount or the cache hit rate can change

503:03

the total cost of an agentic task

503:06

by multiples.
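
The cost drivers described here (token prices, number of turns, tokens per turn, and the cache discount) can be sketched in a few lines of Python. This is a rough illustration, not Artificial Analysis's methodology: the prices, token counts, the `trajectory_cost` helper, and the assumption that every turn re-sends the whole history as input are all hypothetical, chosen only to show why input tokens dominate and why the cache hit rate can move the total by multiples.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """Hypothetical prices in USD per million tokens (not real list prices)."""
    input_uncached: float
    input_cached: float
    output: float


def trajectory_cost(prices, turns, start_context, output_per_turn,
                    tool_result_per_turn, cache_hit_rate):
    """Estimate the cost of one agent run.

    Assumes each turn re-sends the full history as input, so output tokens
    and tool results from earlier turns are billed again as input later.
    """
    context = start_context
    cached = uncached = output = 0.0
    for _ in range(turns):
        cached += context * cache_hit_rate            # input billed at the cache-hit price
        uncached += context * (1.0 - cache_hit_rate)  # input billed at the full price
        output += output_per_turn
        context += output_per_turn + tool_result_per_turn  # history grows every turn
    usd = (cached * prices.input_cached
           + uncached * prices.input_uncached
           + output * prices.output) / 1_000_000
    return {"input_tokens": cached + uncached,
            "output_tokens": output,
            "usd": usd}


# Made-up numbers: 100 turns, 5k starting context, 500 output tokens and
# 1.5k tool-result tokens per turn, with a 90% cache discount.
prices = Prices(input_uncached=3.00, input_cached=0.30, output=15.00)
no_cache = trajectory_cost(prices, 100, 5_000, 500, 1_500, cache_hit_rate=0.0)
with_cache = trajectory_cost(prices, 100, 5_000, 500, 1_500, cache_hit_rate=0.9)

share = no_cache["input_tokens"] / (
    no_cache["input_tokens"] + no_cache["output_tokens"])
print(f"input share of tokens: {share:.1%}")
print(f"cost without caching:  ${no_cache['usd']:.2f}")
print(f"cost at 90% cache hit: ${with_cache['usd']:.2f}")
```

With these made-up numbers, input tokens are more than 99% of all tokens processed, and a 90% cache hit rate cuts the bill from roughly $32 to under $7, even though the output tokens are priced five times higher per token.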

503:08

And so I think we're used to thinking

503:11

about output tokens, but I'd ask us,

503:13

let's start with the cache-hit price when

503:16

thinking about the cost of an agentic

503:18

task and tokens.

503:21

I think the last perspective we want to

503:24

share with you and wrap up with is the

503:27

most important chart for understanding

503:29

the AI landscape in 2026. In 2025, it

503:34

was simpler. It was our intelligence

503:36

index bar chart. Now we start with the

503:39

intelligence versus cost per task as we

503:42

are now wrestling with these trade-offs

503:44

of the cost of intelligence.

503:48

And a helpful archetype to understand

503:50

this and to reason about how to think

503:52

about cost per task whether we should

503:55

just use the most intelligent model or

503:57

the cheapest model is to break down

503:58

tasks into two archetypes. The first

504:02

archetype is a task whereby there's not

504:05

a ceiling on how much intelligence you

504:07

could want to complete the task. More

504:09

intelligent equals better outputs. And

504:11

this is the case for most knowledge work

504:13

today

504:15

in professional tasks.

504:20

Not everybody agrees with that but

504:21

that's something that at Artificial

504:22

Analysis we believe quite strongly.

504:24

Think about analysis that you might do

504:27

on strategy or on how we can save costs

504:31

or on even writing a job description. It

504:35

can always be better. We can always do a

504:36

better job as humans and that's the case

504:38

for models. So there's not a ceiling on

504:41

that in terms of what level of

504:43

intelligence we need, but we do need to

504:45

trade-off costs. And so the question

504:46

therefore is how much are we willing to

504:49

pay for the extra intelligence? And you

504:51

want to look at the Pareto line here in

504:54

making that decision. The second

504:56

archetype of task is whereby there's a

504:58

ceiling. An example is how much did I

505:01

spend on Stripe fees last month.

505:06

A smarter model doesn't necessarily give

505:08

you a different or a better answer to

505:10

that. There's a ceiling on the task and

505:12

then you want to think about what is the

505:14

level of intelligence, the minimum level

505:16

of intelligence that can complete the

505:18

task. And then you want to choose the

505:21

cheapest model

505:23

that can do that, which is to the left on this chart.
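
The two archetypes map onto two different selection rules, sketched below in Python. The model names, scores, and costs are invented for illustration, and `pareto_frontier` and `cheapest_sufficient` are hypothetical helpers rather than anything Artificial Analysis publishes. For uncapped tasks you pick a point on the Pareto line according to how much you will pay; for capped tasks you take the cheapest model above the minimum bar.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    intelligence: float   # hypothetical index score, higher is smarter
    cost_per_task: float  # hypothetical USD per task


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep models that no other model beats on both cost and intelligence."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from cheapest to most expensive; a model earns its place only if
    # it is smarter than everything cheaper than it.
    for m in sorted(models, key=lambda m: (m.cost_per_task, -m.intelligence)):
        if m.intelligence > best_so_far:
            frontier.append(m)
            best_so_far = m.intelligence
    return frontier


def cheapest_sufficient(models: list[Model],
                        min_intelligence: float) -> Optional[Model]:
    """For a task with a ceiling: cheapest model that clears the bar."""
    good_enough = [m for m in models if m.intelligence >= min_intelligence]
    return min(good_enough, key=lambda m: m.cost_per_task, default=None)


models = [
    Model("frontier", 60, 20.00),
    Model("mid", 50, 2.00),
    Model("verbose", 45, 5.00),   # dominated: "mid" is smarter and cheaper
    Model("small", 30, 0.10),
]

print([m.name for m in pareto_frontier(models)])
print(cheapest_sufficient(models, min_intelligence=40))
```

In this toy data the dominated model drops off the frontier, and a task that needs a score of at least 40 goes to the mid-priced model rather than the frontier one.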

505:28

So that is the cost of intelligence.

505:31

We're Artificial Analysis. We're hiring.

505:33

Thanks very much. Thanks.

505:45

Please join me in welcoming the

505:47

co-founder and chief technology officer

505:50

at Arena, Wei-Lin Chiang.

506:09

Hello everyone. Excited to be

506:14

here sharing our experience building

506:18

agent evals in Arena. My name is Wei-Lin.

506:21

I'm the co-founder and CTO at Arena. Um

506:27

quick intro on me. Uh I did my PhD in AI

506:31

research at UC Berkeley. uh where my

506:33

focus was building robust scalable

506:37

evaluations for AI systems and that work

506:40

eventually became the foundation for

506:42

what we are building today at Arena uh

506:45

to measure intelligence in the real

506:47

world. Some of you may

506:50

have heard of our earlier work, like

506:54

LLM-as-a-judge back in 2023. We did

506:58

some of the early studies, as well as

507:00

building Chatbot Arena and some of

507:03

the evaluation research I was

507:05

fortunate to contribute to.

507:08

So what is Arena? Simply put, Arena

507:13

is an AI evaluation company. Our mission

507:17

is to measure intelligence in the real

507:19

world beyond just static benchmark but

507:23

uh the intelligence actually delivering

507:25

real values to the users the customers

507:30

and over the past couple years uh we

507:33

have been tracking you know all the

507:35

major AI breakthrough obviously after

507:39

the ChatGPT moment in 2022.

507:43

After that it was GPT-4 Turbo and GPT-4

507:48

having the breakthrough in chat and

507:51

multimodal capability, and then evolving

507:54

to the reasoning model, the thinking model,

507:57

with OpenAI o1,

508:00

and in 2025

508:02

we saw the image generation

508:06

breakthrough of Nano Banana, which was

508:10

originally tested in Arena

508:14

under a code name before its public

508:16

release, and we are also seeing Grok

508:21

catching up GPT images 2 recently

508:24

released uh to become you know the

508:27

current frontier of image uh models as

508:31

well as you know the video AI

508:34

generations um B and recently bid CES

508:41

So towards the end of 2025, when Opus 4.5

508:45

and 4.6

508:46

went from being a great coding model

508:50

to a genuinely agentic coding model

508:53

that can do longer-horizon tasks, that

508:57

showed up in Arena too, where

509:01

we measure in Code Arena. We see, you

509:04

know significant improvement over the

509:06

past generational model and the most

509:09

recent fable breakthrough um where we

509:12

measure in Agent Arena

509:14

uh we will talk a little bit more later

509:17

as well as the most recent GLM 5.2

509:20

release which is like really a big

509:22

milestone uh for the open source model

509:25

community.

509:27

So we have at Arena we have done this

509:30

with scale. We now see 10 million

509:33

monthly visitor going to uh our product

509:37

arena.ai, and we have collected 700

509:41

million conversations across all the

509:44

modalities text, vision, image, video,

509:48

coding these days agentic and we have

509:52

hit a huge milestone. Very excited to

509:54

share that we just recently

509:57

announced we hit 100 million um

509:59

annualized revenue in just eight months

510:01

after we first released our evaluation

510:04

product.

510:06

We are also uh ranked among the top

510:09

genai product globally by unique number

510:12

of monthly visitors according to a16z's

510:16

analysis.

510:19

So

510:20

the um topic I want to cover today uh

510:23

and the core of what we are offering um

510:26

is a live leaderboard, which is based on

510:30

real world evaluations u powered by the

510:33

10 million users 700 million um traces

510:37

to rank all the top AI models, frontier

510:40

models, for the past couple years, and

510:43

we cover text image video uh code agent

510:49

Um so really wanted to build a um

510:52

leaderboard that can help everyone to

510:54

find the best model for their use cases

510:57

and it's free. It's available for anyone

511:00

to see and use at arena.ai/leaderboard.

511:03

You can see all the analytics there: the

511:06

frontier comparing cost performance you

511:09

know use cases different category

511:11

different modality of these models

511:12

capability.

511:14

So yeah, so the real problem today I

511:17

want to talk about is to share the

511:19

experience of how we evaluate

511:21

agents. I wanted to share our firsthand

511:25

experience: in the past couple of months

511:27

we've been building the agentic eval,

511:29

which is very, very different from the

511:32

past, when we evaluated

511:34

chatbots, and I wanted to share some

511:35

lessons here. Before we dive into the

511:40

details, first: why does this matter?

511:44

wanted to talk about the trend so we

511:46

have been seeing um the very rapid shift

511:49

from the chatbot to the agent

511:53

paradigm, and if you look at

511:56

OpenAI's data on Codex traffic, the share

511:59

of output tokens coming from agents

512:02

has just skyrocketed. You can see that

512:06

inside OpenAI essentially 100% of the

512:09

output tokens come from agents, from Codex, and

512:12

for other organizations you know average

512:14

is like above 60% now and individual

512:17

also climbing very fast so there's no

512:20

question that the token flow is now

512:23

driven by agents

512:26

and we also see that agents are not just

512:29

for engineers right it's not just for

512:31

software engineering if you look at

512:33

Codex adoption by department at

512:36

OpenAI, engineering is obviously 99%, but

512:40

also finance, recruiting, legal, and so on:

512:44

they are all at almost 90%. And

512:47

as you can see from the

512:50

studies from Goldman Sachs, the monthly

512:52

token usage is also skyrocketing towards

512:56

60 quadrillion

512:59

tokens in the next couple years.

513:03

So really you know the economics also

513:05

tell the same story. If you look at the

513:08

Ramp data, AI spending is getting

513:10

closer to people spend, right? So you

513:13

see that for the top 1% of

513:16

companies, monthly AI spend per

513:19

employee is actually already around 7.4K,

513:24

roughly half the salary of a software

513:27

engineer. So this is really like you

513:29

know, a historical shift, meaning

513:32

also the stakes: choosing the best

513:35

model, the right model, and optimizing

513:37

your agentic AI workflow

513:40

has never been more important.

513:44

So

513:46

the key question here is like um we give

513:49

agent lots of autonomy. We spend a lot.

513:52

We invest a lot. And the key question

513:54

here is like how do we actually measure

513:56

agents' outcomes? So that's really the

514:00

bottleneck, right? You want to

514:01

understand the value of these agentic uh

514:04

output and actions.

514:06

And this turned out to be a pretty hard

514:10

technical problem for a few reasons.

514:13

First agents are multi-component

514:16

systems, right? You got the model, the

514:18

agentic loop, the tools, the

514:21

harness, um you know, any of these

514:24

pieces can break the system. Second, you also

514:28

have agents operating through complex

514:30

workflows now, in a real environment:

514:34

building apps, debugging, doing

514:36

research, producing documents, slide

514:39

deck and so on. So it's like more

514:41

involved task. Uh and third the uh

514:44

signals that we can collect you know in

514:46

this trajectory are also becoming sparse

514:50

and spread across a longer horizon. You

514:53

know, a task may take 100 tool calls to

514:56

finish right before you know if it's

514:59

succeeding or failing or you give any

515:02

feedback or get a chance to steer it. And

515:05

to deeply understand the problem at

515:08

Arena we decided to actually firsthand

515:11

building real world you know agentic

515:14

product and app to actually source the

515:17

organic traces and feedback from the

515:20

actual users for us to you know do

515:22

research and deeply understand that. Uh

515:24

So last month we launched agent mode

515:27

in arena uh to allow anyone to go to you

515:31

know arena to experience and evaluate

515:33

agentic capability. So it's right now

515:36

available for everyone to use and wanted

515:40

to show you a very quick demo if if I

515:44

can start the... Is the video moving?

515:47

Okay, so this is Agent Arena. You go to

515:50

arena.ai, you

515:52

choose the agent mode and this is a real

515:55

world you know agentic product you can

515:58

go and evaluate model you come in and

516:00

type any question you want in this case

516:03

I ask it to download Google's Q1

516:07

earnings report and create a slide

516:09

deck summarizing the output in

516:11

PowerPoint, and you can see the agent

516:14

goes off and does the work: searching

516:17

the web, pulling the right website, starting to

516:20

structure the deck, and then using the

516:23

bash tool, writing Python code to

516:26

generate the slide deck, right? And

516:31

you can see that and at the end uh

516:33

there's an artifact generated by the

516:36

model uh that user can download and see

516:39

and this is like a you know a real

516:41

PowerPoint output by the model, and

516:44

then user can at the end we ask every

516:47

turn like we ask was this task

516:49

successful or not and user can provide

516:51

feedback that way, and this is one of the

516:54

signals that we use to evaluate and

516:56

understand whether agent actually

516:58

delivers the outcome.

517:00

So yeah this is just to highlight the

517:02

panel

517:04

And under the hood, how we build

517:07

Agent Arena: we give the model a

517:09

set of tools: file system tools

517:13

(read, write, edit, and so on), search, web

517:16

fetching, image generation, and speech as

517:20

well, recently added. So really just

517:22

giving the model tools similar to a

517:25

Claude Cowork-like harness, and also

517:28

terminal access to run code to,

517:31

you know, do work. And we are also adding

517:35

more and more uh connector soon like

517:37

GitHub uh which can connect to your repo

517:40

to you know do more serious software

517:42

engineering task um and you can see this

517:46

plot is the the usage of these tools uh

517:49

in a one-week time frame:

517:52

you see 5.7 million tool calls. You know,

517:55

bash was the number one

517:58

used, at around 46%, and these

518:02

agents are actually using these tools to

518:04

do real real work for users.

518:07

So we also you know dig into the data

518:10

and seeing users are you know pushing

518:13

really hard to um trying to do more

518:17

harder and complex task. Um so real

518:20

session we've been seeing like you know

518:21

users are building you know a movie

518:24

watch list app debugging a control

518:26

systems for autonomous you know vehicle

518:30

and architecting and building a RAG

518:33

pipeline you know implementing features

518:36

in micro and so on. So these are the

518:38

sessions like go over hundreds some of

518:40

them go hundreds of turns and couple

518:43

hundreds of tool calls very serious

518:45

stuff. Um and you can from this you can

518:48

tell that the u the agent that we built

518:51

uh at arena is actually doing real work

518:53

with users and giving user real value

518:56

and we believe the best evaluation

518:58

should be uh grounded and measured in

519:00

real world use cases like this.

519:03

So we launched Agent Arena just a

519:06

month ago, and in the first month

519:08

we collected over a million agentic

519:11

traces, and these are tasks

519:13

spanning coding, research, documents,

519:16

brainstorming planning and we see more

519:18

than the half of these uh uh traces fall

519:21

into work related category more like

519:24

towards professional use and complex

519:26

tasks. And we have seen agents also

519:30

written um more than 50 million lines of

519:33

code uh on arena, Python, Markdown,

519:36

HTML, JavaScript and so on. This is the

519:40

tool distributions that you can see the

519:42

coding is the number one and some of

519:44

these um task you can see is some of

519:47

them are more complex using more tool uh

519:50

some of them use less and this is the

519:53

the line of code generation.

519:56

So now the going back to the evaluation

520:00

question, right? So say we collected a

520:04

million agentic traces. How do we

520:06

actually turn these traces into a

520:09

leaderboard that we can understand which

520:11

model performs better than the others?

520:13

And we primarily um mine the signals

520:18

from three type of uh basically signals.

520:20

One is like explicit which I just show

520:23

you that user will tell us directly like

520:25

which task succeeded or failed. Some of

520:28

them the other one is some implicit. Uh

520:31

we see that if user is actually uh say

520:34

downloading the file or like um

520:37

complaining about the output of the

520:39

generation from the model or praising it

520:42

and so on. So more like implicit signals

520:44

we we sense through all the traces and

520:47

also there's environment feedback where

520:49

you know what actually happened when the

520:51

code run whether the command succeeded

520:53

or failed and so on. So we basically use

520:56

these you know scans through all these

520:58

sessions traces every user message

521:01

assistant action, tool result, and feedback,

521:04

and aggregate them into you know some of

521:06

these signals, like success rate, praise

521:09

over complaints, steerability, bash recovery,

521:13

tool hallucination, and each of these

521:15

signal can produce the ranking right you

521:18

can measure precisely you know which

521:20

model performs better than other in this

521:22

particular signal and we combine that

521:24

into the final um leaderboard that you

521:27

see on you know on the website. Um so um

521:33

that's what it looks like today. You

521:36

see this view has five

521:39

different signals and model performed

521:41

differently across board and right now

521:44

Fable 5 is the number one model, with

521:46

a net improvement of about

521:48

14% over the average which is the you

521:52

know average of all the models followed

521:54

by call opus GPD fivei high and what's

521:59

interesting about this leaderboard is that

522:01

you can look at the signal by signal um

522:04

the model may be really really good at

522:06

test success but sometimes weaker in

522:08

terms of like you know stability in

522:11

terms how do you control the model and

522:12

you can see exactly like where the model

522:15

is failing and so on and we are going to

522:17

add you know more and more signal richer

522:19

signal to capture these failure pattern.
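
The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the leaderboard's actual method: it assumes every trace has already been reduced to hypothetical (model, signal, value) records where 1.0 means a good outcome, and it assumes the per-signal results are combined with a plain unweighted mean. The function names and the sample data are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (model, signal, value), where 1.0 means a good outcome.
# A "lower is better" signal such as hallucination would be inverted first.
records = [
    ("model_a", "task_success", 1.0),
    ("model_a", "task_success", 1.0),
    ("model_a", "praise", 0.0),
    ("model_a", "praise", 1.0),
    ("model_b", "task_success", 1.0),
    ("model_b", "task_success", 0.0),
    ("model_b", "praise", 0.0),
    ("model_b", "praise", 0.0),
]


def signal_rates(records):
    """Mean value of each signal for each model, as rates[signal][model]."""
    buckets = defaultdict(lambda: defaultdict(list))
    for model, signal, value in records:
        buckets[signal][model].append(value)
    return {
        signal: {model: mean(values) for model, values in by_model.items()}
        for signal, by_model in buckets.items()
    }


def net_improvement(rates):
    """For each signal, a model's rate minus the average rate over all models."""
    result = {}
    for signal, by_model in rates.items():
        average = mean(by_model.values())
        result[signal] = {model: rate - average for model, rate in by_model.items()}
    return result


def leaderboard(records):
    """Rank models by the unweighted mean of their per-signal net improvements."""
    net = net_improvement(signal_rates(records))
    models = {model for by_model in net.values() for model in by_model}
    scores = {
        model: mean(net[signal][model] for signal in net if model in net[signal])
        for model in models
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


board = leaderboard(records)
```

Each signal still yields its own ranking through `net_improvement`, so a model that leads overall can be inspected signal by signal to see where it falls behind.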

522:22

So methodologically, the core idea is

522:25

basically a randomized controlled trial

522:27

where we intervene on an agent component.

522:29

We measure the causal effect of

522:32

any given component on the task outcome,

522:34

the signal that we care about. And the

522:37

estimand is basically the causal

522:39

effect of the orchestrator model,

522:43

right now. But this

522:47

framework is general enough that we can

522:49

also measure the interaction effects

522:52

between different components. For

522:55

example, let's say you want to measure

522:57

tools, or you want to measure a different

522:59

harness or a different system

523:00

prompt, and so on. All of these are

523:02

possible within this framework, and we're

523:04

going to evaluate that too.

523:07

And if you are interested, more technical

523:09

details are published in our blog

523:11

post.
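
The randomized-trial idea can be sketched in a few lines. This is only an illustration under stated assumptions: sessions are assigned an orchestrator model uniformly at random, the outcome is a single 0/1 success signal, and the effect is estimated as a difference in means against the average over models. The function names, the simulated success rates, and the seeds are all hypothetical and are not taken from the talk or the blog post.

```python
import random
from statistics import mean


def assign_models(session_ids, models, seed=0):
    """Randomly assign one orchestrator model to each session (the intervention)."""
    rng = random.Random(seed)
    return {session_id: rng.choice(models) for session_id in session_ids}


def effect_vs_average(assignment, outcomes):
    """Each model's mean outcome minus the mean of the per-model means."""
    by_model = {}
    for session_id, model in assignment.items():
        by_model.setdefault(model, []).append(outcomes[session_id])
    means = {model: mean(values) for model, values in by_model.items()}
    baseline = mean(means.values())
    return {model: value - baseline for model, value in means.items()}


# Simulated data: the true success rates below are assumptions for the demo.
sessions = list(range(1000))
assignment = assign_models(sessions, ["model_a", "model_b"])
true_rate = {"model_a": 0.7, "model_b": 0.5}
rng = random.Random(1)
outcomes = {
    session_id: 1.0 if rng.random() < true_rate[model] else 0.0
    for session_id, model in assignment.items()
}
effects = effect_vs_average(assignment, outcomes)
```

Because assignment is random, differences in mean outcome between the groups can be attributed to the model rather than to which users or tasks happened to pick it. Randomizing a second component, for example the system prompt, in the same way would allow interaction effects to be estimated from the cross-tabulated means.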

523:13

So we have been tracking, like I

523:18

said, all the major releases in Agent Arena.

523:20

One of the releases happened a couple of

523:22

weeks ago, Fable 5 in Agent Arena. So

523:25

if you follow us on X, you will

523:28

see all the latest releases. And

523:31

the interesting thing about this

523:33

leaderboard is that, because this is real data,

523:35

right, based on millions of agentic

523:37

traces, you can slice it into any task

523:41

distribution you care about. So for

523:43

example, let's say you care about

523:45

GDP tasks, which are more like

523:47

economically valuable professional work,

523:50

versus consumer use cases. You

523:53

can do some data analysis to

523:55

slice the data. And one insight

523:59

here is that GPT-5.5 is

524:01

actually pretty good in terms of

524:04

GDP tasks, and

524:09

Gemini tends to do better in consumer

524:11

use cases. So basically the best

524:14

model generally depends on what

524:16

you're doing and the

524:18

distribution you care about.

524:20

And on the other side is the cost,

524:23

right? Cost matters too.

524:25

We can plot the net

524:28

improvement, which is performance, against

524:30

the average cost to help you

524:32

see the Pareto frontier. Here you can see

524:34

Fable is the one that's the best; it costs

524:37

about $10 per session, and GPT-5.5 is still

524:40

a little

524:41

bit cheaper, and GLM 5.2 and Kimi are

524:45

the most efficient ones. So with

524:48

this data you can decide which one is the best

524:50

model for your budget.
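
To make the Pareto frontier concrete, here is a small sketch of how it could be computed from (cost per session, net improvement) pairs. The model names and numbers are placeholders for illustration only, not the leaderboard's data, and `pareto_frontier` is a hypothetical helper.

```python
def pareto_frontier(points):
    """Return the models that no other model beats on both cost and improvement.

    points maps a model name to (cost_per_session, net_improvement).
    A model is kept if every cheaper-or-equal model has a lower improvement.
    """
    frontier = []
    best_gain = float("-inf")
    # Sort by cost ascending; at equal cost, consider the higher gain first.
    ordered = sorted(points.items(), key=lambda item: (item[1][0], -item[1][1]))
    for name, (cost, gain) in ordered:
        if gain > best_gain:
            frontier.append(name)
            best_gain = gain
    return frontier


# Placeholder numbers, chosen only to exercise the function.
points = {
    "model_a": (10.0, 0.14),
    "model_b": (6.0, 0.10),
    "model_c": (7.0, 0.05),
    "model_d": (1.0, 0.02),
}
frontier = pareto_frontier(points)
```

In this made-up data, `model_c` drops out because `model_b` is both cheaper and better. Every model left on the frontier is a defensible choice for some budget.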

524:52

Another dimension is tokens. Higher-performing

524:54

models sometimes generate more

524:57

output tokens, like thinking

524:58

models, but not always. You

525:02

can see here that GPT-5 is relatively

525:05

more efficient than other models. And

525:08

the other interesting thing here is that

525:10

if you only look at the list price, you

525:13

may see that some of the models have the

525:15

same price, but if you actually put them in

525:17

the real world, some of the models will

525:19

use more tokens for the same task,

525:22

right? So we can show here that,

525:25

for example, GPT-5.5, although it has

525:27

a similar list price to Opus,

525:31

in the real world it uses

525:33

fewer tokens to achieve the same

525:35

task, which is more efficient than the

525:38

others, as you can see. So to

525:42

summarize: if you are building an

525:44

agentic app, you should

525:47

definitely be logging your agentic

525:49

traces, to log all the

525:51

interactions between the agent and the users

525:53

and customers, and then be able to

525:57

look into the data, mine it for

525:58

insights, and measure the outcomes linked

526:01

to whatever business metrics you care about,

526:03

and use that real-world data to

526:06

choose the best model for you. And

526:08

where we are headed next: we are

526:11

obviously going to add a lot of

526:12

different connectors to bring in more

526:14

user context and enable live

526:17

evals for many different kinds of

526:19

agents, such as coding agents on real repositories.

526:22

And we also want to bring in more

526:26

complex tasks from professional users and slice

526:28

them into different categories to help

526:29

you understand how a model is doing in

526:32

those categories, as well as

526:34

richer signals for developers to use

526:38

to pick which model is the best, as well

526:40

as rubrics to do more fine-grained

526:42

scoring, and even collaborating

526:45

with the user to define what good looks

526:47

like. So that's it for me. I would

526:51

love to hear your feedback, or if you

526:52

have any questions, feel free to reach

526:55

out. You can find more insights on our

526:57

leaderboard at arena.ai, or follow us on

527:00

X. We also publish technical blog posts

527:03

regularly, and yes, we are also

527:06

hiring, so check out this link

527:08

or just DM me on X to reach out. Thank

527:11

you.

527:18

Please welcome back our MC, director of

527:21

technology at Oliver Wright Americas,

527:24

Deina Dias.

527:36

Hey everybody, thank you so much and

527:39

give yourselves a great round of

527:42

applause for being here till the end.

527:45

Yeah,

527:47

thank you guys. We really truly saved

527:51

the best for last. So, the startup

527:53

battle, I lied to y'all. It's not

527:55

tonight, it's tomorrow night along with

527:58

the closing speaker notes. So please be

528:01

there. We look forward to being there. So

528:04

thank you for the incredible set of

528:07

talks for our afternoon keynotes, and a big,

528:11

big thank you to the organizers. We

528:15

truly have incredible sponsors. The

528:18

event could not have happened without

528:20

them. We're incredibly excited to

528:23

partner with so many wonderful

528:26

organizations.

528:28

presenting sponsor

528:30

Microsoft.

528:37

Okay. Okay.

528:40

Where where is it?

528:44

Okay. So, Lav and Platinum sponsor

528:55

and our gold sponsor

529:03

and of course our silver and bronze

529:06

sponsors.

529:11

Thank you all. Have a marvelous rest of

529:14

your evening and we'll see you tomorrow

529:17

morning.

529:48

It's really incredible what is going on

529:50

in the world today.

530:17

allows them to unlock more and more

530:19

levels of automation.

530:24

AI writes code faster than humans can

530:28

review it.

530:33

Everything.

531:17

Yeah.

Interactive Summary

The video features a series of keynote presentations at the AI Engineer World's Fair, focusing on the evolution of AI agents, coding assistants, and the shift towards 'agentic' workflows. Speakers highlight the transition from simple generative chat models to sophisticated agentic systems capable of autonomous work, verification, and research. Key themes include the importance of reliability, the shift from model-centric to system-centric development, the use of 'loop engineering' for self-improving AI, and the necessity of human judgment and accountability in deploying these systems.
