Agents are Robots Too: What Self-Driving Taught Me About Building Agents

Agents are Robots Too: What Self-Driving Taught Me About Building Agents — Jesse Hu, Abundant

Transcript

0:02

All right. So this is my talk called

0:04

Agents Are Robots Too. I've given different

0:07

variants of this talk in person for

0:10

different events, but this is the first

0:11

one that I've done for coding agents.

0:15

So to kick things off, um just a little

0:17

bit about me. I've been a lifelong ML

0:21

engineer and I've worked at places like

0:22

YouTube and Google where I worked on the

0:25

two-tower embedding model as well as

0:27

some early work on BERT and mixture of

0:29

experts.

0:31

I worked on ML and robotics at Waymo

0:34

where a lot of my focus was on the data

0:35

side as well as reward modeling and

0:38

evaluation.

0:40

And most recently, I've been working on

0:42

a company called Abundant, where we work

0:44

on a lot of the same concepts applied to

0:47

datasets for foundation model labs and

0:50

their training for agentic coding

0:52

models.

0:55

Um, none of this will cover any inside

0:58

information about Waymo, but we'll

1:00

instead cover some general topics that

1:04

are carried over from self-driving and

1:06

robotics into digital agents.

1:11

So I'll kick things off in kind of like

1:13

talking about what some of the parallels

1:15

are. And I think one of the main things

1:17

is that you sort of have this 1% versus

1:20

99% problem where you think that the

1:23

model is doing most of the work. But

1:25

when you get into real world

1:27

applications, the model is only doing 1%

1:29

of the work and 99% of the work goes

1:31

into other things. So in robotics you

1:33

have the hardware and sensors and

1:35

actuators, you have integration and

1:36

deployment and you have this whole

1:38

offline stack that does simulation

1:41

training um and other things. In agents

1:45

you also have this. So if we take a look

1:46

at the two stacks

1:48

um so in in robotics you have hardware

1:51

and you have actuators you have the

1:52

fleet and in agents you also have um

1:56

sort of like a body right whereas

1:58

robotics is you know very obviously

2:01

embodied because you go from a brain to

2:03

a physical body. In agents you go from a

2:05

model to sort of a body of a digital

2:08

robot that includes tools. So now we

2:10

have APIs and MCPs as well as more

2:13

advanced uh embodiment in terms of the

2:16

terminal and the browser and the VM. So

2:18

you're starting to see like the robots

2:22

hands and arms and legs to even more

2:25

advanced things like the entire OS and

2:27

persistent file systems and things like

2:29

that. Um in addition you have the

2:31

offline stack, which also transfers over. So

2:33

we're not just finished when we have the

2:35

model. We also have to continuously

2:36

retrain. We have to monitor these

2:38

things. We also have human feedback

2:40

loops and all this other stuff that we

2:42

have to build as far as the tooling to

2:44

even support development of the agent.

2:47

And that's like sort of one of the first

2:49

learnings that I I want to share is that

2:51

um often times in self-driving people

2:53

would often talk about the winning team

2:56

not just having the best model and the

2:58

best online stack but having the best

3:00

offline stack because that enables

3:01

developers to be much faster and ship

3:04

much more reliably.

3:08

So moving on, there's this concept I

3:10

want to share in robotics of open loop

3:12

and closed loop. This is very simply uh

3:16

being able to take an action or to uh

3:19

move an actuator or a motor and then

3:21

being able to get the feedback of how

3:22

that actually uh happened in the real

3:25

world so that you can close the loop on

3:27

that actual action. So, for example, if

3:29

I turn the wheel left, I want to

3:31

actually measure uh how much did my car

3:34

actually turn so that I can recalibrate

3:36

and make sure that I'm turning exactly

3:38

the amount I intended to because these

3:40

things aren't perfect.

3:42

In the same way, we're starting to see

3:44

where some open-loop things actually need

3:46

to be closed. So, for example, if I run

3:48

a bash command and I run an open-ended

3:51

process, well, sometimes I can't observe

3:53

the outputs, at least not in real time.

3:55

I can't measure whether that bash

3:57

command completed and I can't exit early

3:59

if I need to. So that that's an example

4:02

of where we need to make things more

4:03

closed loop.
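
As a rough illustration of closing the loop on a bash command, here is a minimal Python sketch. The function name `run_closed_loop` and its parameters are made up for this example, and it assumes a POSIX system. It streams output line by line, lets the caller stop early once a condition is seen, and enforces an overall timeout, which are the three things the open-loop version cannot do.

```python
import os
import queue
import signal
import subprocess
import threading
import time


def run_closed_loop(cmd, timeout_s=5.0, stop_when=None):
    """Run a shell command while observing its output as it is produced.

    Returns (lines, status, returncode), where status is one of
    "completed", "stopped_early", or "timed_out".
    `stop_when` is an optional predicate applied to each output line.
    """
    proc = subprocess.Popen(
        cmd,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # own process group, so children can be killed too
    )
    q = queue.Queue()

    def pump():
        # Forward each output line to the main loop as soon as it arrives.
        for line in proc.stdout:
            q.put(line.rstrip("\n"))
        q.put(None)  # sentinel: the output stream closed

    threading.Thread(target=pump, daemon=True).start()

    lines, status = [], "completed"
    deadline = time.monotonic() + timeout_s  # overall budget for the command
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            status = "timed_out"
            break
        try:
            line = q.get(timeout=remaining)
        except queue.Empty:
            status = "timed_out"
            break
        if line is None:
            break  # the command finished on its own
        lines.append(line)
        if stop_when is not None and stop_when(line):
            status = "stopped_early"
            break

    if status != "completed":
        try:
            os.killpg(proc.pid, signal.SIGKILL)  # kill the shell and its children
        except ProcessLookupError:
            pass  # the process group already exited
    proc.wait()
    return lines, status, proc.returncode
```

For example, `run_closed_loop("make serve", stop_when=lambda line: "listening" in line)` would return as soon as a (hypothetical) server reports that it is up, instead of blocking until the process exits.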

4:06

Another thing that's kind of nuanced is

4:08

the fact that um we are implicitly

4:11

discretizing in time. So what do I mean

4:13

by that? There are explicit design

4:16

choices that we need to make in robotics

4:18

about the input space and then the

4:21

action space. And particularly in the

4:23

input space, you have different

4:24

modalities. So you have the option to

4:27

use vision, lidar, radar, all these

4:31

different inputs and then combine them

4:32

in different ways to get a sense for the

4:35

world. You also have the ability to

4:37

discretize the world in different ways.

4:39

You can sample things every second. You

4:41

can sample things only when they're

4:42

pushed to you. Or you can sample things

4:44

in this example on like 50 Hz, so 50

4:46

times per second.

4:48

So that means I'll keep updating the

4:50

state of the world and I'll keep

4:52

replanning uh and I'll react to the

4:54

world very quickly. However, in agents

4:57

we've kind of done this implicitly. So

4:59

in agents we often have a conversation.

5:02

So we wait to take our turn. We execute

5:05

a tool, wait for the entire response.

5:07

Maybe we do that in sort of weird ways,

5:10

but we don't do this thing that's

5:12

natural in robotics where we keep sampling

5:13

from the world and we keep interacting

5:15

in real time. So this is an implicit

5:17

design decision that is made that has

5:20

its pros and cons. The pros are it's

5:22

very easy to reason about when we have

5:24

turns. It's very easy to reason about a

5:25

conversation. It's really easy to reason

5:27

about an input and output of a turn. Um

5:31

but the downside of that is

5:34

that we don't get to do things in real

5:36

time. You can't immediately respond to a

5:38

pop-up. We can't immediately interact

5:40

with a long-running process. So these are

5:43

the implications of the design decisions

5:46

that we make.
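
To make the contrast concrete, here is a small Python sketch of the robotics-style alternative: a loop that samples the world at a fixed rate and re-decides on every tick, rather than waiting for a whole turn to finish. Everything in it (`fixed_rate_loop`, the toy `observe` and `decide` functions, the pop-up at tick 3) is hypothetical and only meant to illustrate the idea.

```python
import time


def fixed_rate_loop(observe, decide, hz, ticks, sleep=time.sleep):
    """Sample the world `hz` times per second and re-decide on every tick.

    `sleep` is injectable so the loop can be exercised without real delays.
    For simplicity this ignores the time spent inside observe() and decide().
    """
    period = 1.0 / hz
    trace = []
    for tick in range(ticks):
        obs = observe(tick)
        trace.append((tick, obs, decide(obs)))
        sleep(period)
    return trace


# Hypothetical world: a long-running job that shows a pop-up at tick 3.
def observe(tick):
    return "popup" if tick == 3 else "running"


def decide(obs):
    return "dismiss_popup" if obs == "popup" else "wait"


trace = fixed_rate_loop(observe, decide, hz=50, ticks=6, sleep=lambda s: None)
# The agent reacts on the same tick the pop-up appears.
reacted_at = next(t for t, _, action in trace if action == "dismiss_popup")
```

A turn-based agent that launched the same job as a single tool call would only see the pop-up once the call returned, which is the trade-off being described: turns are easy to reason about, but they hide what happens in between.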

5:49

So more on those uh inputs and action

5:52

spaces. So in inputs we actually have

5:55

handcrafted a bunch of tools, a bunch of

5:58

ways that we can stream from tools, we

6:01

can stream from the user, but there are

6:03

other options out there. So one example

6:05

I want to highlight is the Terminus

6:07

agent from Terminal-Bench. Um, so this

6:09

is very very awesome and unique in that

6:11

they're actually using a tmux stream. So

6:14

you can actually do character by

6:15

character uh input and output if you

6:17

want to where you can do things like

6:19

Ctrl-C or you can do various window

6:22

commands if you want to. Um, and so that

6:24

that's a very unique and more flexible

6:25

way of interacting with our action space

6:29

that we don't traditionally think about

6:30

when designing agents.
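
For a feel of what a keystroke-level action space looks like, here is a minimal Python sketch that drives a tmux session through the standard `tmux send-keys` and `tmux capture-pane` commands. This is not Terminus's actual implementation; the helper names and the session name `agent` are assumptions made for illustration.

```python
import subprocess


def send_keys_cmd(session, *keys):
    """argv for typing keys into a tmux session.

    tmux key names such as "Enter" or "C-c" (Ctrl-C) are sent as key presses;
    other strings are typed literally.
    """
    return ["tmux", "send-keys", "-t", session, *keys]


def capture_cmd(session):
    """argv for printing the visible contents of the session's pane to stdout."""
    return ["tmux", "capture-pane", "-p", "-t", session]


def run(argv):
    """Run a command and return its stdout, raising if it fails."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


def demo(session="agent"):
    """Start a job, interrupt it with Ctrl-C, and read back the screen.

    Not called automatically, since it needs tmux to be installed.
    """
    run(["tmux", "new-session", "-d", "-s", session])
    run(send_keys_cmd(session, "sleep 600", "Enter"))  # start a long-running job
    run(send_keys_cmd(session, "C-c"))                 # interrupt it mid-flight
    screen = run(capture_cmd(session))                 # observe the terminal state
    run(["tmux", "kill-session", "-t", session])
    return screen
```

The point of the design is that the agent's actions are keystrokes and its observations are screen contents, so it can interrupt, scroll, or answer a prompt without a dedicated tool for each case.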

6:33

Other ways in which you could do action

6:35

space in robotics: we could plan in

6:37

purely XY. So you move up one block and

6:40

then move over by two. You can do that

6:42

in coarse ways. You can do that in

6:44

continuous space. You can do things in

6:46

2D. You can do things in 3D. You can do

6:48

things in acceleration instead of just

6:50

position. You can do things in

6:51

velocities. Um in agents we should also

6:54

think about this although it's less

6:55

relevant. You don't have to

6:57

think about just interacting with uh

6:59

MCPs and tool calls. Like I mentioned

7:01

with Terminus, you can interact with the

7:04

computer at a character level. You can

7:06

even do things like the Dreamer paper

7:08

where you interact with the computer

7:10

purely by interacting at 20 frames per

7:13

second with the mouse clicks and

7:14

keyboard. So the question is what

7:17

trade-offs are we making and what

7:18

implicit or explicit design decisions

7:20

have we made that either enable us to do

7:22

more or limit what we can do with

7:25

our agent.

7:30

The next thing I want to talk about is

7:31

how we're going from stateless processes

7:33

to stateful processes. If you think

7:36

about driving in a video game, you can

7:38

spawn from nothing. And you don't have

7:39

to worry about where I came from and

7:41

where I go after I terminate the

7:42

session. You just have to worry about

7:43

what I do during that session. But

7:45

that's obviously not true in the real

7:46

world. In the real world, you have a

7:48

real car. That car takes up mass. It

7:51

takes up space. And so, you do have to

7:53

worry about where that car ends up. And

7:55

you have to worry about how we got into

7:56

the scene, right? Everything is moving.

7:58

There are implications to how fast

8:00

you're moving and how fast everyone else

8:01

is moving. Similarly, we're going from

8:03

these stateless agents to more stateful

8:06

agents. Right? Before we just had to

8:08

spin up a session, end the session, get

8:10

an artifact out of it. That's great.

8:12

Now, we have VMs. VMs that are stateful

8:15

both in terms of what's running, but

8:17

also the persistent file store. And so,

8:20

now when we have agents and we spin them

8:22

up, we have to consider, hey, what is

8:24

the entire space that we're running

8:25

into? What are all the Slack messages

8:27

that are currently going on? What is the

8:28

state of the world? What are all of the

8:30

things that I have to interact with? And

8:32

not only how we deal with that

8:33

online, but how does that impact how we

8:35

do evaluation and simulation? So

8:38

this is one of the more interesting

8:39

things that's happening in agent space

8:41

right now.

8:43

One of the more nuanced things more

8:45

familiar to the people that are working

8:46

on modeling and training is a sort of

8:49

like DAgger and out-of-distribution

8:50

problem. So just like in robotics and

8:52

agents, we have options of training our

8:55

models with imitation uh imitation

8:57

learning being similar to the SFT from

9:00

human demonstrations versus RL. And RL

9:03

can be in simulation or it can be in

9:05

other ways as well. But one of the known

9:07

issues with imitation is that as soon as

9:10

you get a little bit out of distribution

9:12

or off policy in relation to the human

9:15

examples, you get really out of

9:16

distribution. And you can start to see

9:18

this in agents such as browser agents.

9:20

When you see a pop-up that never

9:21

happened in training because humans

9:23

actually interact with pop-ups quite

9:25

naturally, it gets confused and it gets

9:26

really confused. So this is an issue of

9:28

cascading errors that you can see has

9:30

been studied for quite a while in

9:32

robotics.
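
The DAgger idea mentioned above can be sketched in a few lines of Python. This is a toy under heavy assumptions: a 1-D "lane keeping" world, a hand-written `expert`, and a 1-nearest-neighbour lookup standing in for a real learner. It also skips the expert/learner mixing used in the full algorithm. What it keeps is the core loop: roll out the learner, have the expert label the states the learner actually reached, aggregate, and refit.

```python
def expert(x):
    """Expert steers back toward the lane centre at x = 0."""
    return -1 if x > 0 else (1 if x < 0 else 0)


def fit(dataset):
    """1-nearest-neighbour 'policy': copy the action of the closest labelled state."""
    def policy(x):
        nearest = min(dataset, key=lambda sa: abs(sa[0] - x))
        return nearest[1]
    return policy


def rollout(policy, x0, steps=5):
    """Return the states visited when following `policy` from `x0`."""
    xs, x = [], x0
    for _ in range(steps):
        xs.append(x)
        x += policy(x)
    return xs


def dagger(demos, x0, iterations=3):
    dataset = list(demos)
    for _ in range(iterations):
        policy = fit(dataset)
        visited = rollout(policy, x0)                 # states the learner reaches
        dataset += [(x, expert(x)) for x in visited]  # expert relabels those states
    return fit(dataset)


# Demonstrations only cover states near the centre, as human data often does.
demos = [(0, 0), (1, -1)]

cloned = fit(demos)       # plain imitation on the demonstrations
fixed = dagger(demos, x0=-4)

stuck = rollout(cloned, -4)      # never recovers from the unseen state
recovered = rollout(fixed, -4)   # walks back to the centre
```

Starting at the unseen state `x = -4`, the cloned policy copies the nearest demonstration ("do nothing") and stays stuck, while the DAgger-trained policy has been shown corrections for exactly the states it drifts into.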

9:35

And the general theme around this is

9:36

that actions have consequences. We're

9:39

not just dealing with classification

9:41

models. We're not just dealing with

9:43

prediction models or sequences. We're

9:45

dealing with a whole new paradigm in

9:47

which you predict, you act, and then you

9:49

deal with the consequences of that

9:51

action and then re-evaluate everything

9:53

you've done before. And that's really

9:55

tough because actions have consequences

9:57

and actions have consequences in a very

9:59

messy real world.

10:03

And as a result of the complexity of the

10:05

real world, that's where simulation

10:07

comes into play such that you can

10:09

represent all of these complexities and

10:11

all the messiness of the real world into

10:13

your starting state and you can play

10:15

through uh the real world not just in a

10:18

single path but all the paths that you

10:20

could possibly take as your agent

10:22

changes. So we call that playing out

10:24

counterfactuals.

10:28

The other thing to be aware of, and

10:29

this is sort of like classic

10:31

reinforcement learning or robotics, is

10:32

the concept of an MDP. And so that's

10:35

where there's an agent that takes into

10:37

account a state and a reward and then

10:40

we'll take actions on an environment or

10:42

a world. And this is just sort of a

10:44

formalism about how to conceptualize how

10:47

you're running the agent loop. And these

10:49

are just useful primitives to have on

10:51

hand so that you can describe and you

10:53

can communicate what's going on.
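
As a minimal sketch of that formalism, the loop below spells out the MDP pieces as code: the agent reads a state, picks an action, the environment returns the next state and a reward, and the cycle repeats. `CounterEnv` and `policy` are toy stand-ins invented for this example, not anything from the talk.

```python
from dataclasses import dataclass


@dataclass
class CounterEnv:
    """Toy environment: reach `target` by stepping +1 or -1 from `state`."""
    target: int = 3
    state: int = 0

    def step(self, action):
        self.state += action
        done = self.state == self.target
        reward = 1.0 if done else -0.1  # small cost per step, bonus at the goal
        return self.state, reward, done


def policy(state, target=3):
    """Trivial agent: move toward the target."""
    return 1 if state < target else -1


def rollout(env, policy, max_steps=10):
    """Run the agent loop and record (state, action, reward, next_state) tuples."""
    state, total, trajectory = env.state, 0.0, []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, next_state))
        state, total = next_state, total + reward
        if done:
            break
    return trajectory, total


trajectory, total_reward = rollout(CounterEnv(), policy)
```

In an agent setting, the state would be the context and environment (files, terminal, browser), the actions would be tool calls or keystrokes, and the reward would come from a verifier or evaluation.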

10:57

The reason this is important is because

10:59

we're moving from just plain chat models

11:02

to agent models that take action. For

11:05

context, a lot of self-driving uh

11:07

initially seemed really fast but was

11:09

really slow in progress because it was

11:11

sort of the same issues. So everybody in

11:14

the space from 2017 to 2020 was really

11:18

focused on perception models and

11:19

thinking that all you really needed to

11:21

do was uh take the state of the world

11:23

and make boxes and then you can drive

11:25

around the boxes really easily. It turns

11:28

out that assumption wasn't necessarily

11:29

true and there's a lot of hidden

11:31

complexity in creating action models and

11:34

not just predictive models. Similarly in

11:38

language models we can see that we can

11:40

understand basically everything about

11:42

the world that comes in via text. We can

11:44

generate really long sophisticated

11:46

reasoning traces.

11:49

But when you take these really

11:51

sophisticated plans, really

11:53

sophisticated chains of tool calls and

11:54

you implement them in the real world,

11:56

you can see things go wrong all the

11:58

time. You can see the tool calls fail

12:00

and the agent fail to progress. You

12:02

can see the agent fail to recover from

12:04

its own mistakes. This is sort of the

12:07

loop that is deceptively tricky about

12:10

when you get into actions from

12:11

predictive models. This is really where

12:14

the bulk of the work had been and where

12:17

a bulk of the work will continue to be

12:19

in agents as well.

12:22

I also want to point out in both of

12:23

these cases in self-driving when it

12:25

comes to robotics

12:27

and in code when it comes to digital

12:29

agents we're actually very lucky in

12:31

both. Why are we lucky? You can

12:33

see self-driving working really well in

12:35

production today in limited cases

12:37

whereas the rest of robotics is still

12:39

limited to demos and this is because of

12:42

how we have this machine that's

12:46

predefined with human controls that's

12:48

been really well refined over the last

12:50

few decades and then it has electronic

12:53

controls and it has built-in telemetry

12:55

right so it's something that you already

12:57

have a predefined interface to take

13:00

actions with and you have predefined

13:01

interfaces to collect the data from.

13:05

So that makes it really convenient to

13:08

operate through code and it makes it

13:10

really convenient to perform machine

13:12

learning and learning in general on. We

13:14

have this predefined interface with

13:16

predefined actions and predefined

13:18

telemetry and that makes it much much

13:21

easier of a task than going into some of

13:23

these other knowledge work tasks that

13:25

require the full desktop and things that

13:28

are less easy to codify.

13:31

So when we explore new domains, these

13:34

are some of the things we want to

13:35

consider. Is there somewhere where we

13:38

already get a predefined human interface

13:40

that makes it easy to do those two

13:42

things?

13:44

Finally, I want to talk about one of the

13:46

things that we face from day-to-day and

13:48

that's the hill climbing process. And if

13:50

you're not familiar with hill climbing,

13:51

it's basically this iterative process of

13:54

building or iterating on a complex

13:57

system such as an LLM or an agent, where

14:00

you don't always make forward progress.

14:02

So before when we were working on full

14:04

stack web applications or working on

14:05

more simple systems, you implement a

14:07

feature and you probably guarantee that

14:09

feature will arrive into prod. Nowadays

14:12

you have this sort of like nebulous

14:14

metric that you're trying to hit. And

14:15

the only way you can do that is by

14:17

guessing and checking. So you have a

14:19

metric like a benchmark, then you make

14:21

some guess, you run some experiment and

14:23

you hope you go up, sometimes you go

14:24

down, but as long as you keep going up

14:26

and up and up, then you can eventually

14:28

reach your goal. And that's the concept

14:29

of hill climbing. But how we do it in

14:32

the self-driving way is a little bit

14:34

more sophisticated. We actually start by

14:36

learning and then going through

14:37

simulation. Simulation helps you deploy

14:39

with confidence and it also helps your

14:41

learning. But then once you deploy, you

14:43

can actually get logs from the real

14:45

world that feed back into your

14:46

simulation engine. That's really

14:48

important because you want to ground

14:49

your simulation on something. And so you

14:51

start to get this full loop. The logs

14:53

actually become a much more important

14:55

part of the process than they are today.

14:57

You can get a lot more insights than

15:00

just your numbers, right? So like a 70%

15:02

on a benchmark will tell you a little

15:04

bit, but if you start to break them down

15:05

into different categories, different

15:07

cities, different ways you can mess up,

15:09

start to triage the individual failures,

15:11

you can get a lot more insights about

15:13

how to improve your system and on where

15:16

to improve. And that's a lot of what

15:18

we've developed our tooling around and a

15:20

lot of what we've developed our

15:21

processes around that help some of our

15:23

customers with their hill climbing.
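
Here is a small Python sketch of the breakdown idea: the same run that reports 70% overall looks very different once it is sliced by category. The records, category names, and helper `pass_rate_by` are made up purely for illustration and are not real benchmark results.

```python
from collections import defaultdict


def pass_rate_by(results, key):
    """Group run records by `key` and return the pass rate for each group."""
    totals, passes = defaultdict(int), defaultdict(int)
    for record in results:
        totals[record[key]] += 1
        passes[record[key]] += int(record["passed"])
    return {group: passes[group] / totals[group] for group in totals}


# Hypothetical run logs: 10 tasks, 7 passed overall.
results = (
    [{"category": "edit_file", "passed": True}] * 4
    + [{"category": "popup", "passed": True}] * 2
    + [{"category": "long_running_process", "passed": True}] * 1
    + [{"category": "long_running_process", "passed": False}] * 3
)

overall = sum(r["passed"] for r in results) / len(results)
by_category = pass_rate_by(results, "category")
worst = min(by_category, key=by_category.get)
```

In this made-up data the single headline number hides that nearly all failures sit in one category, which is where triage of individual logs would start.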

15:25

Finally, like you know, we're only part

15:27

of the way there. At least this is a

15:28

metric from the remote labor benchmark.

15:31

And you know, I'd like to compare this

15:33

to where self-driving was back in the

15:35

beginning. And it's because we have

15:37

really great demos and we have really

15:38

great predictive models, but we're not

15:40

nearly there as far as end-to-end work

15:42

completion. A lot of the reasons are

15:44

because of the things I brought up

15:46

before with actions having consequences

15:48

and the complexity of the real world. To

15:51

recap, we've covered the parallels

15:53

between robotics and agents. Some of

15:55

those are having to do closed loop

15:57

systems, getting closed loop feedback,

15:59

how we discretize systems, how we pick

16:01

action and input spaces, how we can go

16:04

from stateless to stateful, how we're

16:06

going from predictive models to action

16:08

models, how we utilize simulation in

16:11

deployment and in training, um, and how

16:13

infrastructure is really important to

16:14

the entire development process. If

16:17

you've gotten this far, I'd like to say

16:19

congrats and you've become a master in

16:22

this new topic that we're calling

16:23

agentics because why not? Because, you

16:25

know, robotics sounds cool. Why not make

16:28

this agent development stuff just as

16:30

cool? Um because I think it takes a lot

16:32

of these core concepts and abstractions

16:34

to really make this go from something

16:37

that we hack on to something that has

16:39

dedicated real science and really

16:41

becomes a practice. And so if any of

16:43

these concepts are useful for you like a

16:45

lot of these things are pretty easy to

16:47

understand and read about. You can read

16:49

about open-loop and closed-loop control,

16:51

MDPs, fully versus partially observable

16:54

environments. You can read about DAgger.

16:56

Offline RL is a really cool topic

16:58

that is featured in more recent robotics

17:00

work. And then just like the intro

17:02

reinforcement learning book is all

17:04

great. You probably will understand

17:06

these things natively because the

17:08

problems are really obvious and easier

17:10

to understand in agent space. And

17:11

finally, you can read up on a lot of the

17:13

recent robotics literature as well since

17:16

a lot of the field is converging. So you

17:18

can just start from the papers.

17:20

Just as a recap, you know, agents are

17:23

robots too. They act in the real world.

17:24

They make mistakes. They have to

17:25

recover. And all of these little things

17:28

really matter. Thanks. You can feel free

17:30

to get in touch. Here's my email,

17:31

jesseabund.ai.

17:33

Feel free to send me any thoughts or

17:34

feedback. Thanks.

Interactive Summary

The video explores the parallels between robotics and digital coding agents, emphasizing that agents are effectively 'robots' operating in a digital environment. The speaker, drawing on their experience in machine learning and robotics, argues that success in building agents requires moving beyond simple predictive models to robust, closed-loop systems that handle actions and their consequences in the real world. Key topics covered include the shift from stateless to stateful processes, the importance of offline infrastructure, simulation strategies, and lessons from self-driving technology that can be applied to improve agentic development.