E23: I Spoke To The Man Building The Robotic Future.

Transcript

0:00

I'm excited to share this exclusive

0:01

robotics interview with the investing

0:03

community. Most people still think

0:05

Nvidia only powers AI in data centers,

0:08

but you're about to get an inside look

0:10

at how they're putting that same stack

0:12

into robots and physical AI. I'm joined

0:15

by Spencer Huang, product lead for

0:17

robotic software at NVIDIA. Spencer

0:20

spent the last four years scaling some

0:22

of the most advanced robotic systems on

0:24

the planet. And he had some surprising

0:26

things to say about where physical AI is

0:29

headed next and how soon it could

0:31

happen. But that's just one of the many

0:33

technologies I'll be covering live at

0:35

GTC next week. GTC is Nvidia's massive

0:38

AI conference, showcasing the biggest

0:41

breakthroughs in everything from

0:42

robotics and self-driving cars to AI

0:44

agents and the chips that power them.

0:46

They have tons of sessions on robotics

0:49

with speakers from Nvidia, Agility, and

0:52

even Tesla. And anyone who signs up for

0:54

a free online session at GTC with my

0:56

link can enter to win an Nvidia RTX 5090

1:00

graphics card. Just attend any session,

1:03

take a screenshot as proof, and send it

1:05

to me after the conference using the

1:06

links below. GTC should be on every

1:09

investor's radar and so should Nvidia's

1:11

ecosystem for physical AI because the

1:14

next ChatGPT moment won't be on your

1:17

screen. It'll be robots bringing AI into

1:20

the real world. Your time is valuable,

1:22

so let's get right into it. When I think

1:24

about the whole robotics industry, I

1:26

naively think about just building

1:29

humanoid robots, but there's obviously a

1:31

lot more to that. And you know, Jensen

1:33

announced a lot of really interesting

1:35

things about robotics during the keynote

1:37

earlier this week. So, can you walk us

1:39

through Nvidia's approach to robotics at

1:40

a high level?

1:41

>> Sure. So, you've you've heard Jensen

1:44

talk about the three computer solution.

1:46

And for robotics, the way that we think

1:48

of it is you need uh you need a computer

1:50

that allows you to train the brain,

1:52

>> right? So, this is something like a DGX.

1:54

You're training your video your your

1:56

vision language action model. You're

1:57

training up your your base model like a

1:58

VLM. Um, basically any model uh that

2:01

you're going to be using for cognition

2:03

is likely going to be trained on that

2:04

DGX. Yeah. But you still need to be able

2:06

to put that into a stack and test it

2:08

inside of a simulation or more

2:10

importantly for for humanoids and these

2:11

more autonomous skills. You want to be

2:13

able to train a skill in simulation. So

2:16

we need a computer to simulate the

2:17

world. And so you have a computer that's

2:19

that's there to train the brain, a

2:21

computer that simulates the real world.

2:23

And then we have a third computer which

2:24

is actually deployed in the real world.

2:25

And that would be IGX and AGX of Jetson.

2:28

And that gives you everything from the

2:30

brain to the body uh to the physical

2:32

apparatus inside of the world. And so

2:34

that's our three computer stack.

2:35

>> So the first one is really about

2:36

training the AI model, right? The second

2:39

one is really about making sure that

2:40

model gets as much practice as it can in

2:42

a digital world before it goes into a

2:44

physical world.

2:45

>> Yeah. And that simulated world could be

2:47

used not only for training but for

2:48

evaluation. So when you train a model uh

2:51

you know an LLM or these typical uh ML

2:54

models um you train it and then you

2:56

evaluate it. And with a simul with a

2:58

policy like a robotic policy, it's not

3:00

quite the same because you have to

3:01

interact with an environment and the

3:02

environment has to react back, right? I

3:04

poke something, it has to do something.

3:06

Uh, and so you need a simulated

3:08

environment. It's not necessarily just

3:10

uh am I classifying it as a cat

3:11

correctly. And so because of that, you

3:13

need to test it inside of something

3:14

that's a proxy to the real world. Now,

3:16

because we already have that proxy for

3:18

the real world called Omniverse, we can

3:19

also generate tons of synthetic data.

3:21

And so synthetic data compensates for

3:24

the lack of real data. And what we mean

3:26

by this for physical AI when we say that

3:27

we don't have lots of real data um LLMs

3:30

started with the compendium of knowledge

3:32

that humans have written over the last

3:33

couple centuries we've spent most of our

3:35

life trying to make sure that we can

3:36

instill our knowledge for the next

3:38

generation. And so it was already there

3:40

for us to to start combing through and

3:42

turning into these these language models

3:43

that you know eventually turned into

3:45

ChatGPT. For physical AI we don't have

3:48

the same information for contact data.

3:50

We don't know how to we haven't captured

3:52

what is it like when you take a rigid

3:54

body like a finger um like a bone or or

3:56

or a metal hand and interact with

3:58

something very very soft. That

4:00

interaction that data doesn't exist

4:02

>> and that implies that sorry not to

4:04

interrupt you but that implies that um

4:06

the video data out there isn't enough

4:08

like

4:08

>> it's not yeah so what video data gives

4:10

us and the reason why reasoning is so

4:11

important this year is it's actually if

4:14

you think of it the video models were

4:16

trained to understand semantics. How

4:18

does the world work? How do things inter

4:20

how do they relate to each other? When I

4:21

think of if I ask you to build a

4:23

kitchen, you're not going to put, you

4:24

know, a 4x4 inside of the kitchen.

4:26

You're not going to put uh a a chair on

4:29

top of the table. You're not putting a

4:30

cutting board on the floor. You know,

4:31

semantically, there's places where these

4:33

objects are supposed to live in relation

4:35

to the environment that they're in. So,

4:37

what what the video model gives robotics

4:39

is the ability to have this cognitive

4:40

reasoning. It gives you semantic

4:42

reasoning. It gives you the ability to

4:44

understand and interpret the world. But

4:45

what it doesn't do is tell you how the

4:47

world is going to interact when you

4:48

start interacting with it.

4:49

>> That physical data is where the gap is.

4:52

That's that gap. And that's why we call

4:53

it physical AI is when you're trying to

4:55

start interacting with the world, those

4:57

reactions also matter. And so we need to

4:59

have lots and lots of data of how to

5:01

interact with the world. Otherwise, you

5:02

might grab an egg at the same strength

5:04

you would grab a baseball because

5:06

otherwise you have no you have no

5:07

interpretation. They're both balls,

5:08

right? They're both spherical objects.

5:09

But the materials themselves change. And

5:11

so we need that. That's why we need to

5:13

use simulation like Omniverse or or

5:15

even Cosmos as a world model.

5:17

>> How do you determine like when data is

5:19

when simulated data is good enough?

5:21

>> Sure. That's um that's kind of the

5:22

million-dollar question. So simulated

5:25

data synthetic data in general is is

5:27

more of an art than a science. And the

5:28

reason I say that is because going back

5:31

to LLMs, when we have a corpus of data,

5:33

we could do things like data data

5:34

engineering, feature extraction. These

5:36

were because we had real data. We

5:38

understood okay once we have a data set

5:40

how can we start uh analyzing this data

5:42

set and saying well there's certain

5:43

characteristics and you know we call

5:45

them features that don't necessarily

5:46

have any impact perceptive impact on the

5:48

model. Um if that's the case then why

5:50

include it in the in the training data.

5:52

And so we were able to actually start

5:53

engineering the data and and creating a

5:55

good corpus of of training data that

5:57

results in a really well-trained model.

5:58

For physical AI we're lacking that. You

6:01

have lots of physical data that we can

6:02

collect but even real data we're not

6:04

sure what is good data or bad data. So

6:07

for instance, all of last year, you saw

6:09

lots of people doing teleoperation to

6:11

uh open drawers and and you know handle

6:14

different objects inside of like kitchen

6:15

environments or various industrial

6:17

environments. And what they were trying

6:18

to figure out is one, how do I capture a

6:21

human demonstration that's clean enough

6:23

so that way I can train a policy? When I

6:25

say clean, I mean uh humans are

6:27

imperfect. And when we grab something,

6:28

if it's a demonstration, you want to do

6:30

it perfectly because you want the robot,

6:32

you don't want them to train off of bad

6:33

demonstrations. You want good

6:34

demonstrations for the most part. And so

6:36

what is a good demonstration also kind

6:39

of compounds once you get to synthetic

6:41

data because a good demonstration may

6:42

visually look good but it might actually

6:44

not improve the model itself because it

6:46

might be looking for different features

6:47

that aren't necessarily included in the

6:49

dimensions of that data. And so there's

6:51

a lot of um open questions on what types

6:53

of data dimensions do we need inside of

6:55

this data and what types of modalities

6:57

do we need? Video, visual, contact,

6:59

action. And so we capture all sorts of

7:01

data that's not just text or video

7:03

visual anymore. It's action data, the

7:05

motions itself, it's contact data. When

7:07

I grab something, what what exactly? And

7:10

so these are all things that that we're

7:11

learning right now.

7:12

>> That's exciting.

7:13

>> Um I'd love to work through a specific

7:16

example just to understand the end-to-end

7:18

chain.

7:19

>> Sure.

7:19

>> So there's an awesome robot right down

7:22

the hall from us that's doing spinal

7:23

surgery. So, can we start with the AI

7:26

model training, walk through that piece,

7:28

then walk through how it would work in

7:31

simulation,

7:32

>> and then walk through what that would

7:33

mean for the physical robot in the final

7:35

account.

7:35

>> Sure. Um, spinal surgery is a a

7:37

challenging one. It's a good one to

7:39

choose. The the reason is because it's

7:41

rigid soft body.

7:42

>> Okay.

7:43

>> And so, what I'm going to describe is

7:44

not fully available today. It's it's

7:46

just going to describe the path that

7:47

we're on and the journey that we're

7:49

going to take.

7:49

>> Are those paths different depending on

7:51

the problem set? Sorry. I'm like, so

7:53

surgery versus industrial versus

7:56

>> warehousing, you know what I mean? This

7:57

is all very different.

7:58

>> It is. It is different, but don't

8:00

imagine it as the verticals. Imagine it

8:02

as the physical the physical problems.

8:04

Okay? So, for instance, I could be in an

8:06

industrial warehouse and I could be

8:08

picking up just boxes. And so, as long

8:10

as they're relatively they're rigid

8:11

boxes and they're nor you know, they're

8:13

they're known shape, known materials,

8:15

these are things that we could handle

8:16

today. If you want to do cable

8:18

management or wiring inside of a car,

8:20

that becomes much more difficult. But

8:21

that cable management and wiring is very

8:22

similar to threading needles and and

8:24

sutures for healthcare. Right? So if you

8:27

think of of it in the physical

8:28

characteristics of the problem, then the

8:30

verticals don't matter as much. And so

8:31

what we're trying to do is is figure out

8:33

between each vertical where are the

8:35

overlaps between the near-term problems.

8:37

So that way we can meet as many of the

8:38

customer needs as possible and because

8:40

we have to solve for it eventually. It's

8:42

just where do you start taking chunks?

8:44

And so, you know, to go end to end, we

8:46

could say, uh, let's let's start with

8:47

surgery. And I'll give you I'll give you

8:48

surgery, but I'm I'm going to pair it

8:50

right next to pick and place. Okay,

8:52

they're very very similar in a sense.

8:54

Um, because you're using some apparatus

8:57

to do some type of manipulation. So, the

8:59

first thing that you would do is I need

9:01

to understand what the task is. If it's

9:02

suturing or if it's picking up a box,

9:04

um, you're either going to be training a

9:06

policy in simulation where you're

9:07

fairly certain that you can train this

9:09

just, you know, off of behavior cloning

9:11

where I capture demonstrations. you

9:13

know, I watch I I put sensors on a

9:15

human. They teleop the robot a couple of

9:17

times and then we can generate more of

9:19

those actions and then train a policy

9:20

from that. So, we might do that for

9:22

either one of these. So, if you're

9:24

working with um with inner body, for

9:26

instance, it's all squishy things.

9:28

>> You know, they're it's not crunchy, it's

9:30

all squishy, and and it's not hard and

9:31

metallic. And I mean, maybe there's a

9:33

lot of metal in there depending on who

9:34

you are. Um, but the squishiness factor

9:36

makes it much more difficult because

9:38

when you're interacting with a, you

9:39

know, you take a probe and you put it

9:41

into a squishy thing and you kind of

9:42

move it around, it's elastic. And so how

9:45

it works and how it manip how it, you

9:47

know, interacts with the ob the the tool

9:49

itself really matters to the surgeon.

9:51

And so the simulation has to be

9:53

extremely high fidelity for that.

9:55

>> In the case of pick and place, if I were

9:56

to choose just a box,

9:57

>> it's actually relatively easy cuz I'm

9:59

just taking two rigid bodies. I'm

10:00

grabbing a box and so it's relatively

10:02

easier in that sense.

10:03

>> Sure. But the process for training it is

10:05

basically the same. And so as long as

10:07

the technology catches up, so once we

10:09

have um proper physics simulation for

10:11

these really elastic soft bodies or

10:13

cable or you know different types of

10:14

rope, like everything's kind of a thread

10:16

in a lot of ways. Um if you were to

10:18

catch up, then the process staying the

10:20

same, the technology basically gives you

10:22

unblocks, you know, new uh new skills,

10:24

new areas and domains that you can start

10:26

applying. And this is why the fidelity

10:28

of the simulation matters so much I

10:30

assume because you know the the more

10:31

fine the action the more accurately you

10:34

need to capture just so much different

10:36

things about the data right from the

10:38

>> yeah sim the the goal is that if

10:40

simulation can be as close to reality as

10:42

possible then we could likely turn you

10:45

leveraging agents we could automate this

10:48

data generation process. So imagine for

10:50

a minute that you do a demonstration in

10:52

the real world and I can put it into a

10:55

pipeline that takes that one

10:56

demonstration and then turns it into

10:58

thousands of different data, you know,

11:00

different data outputs. It could be

11:02

augmented data, it could be multiplied

11:04

data, it could be, you know, all sorts

11:05

of things.

11:06

>> Doing this basically turns your data

11:07

flywheel from very limited, one to one (I

11:11

capture a demonstration, that's what I

11:12

have), to one to many. And we want to get

11:14

the one to many. So the higher fidelity

11:16

the simulation, the more complex

11:17

problems that we can start simulating,

11:19

which means that we can generate more

11:20

and more data.
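
The one-to-many idea described here can be sketched in a few lines. This is only an illustration under stated assumptions, not NVIDIA's pipeline: the `Demonstration` type, the `augment` function, and the offset and noise values are all hypothetical, and a real system would replay each variant in a physics simulator and discard the ones that fail.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: (x, y, z) end-effector position in metres.
Waypoint = Tuple[float, float, float]


@dataclass(frozen=True)
class Demonstration:
    """One recorded trajectory, e.g. from a single teleoperated pick."""
    waypoints: List[Waypoint]


def augment(demo: Demonstration, n_variants: int, max_offset: float = 0.02,
            noise: float = 0.002, seed: int = 0) -> List[Demonstration]:
    """Turn one demonstration into n_variants perturbed copies.

    Each variant shifts the whole trajectory by one random offset (as if the
    object had been placed slightly differently) and adds a little Gaussian
    jitter to every waypoint. The values are assumptions chosen for the sketch.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        shift = tuple(rng.uniform(-max_offset, max_offset) for _ in range(3))
        waypoints = [
            tuple(c + s + rng.gauss(0.0, noise) for c, s in zip(wp, shift))
            for wp in demo.waypoints
        ]
        variants.append(Demonstration(waypoints))
    return variants


# One real demonstration in, many synthetic ones out.
seed_demo = Demonstration([(0.0, 0.0, 0.3), (0.1, 0.0, 0.1), (0.1, 0.0, 0.0)])
dataset = augment(seed_demo, n_variants=1000)
print(len(dataset), "synthetic demonstrations from 1 real one")
```

The higher the simulator's fidelity, the more of these variants can be trusted as training data, which is the point being made about fidelity and data volume.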

11:21

>> Got it. That makes a lot of sense. So

11:23

sorry I kicked you off track a little

11:24

bit. So we're talking about pick and

11:26

place as well as spinal surgery. Uh walk

11:28

us through the next step.

11:29

>> Sure. So the first is data, capturing

11:31

data, uh generating data, augmenting

11:33

data. The second is training your model.

11:35

Now there's typically a few models that

11:37

might go in place for a robot. It's not,

11:39

you know, we're headed towards end to

11:40

end, but today it's it's kind of a

11:42

mix and match. So you'll have a

11:44

perception stack, something that is

11:45

classifying objects and poses. I want to

11:48

if I want to grab something, I want to

11:50

know what pose it is. So I know, you

11:51

know, how do I angle my my hand in order

11:53

to grab it? And then more more

11:54

importantly, where do you want me and

11:56

how do you want me to put that object?

11:57

So how I grab the object is also

12:00

influenced by how I need to place the

12:01

object. if I need to grab this and then

12:03

flip it upside down, maybe it's easier

12:04

to grab it upside down for the robot and

12:06

flip it versus doing this and and so

12:08

there's ways that um that you have to

12:10

think about what the robot is doing in

12:12

terms of its uh you know how it's it's

12:14

generating the trajectory. So you you

12:15

train a model for some of this. Maybe

12:17

you'll you'll train a skill inside a

12:19

simulation and then you can put them

12:21

together in a robot stack which would

12:23

have perception meaning I know how to

12:25

navigate around an environment. I can

12:26

see the environment. I know how to

12:27

perceive, you know, what's around me and

12:29

and identify obstacles, things like

12:31

that. I have a policy that once I get my

12:33

body, so imagine that a a skills policy

12:37

for manipulation, it's basically how do

12:39

I get my hands to where they need to be?

12:41

And so one part of the stack today is

12:43

how do you just move the body? The other

12:44

part of the stack is what do you do with

12:46

the hands? And so once you get your your

12:48

hands to the location, then it's okay, I

12:50

want to start doing a task. And so you

12:52

want to validate both of these things

12:54

inside of the robot stack before you

12:56

deploy it on robot. And so you can do

12:57

this in simulation. We call it software

12:58

in the loop testing.

12:59

>> Okay?

13:00

>> And so software in the loop testing is

13:01

where you simulate the robot and the

13:03

world. And then after that passes, we go

13:05

to this thing called hardware in the

13:06

loop testing, which is, you know, one

13:07

step before real deployment. Hardware in

13:09

the loop testing is where you simulate

13:11

the world, but you use the real hardware

13:13

component. So we use that third computer

13:14

and we feed the simulated data from the

13:16

second computer, the omniverse computer,

13:18

simulation computer into the onboard

13:20

edge computer. Oh,

13:21

>> and so the robot doesn't actually even

13:22

know that it's not out in the real

13:24

world.

13:24

>> It thinks it's doing spine surgery. It's

13:26

placing.

13:27

>> Yeah. And so we're going to feel and

13:28

then after that you can go into

13:30

deploying in the real world. And so this

13:32

is that end to end process data training

13:34

evaluation and then validation

13:36

deployment.
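
The ordering described here (software in the loop first, then hardware in the loop, then the real world) amounts to a gated pipeline. The sketch below is a hypothetical illustration of that gating only; the `Stage` names and the `run_validation` function are assumptions for the example, not part of any NVIDIA tool.

```python
from enum import Enum
from typing import Callable, Dict, List


class Stage(Enum):
    SOFTWARE_IN_THE_LOOP = "simulated robot + simulated world"
    HARDWARE_IN_THE_LOOP = "real onboard computer + simulated world"
    REAL_WORLD = "real robot + real world"


# Stages must be passed in this order; a later stage never runs
# if an earlier one fails.
STAGE_ORDER = [
    Stage.SOFTWARE_IN_THE_LOOP,
    Stage.HARDWARE_IN_THE_LOOP,
    Stage.REAL_WORLD,
]


def run_validation(checks: Dict[Stage, Callable[[], bool]]) -> List[Stage]:
    """Run each stage's check in order and stop at the first failure.

    Returns the list of stages that passed.
    """
    passed = []
    for stage in STAGE_ORDER:
        if not checks[stage]():
            break
        passed.append(stage)
    return passed


# Example: the policy passes in pure simulation but fails once the real
# onboard computer is in the loop, so it is never deployed.
result = run_validation({
    Stage.SOFTWARE_IN_THE_LOOP: lambda: True,
    Stage.HARDWARE_IN_THE_LOOP: lambda: False,
    Stage.REAL_WORLD: lambda: True,
})
print([stage.name for stage in result])
```

In the hardware-in-the-loop stage the check would feed simulated sensor data into the real edge computer, so only the world is simulated.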

13:37

>> That's so interesting. So it's really a

13:38

matter of do you have like certain

13:41

specific buckets of skills that

13:43

determine what kind of tasks you can do?

13:45

like is it about building a skill

13:46

library to make a generalized robot or

13:48

is it like describe a little more about

13:50

how the capabilities for robots grow

13:52

over time?

13:53

>> Yeah, you you you um I think you you

13:55

framed it perfectly, the skills library.

13:57

So, we're going from specialist to

13:58

generalist. Think of a specialist as I

14:01

can do one very very specific thing very

14:03

very well.

14:03

>> Yeah. That I can do this millions of

14:07

times a day and I will not mess up. A

14:09

generalist needs to be robust to

14:11

changing environment circumstances, you

14:13

know, perturbations, things like that.

14:14

Yeah. And so if we capture all the data

14:17

from specialist, you can train a

14:18

generalist. And then the next step after

14:21

generalist is creating a generalist

14:23

specialist. So an equivalent would be a

14:25

child is a specialist in some ways,

14:27

right? They learn how to play with their

14:28

toys and they're really good at doing

14:30

some things, but they don't have enough

14:31

experience in the world to be able to

14:33

put all these skills together. Yeah.

14:45

>> I need to be post-trained for this. And so

14:47

then you go in and you get you get post

14:49

trained for that. And you get into the

14:50

real world. And once you're in the real

14:51

world, you still need to learn on the

14:53

job, right? And so a generalist is like

14:55

getting out of college. I'm I'm a a

14:57

fully functioning adult, but I'm not an

14:59

expert at anything. I'm just really good

15:01

at existing. I can I'm I can go into new

15:03

environments. I can learn new skills,

15:04

but I can learn. That's the important

15:06

part. I can learn a new skill. Right

15:08

now, we're at the point where we're

15:09

trying to train atomic skills. How to

15:10

grab things? Well, how are you

15:11

manipulating things? Very these are all

15:14

the same skills that a toddler or

15:15

three-year-old is trying to work on. And

15:17

then over time, you take these and you

15:18

start building them like Lego blocks

15:19

together. And so, the difference between

15:22

um you know, shaking a hand uh and you

15:24

know, maybe using a pool cue, somewhat

15:26

similar in in action. And so there's all

15:28

these different actions that kind of

15:30

combine to create these composite skills

15:32

over time. And so we're we're doing

15:34

exactly what you're what you're talking

15:35

about. Think of skills library as one,

15:38

you could train a policy for it or two,

15:39

you want to just capture and generate as

15:41

much data for that skill.

15:42

>> Sure.

15:42

>> And then over time we can put them

15:43

together into these end policies which

15:45

is, you know, large multi-skilled

15:47

policies.

15:47

>> Yeah, that that makes a ton of sense.

15:48

That's a that's a really exciting way to

15:50

think about it because a I think it

15:52

mirrors how like humans think about

15:53

their own learning process, their own

15:55

training process, and their own

15:56

validation process for lack of a better

15:58

word. Like how good am I at this skill,

15:59

right?

16:00

>> Um so validation is a question I'm

16:02

actually really interested in. Right? So

16:05

surgery versus pick and place. I need to

16:07

be much better with my hands to pass do

16:10

a successful surgery than to pick and

16:12

place a box. How do you know how do you

16:14

decide how good a robot is at a specific

16:17

skill? Sure. Not just that it can do it

16:19

but

16:19

>> yeah that's that's a good question. So

16:21

um one is is how do we evaluate? So for

16:23

instance we've released something uh

16:25

just recently called Isaac Lab Arena.

16:27

And so when you train a when you train a

16:29

skill in Isaac Lab which is our it's our

16:31

framework for robot learning. You want

16:33

to test that skill a b against a variety

16:35

of different environments right? So

16:37

imagine that um you know using

16:39

chopsticks.

16:40

>> Yeah that's a great

16:41

>> doesn't matter where you are. You're

16:42

going to use chop I could use chopsticks

16:43

to pick up pizza. I use it to you know

16:45

eat chips so I don't get my hands dirty.

16:46

Like you can use it's a skill. It's not

16:49

related to a specific food or a dish.

16:51

It's you can use it as a as a tool. And

16:52

so if you imagine I've taught something

16:54

to use a chopstick, but now I want to

16:56

have it do all sorts of objects, pick up

16:57

all sorts of different objects, you

16:59

know, like glass noodles and big pieces

17:01

of something and D. And so you want to

17:03

have all these environments laid out.

17:04

And so Isaac Lab Arena gives you the

17:06

ability to create all these different

17:07

environments and scenarios very easily

17:09

like Legos. And then you can create this

17:12

large library of scenarios to test

17:14

against. So that way as you're training

17:15

a policy you can see how it's performing

17:17

in not only one environment but in the

17:19

environments that matter to you.
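
Testing one skill against a library of scenarios can be sketched as a per-scenario success rate. The example below is a toy stand-in and does not use the Isaac Lab Arena API: the `Scenario` type, the `evaluate` function, and the one-line "rollout" (does the gripper open just wide enough?) are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical policy type: object width (m) in, gripper opening (m) out.
Policy = Callable[[float], float]


@dataclass
class Scenario:
    name: str
    min_width: float   # smallest object width in this scenario (m)
    max_width: float   # largest object width in this scenario (m)
    tolerance: float   # how much wider than the object the grip may be (m)


def evaluate(policy: Policy, scenarios: List[Scenario],
             episodes: int = 200, seed: int = 0) -> Dict[str, float]:
    """Return the success rate of one policy in every scenario."""
    rng = random.Random(seed)
    results = {}
    for sc in scenarios:
        successes = 0
        for _ in range(episodes):
            width = rng.uniform(sc.min_width, sc.max_width)
            opening = policy(width)
            # Toy success test standing in for a full simulated rollout.
            if width <= opening <= width + sc.tolerance:
                successes += 1
        results[sc.name] = successes / episodes
    return results


# A policy that always opens 1 cm wider than the object.
fixed_margin_policy: Policy = lambda width: width + 0.01

report = evaluate(fixed_margin_policy, [
    Scenario("rigid boxes", 0.10, 0.40, tolerance=0.05),
    Scenario("glass noodles", 0.001, 0.003, tolerance=0.002),
])
print(report)  # {'rigid boxes': 1.0, 'glass noodles': 0.0}
```

The same policy looks perfect in one scenario and useless in another, which is why a single-environment score says little about a skill.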

17:20

>> Yeah.

17:21

>> Um so that's that's one way. The second

17:23

is the hardware configuration really

17:25

matters.

17:26

>> Just because your policy can you know

17:28

you're likely able to train a policy

17:29

doesn't mean that the robot itself is

17:31

mechanically going to be able to do

17:33

whatever you need to do

17:34

>> and manipulate the chopsticks. what kind

17:35

of chopsticks

17:36

>> dexterity matters and that's why you

17:37

haven't seen many dextrous hands yet is

17:39

because a lot of the hands that we had

17:41

um they were lower than 22 degrees of

17:43

freedom um you know the human hand is is

17:45

pretty high up there you know even

17:47

between like you don't think about it

17:48

much but we use this palm area quite

17:50

often and most of the robots you see

17:52

they they grab and they they lose this

17:53

area in the palm they can't really use

17:55

it or rigid palm and and so

17:57

>> we're starting to see the mechatronics

17:59

become more advanced and start to mature

18:01

which means that the hardware can now

18:03

actually do what these policies should

18:05

be trained to do and so it's a mix of um

18:08

is your is your policy trained up enough

18:10

and you can say I don't have enough data

18:12

or I didn't train it correctly you know

18:13

things along those lines that's all

18:14

software bits and data bits on the other

18:16

side it's mechatronically do I have the

18:18

right hardware in order to do it and

18:19

that's why companies like Intuitive

18:20

Surgical, they have the hardware that can

18:22

do it and so now it becomes how do we

18:23

build policies around those hardware in

18:25

order to do certain things

18:26

>> and when you say policies you just mean

18:28

here are the not the rules but like the

18:30

general guardrails for how you should

18:32

manipulate or like

18:33

>> so a Um, a model like a perception model

18:37

uh would do things like classification

18:38

or pose estimation. So you feed in an

18:40

image and it goes okay this is what I'm

18:41

going to do. Um I'm just you you give it

18:43

an image and goes okay here's the pose

18:45

or you give an image here's

18:46

classification for a trained robot

18:48

policy. The reason why we call it a

18:50

policy is you know like what Pirates of

18:52

the Caribbean is like they're more like

18:54

guidelines.

18:54

>> The code is more what you'd call

18:56

guidelines than actual rules.

18:58

>> And so it's basically a set of

18:59

guidelines that say when you're in a

19:00

certain situation how would you react to

19:03

that situation? Right? So, I'm using

19:04

chopsticks and I've got this bowl full

19:06

of noodles. How am I going to approach

19:08

this versus if it's a bowl full of rice,

19:10

>> right? And so, that's the that's kind of

19:11

what it's it's trying to it's not a

19:13

black or white type of thing.
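
One common way to make "guidelines, not rules" concrete is a stochastic policy: for each situation it holds a probability distribution over actions rather than a single fixed answer. The sketch below assumes a softmax over hand-written preference scores; the situations, actions, and numbers are invented for illustration, and a trained robot policy would learn such preferences from data over continuous observations.

```python
import math
import random
from typing import Dict

# Hypothetical learned preferences: a higher score means a more
# preferred action in that situation.
PREFERENCES: Dict[str, Dict[str, float]] = {
    "bowl of noodles": {"twirl": 2.0, "pinch": 1.0, "scoop": 0.0},
    "bowl of rice": {"twirl": 0.0, "pinch": 0.5, "scoop": 2.0},
}


def action_probabilities(situation: str) -> Dict[str, float]:
    """Softmax over the preference scores for one situation."""
    scores = PREFERENCES[situation]
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {a: math.exp(s - top) for a, s in scores.items()}
    total = sum(exp.values())
    return {a: e / total for a, e in exp.items()}


def act(situation: str, rng: random.Random) -> str:
    """Sample an action; likely actions are chosen often, but none is forbidden."""
    probs = action_probabilities(situation)
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
for situation in PREFERENCES:
    print(situation, "->", act(situation, rng), action_probabilities(situation))
```

Every action keeps a non-zero probability, so the policy expresses a graded preference per situation rather than a black-or-white rule.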

19:14

>> Yeah. And in surgery, you might really

19:16

care about that policy because how it

19:17

reacts to maybe something unexpected

19:19

during the surgery really matters,

19:21

right? So,

19:21

>> and functional safety, um the safety

19:25

boundaries are different between tasks

19:27

and and environments. So there are areas

19:30

where you have to be extremely safe. And

19:32

then there's areas where, you know, we

19:33

think of autonomous vehicles. If you

19:35

want to test an autonomous vehicle, you

19:36

want to have an autonomy autonomous

19:38

vehicle work. Um the safest is just to

19:40

build safety directly into it because it

19:42

has to be around human drivers. There's

19:43

no avoiding it. Um if I want a robot to

19:45

be safe today, I just move you to a

19:47

different room,

19:48

>> right? And so there's a there's a little

19:50

bit of u

19:51

>> the circumstance for that task. If it's

19:53

in a surgical room, then it needs to be

19:55

around humans. And so it has to be

19:56

extremely safe. It has to go through

19:57

plenty of certifications. If I'm just

19:59

trying to do a material movement, I'm

20:01

okay with it not being as safe. I just

20:03

make sure the environment itself is

20:04

safe. And so you don't have to build it

20:05

into the robot itself yet.

20:06

>> But eventually we will. It's just that

20:08

safety comes after fun, you know, after

20:10

the skills capability. Otherwise, what

20:12

what is it safe? The safest is just turn

20:14

it off.

20:14

>> Yeah. Right. And and I think that makes

20:16

that's actually a really interesting

20:18

like optimization problem too, right?

20:19

Because the safer you can make the

20:21

robot, the closer you can bring other

20:22

robots to it and still have it in

20:24

unison, you know,

20:24

>> and the more that you can put into a

20:26

single environment,

20:27

>> right? If you if you have a robot that's

20:28

only slightly safe in in some regards,

20:31

like it's safe in a very specific

20:33

setting, then you're limited by how many

20:34

you can put into any specific

20:36

environment, but if they are generally

20:37

safe, which is what our goal is

20:39

>> in end to end, you know, we we look at

20:41

humanoids because they're the hardest

20:43

problem. A humanoid has the problem of

20:44

locomotion, has the problem of dexterity

20:46

and manipulation, perception,

20:48

navigation, memory, balance, like uh and

20:51

then on top of that, because it has legs

20:52

and the upper body, it's this thing

20:54

called whole body control. Meaning that

20:55

you if you're grabbing a box, you bend

20:58

down and you pick it up, right? Like if

21:00

you're listening to the doctor, bend

21:01

down, pick it up, use your hips, right?

21:03

And that's actually whole body control.

21:05

Most robots aren't going to know that

21:06

out. They don't have this ability. They

21:08

only know either how to manipulate their

21:10

wheels and they can move around or then

21:12

they're doing it. That's why you don't

21:13

see very many robots that are walking

21:15

and drinking at the same time, right?

21:16

That's loco-manipulation. And so

21:18

that's whole body control and we're

21:19

starting to get there. But humanoids

21:21

allow us to tackle these large problems.

21:23

That's why the ecosystem is is focusing

21:25

on this. If you can tackle the humanoid

21:26

problem, everything you build along the

21:28

way becomes plumbing and infrastructure

21:31

and tooling that you can then back

21:32

propagate into all of the industrial use

21:34

cases that are much more specialized or

21:36

narrowly scoped.

21:37

>> If you start with those, you actually

21:38

put yourself into a corner. And

21:40

that's why we're tackling this

21:41

large problem. So you talked about so

21:44

tackling that problem means skills are

21:46

getting better over time. They're

21:47

linking together into bigger and bigger

21:49

skill sets so you can do more and more

21:51

right. One thing I'm really interested

21:52

in so when we talk about large language

21:54

models there are plenty of different

21:56

benchmarks for specific skills: math,

21:58

science, literature. Are we going to see

22:00

equivalent robotic benchmarks?

22:02

>> Absolutely. The benchmarks that

22:04

you see today, and this is exactly why

22:07

we built Isaac Lab Arena. So Isaac Lab

22:08

Arena is built on top of Isaac Lab and

22:10

it's basically uh an interface and a

22:13

framework for being able to design the

22:15

environment, the scenario, and the task,

22:17

right? Um I want to test a grasping task

22:20

inside of an industrial environment and

22:22

the scenario is that there's boxes that

22:24

are coming down a chute or something,

22:25

right? And so these three things are

22:27

like Lego blocks. If you could, you

22:29

know, manipulate these and create a

22:30

bunch of different scenarios from all

22:32

these existing Lego blocks, makes life a

22:33

lot easier. And so inside of the or

22:35

currently in the ecosystem,

22:37

there's lots of benchmarks: LIBERO, robo

22:39

bench, BEHAVIOR from, you know, Stanford.

22:41

These are all benchmarks that are

22:42

academic and used for testing the policies

22:45

themselves from an academic perspective.

22:47

One of the things that we're going to

22:49

start seeing more often is these

22:50

industrial benchmarks. I don't want to

22:52

just pick up a banana and place it on a

22:55

plate or you know any of those which are

22:56

absolutely necessary for the

22:58

state-of-the-art and more frontier

23:00

testing. But once you get into

23:01

integration and I want to start using

23:03

this model to do things

23:04

>> I want to have my environment, my

23:06

scenario, my task. And this

23:09

is where we're starting to see the

23:11

ecosystem pick these

23:12

up and start building their tasks so that

23:14

way they can test these policies.

23:15

So you're going to see something very

23:16

very similar to math except maybe it'll

23:18

be micro assembly or maybe it'll be a

23:20

benchmark on, you know, picking and

23:22

placing from a bin full of random

23:24

components and you have to do it in a

23:25

certain order or you have to do some

23:27

type of assembly task. So, you're going

23:28

to start seeing all sorts of these and

23:30

then we'll have categories of them and

23:31

it'll be very, very similar and it'll be

23:32

a whole library.

23:33

>> I'm really excited for that because

23:34

that's going to be very visually

23:36

engaging and I'm sure there will be

23:37

competitions around that whole

23:38

ecosystem.

23:39

>> And the cool part is whatever you do in

23:40

sim to some degree, you're going to need

23:42

to have it in real. And so, not only are you

23:44

going to see these scenarios

23:46

showing up in simulation to test, but

23:48

we're going to see the real world

23:49

equivalents of that because you need to

23:50

close the loop. So, you hear this often

23:52

in robotics: close the loop.

23:53

>> Yeah. So you need to have a physical

23:54

space that you deploy these things

23:57

onto, these policies and these stacks

23:58

onto. You do the same task as you did in

24:00

simulation and you validate it in the

24:02

real world. And once you can validate in

24:03

the real world, we've now closed that

24:04

loop because now you've gone from uh

24:06

data and capture and all that all the

24:09

way to testing and validation. And now

24:11

this becomes that last bit of deployment

24:13

once we're like okay it works. What we

24:14

train does what we expected and now we

24:16

pull it. So you'll hear that quite often

24:18

and that's that's what makes physical AI

24:20

so hard is that we can't just leave the

24:23

validation in the data center. In LLMs,

24:25

you know, we have the privilege of

24:26

being able to test it in a data center

24:28

and leave it in a data center cuz its

24:29

whole life is going to be somewhat in a

24:31

data center and edge device

24:32

>> It's sim to sim almost, right?

24:34

>> Yeah, exactly, just like sim

24:36

to sim. It never really has to actually

24:37

step into the real world. And so I

24:39

think that's the major challenge, but

24:40

it's also the big fun of this field.

24:43

>> Sure. Does the loop ever go the other

24:45

way where it's like Hey, I have this

24:47

really difficult problem and I'm not

24:49

sure what kind of robot to even build to

24:51

solve it. So, I'm going to simulate that

24:52

problem first, see what kind of problem

24:54

like see the different solutions and

24:56

then build the robot that does the

24:58

solution the best or is

24:59

>> Absolutely. You're asking all the

25:00

fun questions too. So what we're

25:03

seeing in aerospace, for instance: they

25:05

can use simulation technologies and

25:08

these AI agents that are able to modify

25:10

the design of thrusters and then

25:12

simulate you know the so similar to that

25:15

where they say we're trying to hit a

25:16

certain output right they want to have

25:18

some type of output for for their

25:19

engines. And so the agent is going to

25:21

keep going and optimizing and changing

25:23

until it gets to

25:24

this metric. Something similar could

25:27

happen in robotics and this is something

25:28

that we're openly trying to

25:31

research. First and foremost is the

25:33

embodiment: the hand, for instance,

25:35

the morphology of the hand. Do you

25:37

need three fingers? Do you need five

25:38

fingers? Is two fingers enough? Um what

25:40

exactly do we need to get the task done?

25:42

And it's so hard right now because,

25:44

like I said, the manipulator space,

25:47

so like hands, the ecosystem is

25:49

just beginning, and we're

25:50

starting to see awesome hands coming to

25:51

market this year. And because of that, I

25:53

think what you're describing, being able

25:55

to look at the problem first and then

25:57

start evaluating, well, which robotic

25:58

components do I need in order to

26:00

accomplish this problem, it'll start

26:02

coming as, you know, the

26:04

hardware matures because then you have

26:05

something that you can actually base it

26:07

against. You actually

26:08

don't want to build a new robot all the

26:10

time, similar to a car, right? You

26:12

actually like having tier-one

26:13

providers because it allows you to have

26:15

some consistency in your components.

26:17

Otherwise, if you have to manufacture

26:19

all of your own actuators and all of

26:21

your own internal components, it becomes

26:23

a huge drain on operational

26:24

resources.

26:25

>> That makes a lot of sense. So, even if

26:26

there is a hand that's better for a

26:28

specific task, you might want to default

26:30

to a generalized hand because then you

26:32

can just do a lot more with it besides

26:34

that one task.

26:35

>> And so, it depends on the mix of

26:36

tasks. So, with an end-to-end robot, the

26:39

promise is that I could go to one work

26:42

cell here and then go to another work

26:43

cell here, do totally different tasks

26:44

and use the same robot.

26:45

>> Yeah.

26:46

>> Right. That's the goal because that's

26:47

what a human can do. And so if we're

26:49

overly specialized, then we're still

26:50

stuck in specialists. So a generalist

26:51

allows you to go, you know, between the

26:53

field, and that's where we want to

26:54

get to. And so understanding your mix of

26:56

tasks and the environment helps

26:58

inform us on the hardware. Yeah, I

27:01

think it'll go the direction that you're

27:02

describing, which would be super awesome

27:04

because there's going to be things like

27:05

assembly where okay, what would be the

27:07

best way to assemble this? Um and how do

27:09

you work backwards from there? I think

27:10

that could definitely be a path

27:11

that we look forward to. Spencer, you've

27:13

clearly seen this whole industry evolve

27:15

very rapidly over the last couple years.

27:18

Is there something you're most excited

27:19

about or looking forward to? Like what's

27:21

the next thing that you're super pumped

27:23

about in

27:23

>> I am super excited for neural

27:25

simulation. You'll hear this a lot.

27:27

Cosmos is a world model, and so it's a

27:29

neural simulator. It's been trained on

27:31

the dynamics of the world around it.

27:33

These world models today are

27:36

improving and they're starting to show

27:39

extreme utility as part of our

27:42

policies. So when you look at Alpamayo

27:43

for autonomous vehicles that Jensen was

27:45

talking about, we can use these

27:47

reasoning models, these world models,

27:48

to actually be on board the car to

27:50

actually help us navigate through the

27:51

world. So the same thing could happen

27:52

for robots. You know, robotics

27:56

is the oldest and newest industry in

27:58

the world in a lot of ways. The

27:59

end-to-end autonomy models are things where

28:02

the journey there is going to take us

28:04

quite a while and so we're going to see

28:06

quite a few evolutions of model

28:07

architectures between here and there.

28:08

Yeah.

28:09

>> And so starting with VLMs made a lot of

28:10

sense. Let's give robots the ability to

28:12

semantically understand their world and

28:14

reason about the world. The next step is

28:15

how do we make sure that the world

28:17

models are actually trained and

28:18

conditioned off of the types of inputs

28:20

that a robot would have. Meaning that as

28:21

a human I have perceptive input and

28:23

non-visual perceptive input. I have

28:25

contacts and I have action and all

28:26

these. So it's not trained on language.

28:28

We're not trained on language. We're

28:29

trained on all five senses. And so

28:31

language, visual, we need all the other

28:33

bits. And so as we start getting world

28:35

models that are able to either be

28:37

conditioned off of these or output

28:38

these, we're going to start seeing an

28:40

influx of totally new models and

28:42

capabilities. And so I'm super

28:43

excited for neural simulation for that.

28:45

Excited for neural simulation for data

28:47

generation and policy evaluation. It's

28:49

definitely a game changer for us.

28:51

I'm very excited.

28:52

>> What an exciting time to be alive.

28:53

>> Cosmos is an awesome

28:55

technology. I'm super excited. If you

28:56

guys haven't learned about it, you guys

28:57

go read about it. It's awesome.

28:58

>> I'm going to do just that. Thanks so

29:00

much for your time.

29:00

>> Wonderful meeting you, Alex. Thank you

29:01

for having me.

29:02

>> A huge thank you to Spencer Hang for

29:04

walking us through Nvidia's robotics

29:06

ecosystem, their three-computer approach

29:08

to training, simulation, and inference,

29:10

and the biggest opportunities and

29:12

challenges in physical AI today. And if

29:15

you really want to understand robotics,

29:17

join me for NVIDIA GTC. You can register

29:20

for free with my links below and jump

29:22

into as many online sessions on robotics

29:25

and AI as you like. I'll announce the

29:27

winner of the RTX 5090 giveaway a few

29:30

days after the conference. So, make sure

29:32

to enter. Another huge thank you to

29:34

Nvidia for sponsoring my travel and my

29:36

media access to cover GTC live and to

29:39

you for supporting the channel. Thanks

29:41

for watching and until next time, this

29:43

is Ticker Symbol: YOU. My name is Alex,

29:46

reminding you that the best investment

29:48

you can make is in you.

Interactive Summary

This video features an interview with Spencer Hang, product lead for robotic software at NVIDIA, discussing the future of physical AI. Hang explains NVIDIA's 'three-computer' stack approach: a computer for training models, one for simulating environments (Omniverse), and one for real-world deployment (Jetson/IGX/AGX). The conversation covers the challenges of robotics data, the importance of simulation for training and validation, and the transition from specialized robots to generalized ones that can learn new skills, mirroring human development.
