E23: I Spoke To The Man Building The Robotic Future.
I'm excited to share this exclusive
robotics interview with the investing
community. Most people still think
Nvidia only powers AI in data centers,
but you're about to get an inside look
at how they're putting that same stack
into robots and physical AI. I'm joined
by Spencer Huang, product lead for
robotic software at NVIDIA. Spencer
spent the last four years scaling some
of the most advanced robotic systems on
the planet. And he had some surprising
things to say about where physical AI is
headed next and how soon it could
happen. But that's just one of the many
technologies I'll be covering live at
GTC next week. GTC is Nvidia's massive
AI conference, showcasing the biggest
breakthroughs in everything from
robotics and self-driving cars to AI
agents and the chips that power them.
They have tons of sessions on robotics
with speakers from Nvidia, Agility, and
even Tesla. And anyone who signs up for
a free online session at GTC with my
link can enter to win an Nvidia RTX 5090
graphics card. Just attend any session,
take a screenshot as proof, and send it
to me after the conference using the
links below. GTC should be on every
investor's radar and so should Nvidia's
ecosystem for physical AI because the
next ChatGPT moment won't be on your
screen. It'll be robots bringing AI into
the real world. Your time is valuable,
so let's get right into it. When I think
about the whole robotics industry, I
naively think about just building
humanoid robots, but there's obviously a
lot more to that. And you know, Jensen
announced a lot of really interesting
things about robotics during the keynote
earlier this week. So, can you walk us
through Nvidia's approach to robotics at
a high level?
>> Sure. So, you've heard Jensen
talk about the three-computer solution.
And for robotics, the way that we think
of it is you need a computer
that allows you to train the brain,
>> right? So, this is something like a DGX.
You're training your
vision-language-action model. You're
training up your base model, like a
VLM. Basically any model that
you're going to be using for cognition
is likely going to be trained on that
DGX. Yeah. But you still need to be able
to put that into a stack and test it
inside of a simulation or more
importantly for for humanoids and these
more autonomous skills. You want to be
able to train a skill in simulation. So
we need a computer to simulate the
world. And so you have a computer that's
that's there to train the brain, a
computer that simulates the real world.
And then we have a third computer which
is actually deployed in the real world.
And that would be IGX and AGX or Jetson.
And that gives you everything from the
brain to the body uh to the physical
apparatus inside of the world. And so
that's our three computer stack.
>> So the first one is really about
training the AI model, right? The second
one is really about making sure that
model gets as much practice as it can in
a digital world before it goes into a
physical world.
>> Yeah. And that simulated world could be
used not only for training but for
evaluation. So when you train a model uh
you know an LLM or these typical uh ML
models um you train it and then you
evaluate it. And with a
policy like a robotic policy, it's not
quite the same because you have to
interact with an environment and the
environment has to react back, right? I
poke something, it has to do something.
Uh, and so you need a simulated
environment. It's not necessarily just
uh am I classifying it as a cat
correctly. And so because of that, you
need to test it inside of something
that's a proxy to the real world. Now,
because we already have that proxy for
the real world called Omniverse, we can
also generate tons of synthetic data.
And so synthetic data compensates for
the lack of real data. Here's what we mean
when we say that physical AI doesn't
have lots of real data. LLMs
started with the compendium of knowledge
that humans have written over the last
couple of centuries. We've spent most of our
lives trying to make sure that we can
instill our knowledge for the next
generation. And so it was already there
for us to start combing through and
turning into these language models
that, you know, eventually turned into
ChatGPT. For physical AI, we don't have
the same information for contact data.
We haven't captured
what it is like when you take a rigid
body, like a finger, like a bone, or
a metal hand, and interact with
something very, very soft. That
interaction data doesn't exist.
>> and that implies that sorry not to
interrupt you but that implies that um
the video data out there isn't enough
like
>> It's not. Yeah. So what video data gives
us, and the reason why reasoning is so
important this year: if
you think of it, the video models were
trained to understand semantics. How
does the world work? How do things
relate to each other? When I
think of if I ask you to build a
kitchen, you're not going to put, you
know, a 4x4 inside of the kitchen.
You're not going to put uh a a chair on
top of the table. You're not putting a
cutting board on the floor. You know,
semantically, there's places where these
objects are supposed to live in relation
to the environment that they're in. So,
what the video model gives robotics
is the ability to have this cognitive
reasoning. It gives you semantic
reasoning. It gives you the ability to
understand and interpret the world. But
what it doesn't do is tell you how the
world is going to interact when you
start interacting with it.
>> That physical data is where the gap is.
That's that gap. And that's why we call
it physical AI is when you're trying to
start interacting with the world, those
reactions also matter. And so we need to
have lots and lots of data of how to
interact with the world. Otherwise, you
might grab an egg at the same strength
you would grab a baseball because
otherwise you have no
interpretation. They're both balls,
right? They're both spherical objects.
But the materials themselves change. And
so we need that. That's why we need to
use simulation like Omniverse or
even Cosmos as a world model.
>> How do you determine when
simulated data is good enough?
>> Sure. That's kind of the
million-dollar question. So simulated
data, synthetic data in general, is
more of an art than a science. And the
reason I say that is because, going back
to LLMs, when we have a corpus of data,
we could do things like data
engineering, feature extraction. These
were because we had real data. We
understood okay once we have a data set
how can we start uh analyzing this data
set and saying well there's certain
characteristics and you know we call
them features, that don't necessarily
have any perceptible impact on the
model. If that's the case, then why
include it in the training data?
And so we were able to actually start
engineering the data and creating a
good corpus of training data that
results in a really well-trained model.
For physical AI we're lacking that. You
have lots of physical data that we can
collect but even real data we're not
sure what is good data or bad data. So
for instance, all of last year, you saw
lots of people doing teleoperation to
open drawers and, you know, handle
different objects inside of like kitchen
environments or various industrial
environments. And what they were trying
to figure out is one, how do I capture a
human demonstration that's clean enough
so that way I can train a policy? When I
say clean, I mean uh humans are
imperfect. And when we grab something,
if it's a demonstration, you want to do
it perfectly because you don't want the robot
to train off of bad
demonstrations. You want good
demonstrations for the most part. And so
what is a good demonstration also kind
of compounds once you get to synthetic
data because a good demonstration may
visually look good but it might actually
not improve the model itself because it
might be looking for different features
that aren't necessarily included in the
dimensions of that data. And so there's
a lot of um open questions on what types
of data dimensions do we need inside of
this data and what types of modalities
do we need? Video, visual, contact,
action. And so we capture all sorts of
data that's not just text or video
visual anymore. It's action data, the
motions itself, it's contact data. When
I grab something, what exactly? And
so these are all things that that we're
learning right now.
>> That's exciting.
>> Um I'd love to work through a specific
example just to understand the end-to-end
chain.
>> Sure.
>> So there's an awesome robot right down
the hall from us that's doing spinal
surgery. So, can we start with the AI
model training, walk through that piece,
then walk through how it would work in
simulation,
>> and then walk through what that would
mean for the physical robot in the final
account.
>> Sure. Spinal surgery is a
challenging one. It's a good one to
choose. The reason is because it's
rigid-soft body.
>> Okay.
>> And so, what I'm going to describe is
not fully available today. It's
just going to describe the path that
we're on and the journey that we're
going to take.
>> Are those paths different depending on
the problem set? Sorry. I'm like, so
surgery versus industrial versus
>> warehousing, you know what I mean? This
is all very different.
>> It is. It is different, but don't
imagine it as the verticals. Imagine it
as the physical problems.
Okay? So, for instance, I could be in an
industrial warehouse and I could be
picking up just boxes. And so, as long
as they're rigid
boxes and, you know, they're a
known shape, known materials,
these are things that we could handle
today. If you want to do cable
management or wiring inside of a car,
that becomes much more difficult. But
that cable management and wiring is very
similar to threading needles and
sutures for healthcare. Right? So if you
think of it in terms of the physical
characteristics of the problem, then the
verticals don't matter as much. And so
what we're trying to do is figure out
between each vertical where are the
overlaps between the near-term problems.
So that way we can meet as many of the
customer needs as possible and because
we have to solve for it eventually. It's
just where do you start taking chunks?
And so, you know, to go end to end, we
could say, let's start with
surgery. And I'll give you
surgery, but I'm going to pair it
right next to pick and place. Okay,
they're very very similar in a sense.
Um, because you're using some apparatus
to do some type of manipulation. So, the
first thing that you would do is I need
to understand what the task is. If it's
suturing or if it's picking up a box,
you're either going to be training a
policy in simulation where you're
fairly certain that you can train this
just, you know, off of behavior cloning,
where I capture demonstrations. You
know, I put sensors on a
human. They teleop the robot a couple of
times and then we can generate more of
those actions and then train a policy
from that. So, we might do that for
either one of these. So, if you're
working with the inner body, for
instance, it's all squishy things.
>> You know, it's not crunchy, it's
all squishy, and it's not hard and
metallic. And I mean, maybe there's a
lot of metal in there depending on who
you are. Um, but the squishiness factor
makes it much more difficult because
when you're interacting with a, you
know, you take a probe and you put it
into a squishy thing and you kind of
move it around, it's elastic. And so how
it works and how it, you
know, interacts with the tool
itself really matters to the surgeon.
And so the simulation has to be
extremely high fidelity for that.
>> In the case of pick and place, if I were
to choose just a box,
>> it's actually relatively easy cuz I'm
just taking two rigid bodies. I'm
grabbing a box and so it's relatively
easier in that sense.
>> Sure. But the process for training it is
basically the same. And so as long as
the technology catches up, so once we
have um proper physics simulation for
these really elastic soft bodies or
cable or you know different types of
rope, like everything's kind of a thread
in a lot of ways. Um if you were to
catch up, then the process staying the
same, the technology basically
unblocks, you know, new skills,
new areas and domains that you can start
applying.
>> And this is why the fidelity
of the simulation matters so much, I
assume, because, you know, the more
fine the action, the more accurately you
need to capture just so many different
things about the data, right, from the
>> Yeah, the goal is that if
simulation can be as close to reality as
possible, then, leveraging agents,
we could likely automate this
data generation process. So imagine for
a minute that you do a demonstration in
the real world and I can put it into a
pipeline that takes that one
demonstration and then turns it into
thousands of different data, you know,
different data outputs. It could be
augmented data, it could be multiplied
data, it could be, you know, all sorts
of things.
>> Doing this basically turns your data
flywheel from very limited, one-to-one (I
capture a demonstration, that's what I
have), to one-to-many. And we want to get
to one-to-many. So the higher fidelity
the simulation, the more complex
problems that we can start simulating,
which means that we can generate more
and more data.
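[Editor's sketch, not from the interview.] The one-to-many idea described above can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: a demonstration is reduced to a short list of (x, y, z) waypoints, and the only "augmentation" is a small random shift per variant. The names `Demo` and `augment` and the `max_shift` parameter are hypothetical, and a real pipeline would rely on a physics simulator rather than random offsets.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    """One recorded trajectory: a tuple of (x, y, z) waypoints in meters."""
    waypoints: tuple


def augment(demo, n_variants, max_shift=0.05, seed=0):
    """Turn one demonstration into n_variants shifted copies (toy scheme)."""
    rng = random.Random(seed)  # seeded so the generated set is reproducible
    variants = []
    for _ in range(n_variants):
        # One random offset per variant, as if the object sat in a
        # slightly different place each time.
        dx, dy, dz = (rng.uniform(-max_shift, max_shift) for _ in range(3))
        shifted = tuple((x + dx, y + dy, z + dz) for x, y, z in demo.waypoints)
        variants.append(Demo(shifted))
    return variants


# One captured demonstration in, a thousand training samples out.
seed_demo = Demo(((0.0, 0.0, 0.2), (0.1, 0.0, 0.1), (0.1, 0.0, 0.0)))
dataset = augment(seed_demo, n_variants=1000)
print(len(dataset))
```

The point is only the shape of the flywheel: each captured demonstration becomes the seed for many generated ones.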
>> Got it. That makes a lot of sense. So
sorry I kicked you off track a little
bit. So we're talking about pick and
place as well as spinal surgery. Uh walk
us through the next step.
>> Sure. So the first is data, capturing
data, uh generating data, augmenting
data. The second is training your model.
Now there's typically a few models that
might go in place for a robot. It's not,
you know, we're headed towards
end-to-end, but today it's kind of a
mix and match. So you'll have a
perception stack, something that is
classifying objects and poses.
If I want to grab something, I want to
know what pose it's in, so I know, you
know, how do I angle my hand in order
to grab it? And then, more
importantly, where do you want me and
how do you want me to put that object?
So how I grab the object is also
influenced by how I need to place the
object. if I need to grab this and then
flip it upside down, maybe it's easier
to grab it upside down for the robot and
flip it versus doing this. And so
there are ways that you have to
think about what the robot is doing in
terms of, you know, how it's
generating the trajectory. So you
train a model for some of this. Maybe
you'll train a skill inside a
simulation and then you can put them
together in a robot stack which would
have perception meaning I know how to
navigate around an environment. I can
see the environment. I know how to
perceive, you know, what's around me and
and identify obstacles, things like
that. I have a policy that once I get my
body, so imagine that a skills policy
for manipulation, it's basically how do
I get my hands to where they need to be?
And so one part of the stack today is
how do you just move the body? The other
part of the stack is what do you do with
the hands? And so once you get your
hands to the location, then it's okay, I
want to start doing a task. And so you
want to validate both of these things
inside of the robot stack before you
deploy it on the robot. And so you can do
this in simulation. We call it
software-in-the-loop testing.
>> Okay?
>> And so software-in-the-loop testing is
where you simulate the robot and the
world. And then after that passes, we go
to this thing called hardware-in-the-loop
testing, which is, you know, one
step before real deployment.
Hardware-in-the-loop testing is where you simulate
the world, but you use the real hardware
component. So we use that third computer
and we feed the simulated data from the
second computer, the Omniverse computer,
simulation computer into the onboard
edge computer. Oh,
>> and so the robot doesn't actually even
know that it's not out in the real
world.
>> It thinks it's doing spine surgery. It's
placing.
>> Yeah. And so we're going to feel and
then after that you can go into
deploying in the real world. And so this
is that end-to-end process: data, training,
evaluation, and then validation and
deployment.
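[Editor's sketch, not from the interview.] The gating described above, software-in-the-loop first, then hardware-in-the-loop, then the real world, can be written as a tiny Python sketch. The `Stage` enum, the `last_stage_passed` helper, and the pass/fail flags are hypothetical names used only for illustration; this assumes each stage simply passes or fails.

```python
from enum import Enum


class Stage(Enum):
    SOFTWARE_IN_THE_LOOP = "simulated robot, simulated world"
    HARDWARE_IN_THE_LOOP = "real onboard computer, simulated world"
    REAL_WORLD = "real robot, real world"


def last_stage_passed(results):
    """Return the last stage passed, stopping at the first failure.

    `results` maps each Stage to True or False. A missing stage counts
    as not passed. Returns None if the first stage did not pass.
    """
    passed = None
    for stage in Stage:  # Enum members iterate in definition order
        if not results.get(stage, False):
            break
        passed = stage
    return passed


# A stack that passes in pure simulation but fails on the real computer
# never reaches real-world deployment.
outcome = last_stage_passed({
    Stage.SOFTWARE_IN_THE_LOOP: True,
    Stage.HARDWARE_IN_THE_LOOP: False,
    Stage.REAL_WORLD: True,
})
print(outcome)
```

A later stage is never credited if an earlier one failed, which matches the order Spencer describes.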
>> That's so interesting. So it's really a
matter of do you have like certain
specific buckets of skills that
determine what kind of tasks you can do?
like is it about building a skill
library to make a generalized robot or
is it like describe a little more about
how the capabilities for robots grow
over time?
>> Yeah, I think you
framed it perfectly: the skills library.
So, we're going from specialist to
generalist. Think of a specialist as I
can do one very very specific thing very
very well.
>> Yeah. That I can do this millions of
times a day and I will not mess up. A
generalist needs to be robust to
changing environment circumstances, you
know, perturbations, things like that.
Yeah. And so if we capture all the data
from specialist, you can train a
generalist. And then the next step after
generalist is creating a generalist
specialist. So an equivalent would be a
child is a specialist in some ways,
right? They learn how to play with their
toys and they're really good at doing
some things, but they don't have enough
experience in the world to be able to
put all these skills together. Yeah.
>> I need to be post-trained for this. And so
then you go in and you get
post-trained for that. And you get into the
real world. And once you're in the real
world, you still need to learn on the
job, right? And so a generalist is like
getting out of college. I'm a
fully functioning adult, but I'm not an
expert at anything. I'm just really good
at existing. I can go into new
environments. I can learn new skills,
but I can learn. That's the important
part. I can learn a new skill. Right
now, we're at the point where we're
trying to train atomic skills. How do you
grab things well? How are you
manipulating things? These are all
the same skills that a toddler or
three-year-old is trying to work on. And
then over time, you take these and you
start building them like Lego blocks
together. And so, the difference between,
you know, shaking a hand and, you
know, maybe using a pool cue: somewhat
similar in action. And so there's all
these different actions that kind of
combine to create these composite skills
over time. And so we're doing
exactly what you're talking
about. Think of skills library as one,
you could train a policy for it or two,
you want to just capture and generate as
much data for that skill.
>> Sure.
>> And then over time we can put them
together into these end-to-end policies, which
are, you know, large multi-skill
policies.
>> Yeah, that makes a ton of sense.
That's a really exciting way to
think about it because I think it
mirrors how like humans think about
their own learning process, their own
training process, and their own
validation process for lack of a better
word. Like how good am I at this skill,
right?
>> Um so validation is a question I'm
actually really interested in. Right? So
surgery versus pick and place. I need to
be much better with my hands to do
a successful surgery than to pick and
place a box. How do you
decide how good a robot is at a specific
skill? Sure. Not just that it can do it
but
>> Yeah, that's a good question. So,
one is how do we evaluate? So for
instance, we've released something
just recently called Isaac Lab Arena.
And so when you train a
skill in Isaac Lab, which is our
framework for robot learning, you want
to test that skill against a variety
of different environments right? So
imagine that um you know using
chopsticks.
>> Yeah that's a great
>> Doesn't matter where you are.
I could use chopsticks
to pick up pizza. I use them to, you know,
eat chips so I don't get my hands dirty.
It's a skill. It's not
related to a specific food or a dish.
You can use it as a tool. And
so if you imagine I've taught something
to use a chopstick, but now I want to
have it do all sorts of objects, pick up
all sorts of different objects, you
know, like glass noodles and big pieces
of something and D. And so you want to
have all these environments laid out.
And so Isaac Lab Arena gives you the
ability to create all these different
environments and scenarios very easily
like Legos. And then you can create this
large library of scenarios to test
against. So that way as you're training
a policy you can see how it's performing
in not only one environment but in the
environments that matter to you.
>> Yeah.
>> So that's one way. The second
is the hardware configuration really
matters.
>> Just because, you know,
you're likely able to train a policy
doesn't mean that the robot itself is
mechanically going to be able to do
whatever you need to do
>> and manipulate the chopsticks. what kind
of chopsticks
>> Dexterity matters, and the reason you
haven't seen many dexterous hands yet is
because a lot of the hands that we had
were lower than 22 degrees of
freedom. You know, the human hand is
pretty high up there. You
don't think about it
much, but we use this palm area quite
often, and most of the robots you see,
they grab and they lose this
area in the palm. They can't really use
it, or it's a rigid palm. And so
>> we're starting to see the mechatronics
become more advanced and start to mature
which means that the hardware can now
actually do what these policies should
be trained to do. And so it's a mix of:
is your policy trained up enough?
And you can say, I don't have enough data,
or I didn't train it correctly, you know,
things along those lines. That's all
software bits and data bits. On the other
side, it's: mechatronically, do I have the
right hardware in order to do it? And
that's why companies like Intuitive
Surgical, they have the hardware that can
do it, and so now it becomes: how do we
build policies around that hardware in
order to do certain things?
>> and when you say policies you just mean
here are the not the rules but like the
general guardrails for how you should
manipulate or like
>> So a model like a perception model
would do things like classification
or pose estimation. So you feed in an
image and it goes, okay, this is what I'm
going to do. You give it
an image and it goes, okay, here's the pose,
or you give it an image and here's the
classification. For a trained robot
policy, the reason why we call it a
policy is, you know, like in Pirates of
the Caribbean: they're more like
guidelines.
>> The code is more what you'd call
guidelines than actual rules.
>> And so it's basically a set of
guidelines that say when you're in a
certain situation how would you react to
that situation? Right? So, I'm using
chopsticks and I've got this bowl full
of noodles. How am I going to approach
this versus if it's a bowl full of rice,
>> Right? And so, that's kind of
what it's trying to do. It's not a
black or white type of thing.
>> Yeah. And in surgery, you might really
care about that policy because how it
reacts to maybe something unexpected
during the surgery really matters,
right? So,
>> and functional safety, um the safety
boundaries are different between tasks
and environments. So there are areas
where you have to be extremely safe. And
then there's areas where, you know, we
think of autonomous vehicles. If you
want to test an autonomous vehicle, you
want to have an autonomous
vehicle work. The safest is just to
build safety directly into it because it
has to be around human drivers. There's
no avoiding it. Um if I want a robot to
be safe today, I just move you to a
different room,
>> right? And so there's a there's a little
bit of u
>> the circumstance for that task. If it's
in a surgical room, then it needs to be
around humans. And so it has to be
extremely safe. It has to go through
plenty of certifications. If I'm just
trying to do a material movement, I'm
okay with it not being as safe. I just
make sure the environment itself is
safe. And so you don't have to build it
into the robot itself yet.
>> But eventually we will. It's just that
safety comes after function, you know, after
the skills capability. Otherwise,
what is safe? The safest is just to turn
it off.
>> Yeah. Right. And I think
that's actually a really interesting
optimization problem too, right?
Because the safer you can make the
robot, the closer you can bring other
robots to it and still have it in
unison, you know,
>> and the more that you can put into a
single environment,
>> Right? If you have a robot that's
only slightly safe in in some regards,
like it's safe in a very specific
setting, then you're limited by how many
you can put into any specific
environment, but if they are generally
safe, which is what our goal is
>> in end-to-end. You know, we look at
humanoids because they're the hardest
problem. A humanoid has the problem of
locomotion, has the problem of dexterity
and manipulation, perception,
navigation, memory, balance. And
then on top of that, because it has legs
and the upper body, there's this thing
called whole-body control, meaning that
if you're grabbing a box, you bend
down and you pick it up, right? Like if
you're listening to the doctor, bend
down, pick it up, use your hips, right?
And that's actually whole body control.
Most robots aren't going to know that
out. They don't have this ability. They
only know either how to manipulate their
wheels and they can move around or then
they're doing it. That's why you don't
see very many robots that are walking
and drinking at the same time, right?
That's loco-manipulation. And so
that's whole body control and we're
starting to get there. But humanoids
allow us to tackle these large problems.
That's why the ecosystem is focusing
on this. If you can tackle the humanoid
problem, everything you build along the
way becomes plumbing and infrastructure
and tooling that you can then back
propagate into all of the industrial use
cases that are much more specialized or
narrowly scoped.
>> If you start with those, you actually
put yourself into a corner. And
that's why we're tackling this
large problem.
>> So
tackling that problem means skills are
getting better over time. They're
linking together into bigger and bigger
skill sets so you can do more and more
right. One thing I'm really interested
in so when we talk about large language
models there are plenty of different
benchmarks for specific skills math
science literature are we going to see
equivalent robotic benchmarks?
>> Absolutely. The benchmarks that
you see today, and this is exactly why
we built Isaac Lab Arena. So Isaac Lab
Arena is built on top of Isaac Lab and
it's basically an interface and a
framework for being able to design the
environment, the scenario, and the task,
right? Um I want to test a grasping task
inside of an industrial environment and
the scenario is that there's boxes that
are coming down a chute or something,
right? And so these three things are
like Lego blocks. If you could, you
know, manipulate these and create a
bunch of different scenarios from all
these existing Lego blocks, makes life a
lot easier. And so
currently in the ecosystem
there are lots of benchmarks: LIBERO, robo
bench, BEHAVIOR from, you know, Stanford.
These are all benchmarks that are
academic and used for testing the policy
itself from an academic perspective.
One of the things that we're going to
start seeing more often is these
industrial benchmarks. I don't want to
just pick up a banana and place it on a
plate or you know any of those which are
absolutely necessary for the
state-of-the-art and and more frontier
testing. But once you get into
integration and I want to start using
this model to do things
>> I want to have my environment, my
scenario, my task. And this
is where we're starting to see
the ecosystem pick these
up and start building their tasks so that
they can test these policies.
So you're going to see something very
very similar to math except maybe it'll
be micro assembly or maybe it'll be a
benchmark on um you know picking and
placing from a bin full of random
components and you have to do it in a
certain order or you have to do some
type of assembly task. So, you're going
to start seeing all sorts of these and
then we'll have categories of them and
it'll be very, very similar and it'll be
a whole library.
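[Editor's sketch, not from the interview.] The "Lego blocks" idea of combining an environment, a scenario, and a task into a library of test cases can be illustrated in plain Python. This does not use or represent the Isaac Lab Arena API; the `build_suite` function and all of the example values are hypothetical and only show how a few building blocks multiply into many benchmark cases.

```python
from itertools import product

# Three independent "Lego block" lists (illustrative values only).
environments = ["kitchen", "warehouse"]
scenarios = ["boxes arriving on a chute", "bin of random components"]
tasks = ["grasp", "pick and place", "assemble"]


def build_suite(environments, scenarios, tasks):
    """Combine every environment, scenario, and task into one test case each."""
    return [
        {"environment": env, "scenario": scenario, "task": task}
        for env, scenario, task in product(environments, scenarios, tasks)
    ]


suite = build_suite(environments, scenarios, tasks)
print(len(suite))  # 2 environments x 2 scenarios x 3 tasks = 12 cases
```

Adding one new environment to the list adds a whole new row of cases without writing any of them by hand, which is the appeal of composing benchmarks this way.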
>> I'm really excited for that because
that's going to be very visually
engaging and I'm sure there will be
competitions around that whole
ecosystem.
>> And the cool part is whatever you do in
sim to some degree, you're going to need
to have it in real. And so, not only are you
going to see these scenarios
showing up in simulation to test, but
we're going to see the real-world
equivalents of that because you need to
close the loop. So, you hear this often
in robotics: close the loop.
>> Yeah. So you need to have a physical
space that you deploy these things,
these policies and these stacks,
onto. You do the same task as you did in
simulation and you validate it in the
real world. And once you can validate in
the real world, we've now closed that
loop because now you've gone from uh
data and capture and all that all the
way to testing and validation. And now
this becomes that last bit of deployment
once we're like okay it works. What we
train does what we expected and now we
pull it. So you'll hear that quite often
and what makes physical AI
so hard is that we can't just leave the
validation in the data center. With LLMs,
you know, we have the privilege of
being able to test it in a data center
and leave it in a data center cuz its
whole life is going to be somewhat in a
data center and edge device
>> It's sim-to-sim almost, right?
>> Yeah, exactly, just like sim-to-sim.
It never really has to actually
step into the real world. And so
I think that's the major challenge, but
it's also the big fun of this field.
>> Sure. Does the loop ever go the other
way where it's like Hey, I have this
really difficult problem and I'm not
sure what kind of robot to even build to
solve it. So, I'm going to simulate that
problem first, see
the different solutions, and
then build the robot that does the
solution the best, or is
>> Absolutely. You're asking all the
fun questions too. So what we're
seeing in aerospace, for instance: they
can use simulation technologies and
these AI agents that are able to modify
the design of thrusters and then
simulate. They say, we're trying to hit a
certain output, right? They want to have
some type of output for their
engines, and so the agent is going to
keep going and optimizing and changing
until it gets to
this metric. Something similar could
happen in robotics, and this is something
that we're openly trying to
research. First and foremost is the
embodiment. The hand, for instance:
the morphology of the hand. Do you
need three fingers? Do you need five
fingers? Is two fingers enough? Um what
exactly do we need to get the task done?
And it's so hard right now because,
like I said, the manipulator space,
so like hands, the ecosystem is
just beginning, and we're
starting to see awesome hands coming to
market this year. And because of that, I
think what you're describing, being able
to look at the problem first and then
start evaluating, well, which robotic
components do I need in order to
accomplish this problem, it'll start
coming as, you know, the
hardware matures, because then you have
something that you can actually base it
against. You actually
don't want to build a new robot all the
time, similar to like a car, right? You
actually like having tier-one
providers because it allows you to have
some consistency in your components.
Otherwise, if you have to manufacture
all of your own actuators and all of
your own internal components, it becomes
a huge drain on operational
resources.
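The design loop described above, where an agent keeps changing a design and re-simulating until it reaches a target metric, can be sketched roughly as follows. This is only an illustrative sketch under loud assumptions: `simulate_grasp_score`, its scoring formula, the parameter names, and the simple random hill-climbing search are all hypothetical stand-ins, not NVIDIA tooling or anything named in the interview.

```python
import random


def simulate_grasp_score(fingers: int, finger_length_cm: float) -> float:
    """Hypothetical stand-in for a physics simulation; returns a score in [0, 1]."""
    # Toy assumption: the score peaks at 4 fingers of 9 cm each.
    finger_term = 1.0 - abs(fingers - 4) / 4.0
    length_term = 1.0 - abs(finger_length_cm - 9.0) / 9.0
    return max(0.0, 0.5 * finger_term + 0.5 * length_term)


def optimize_hand(target: float, max_iters: int = 2000, seed: int = 0):
    """Perturb the design until the simulated score reaches the target."""
    rng = random.Random(seed)
    design = {"fingers": 2, "finger_length_cm": 5.0}
    best = simulate_grasp_score(**design)
    for _ in range(max_iters):
        if best >= target:
            break  # target metric reached, stop changing the design
        # Propose a small change, clamped to plausible (assumed) bounds.
        candidate = {
            "fingers": min(5, max(2, design["fingers"] + rng.choice([-1, 0, 1]))),
            "finger_length_cm": min(
                15.0, max(3.0, design["finger_length_cm"] + rng.uniform(-0.5, 0.5))
            ),
        }
        score = simulate_grasp_score(**candidate)
        if score > best:  # keep the change only if the simulation improves
            design, best = candidate, score
    return design, best
```

Calling `optimize_hand(0.9)` keeps perturbing the toy hand design until the stand-in score reaches the target or the iteration budget runs out. A real workflow would replace the scoring function with an actual simulator run.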
>> That makes a lot of sense. So, even if
there is a hand that's better for a
specific task, you might want to default
to a generalized hand because then you
can just do a lot more with it besides
that one task.
>> And so it depends on the mix of
tasks. Uh, so an end-to-end robot, the
promise is that I could go to one work
cell here and then go to another work
cell here, do totally different tasks,
and use the same robot.
>> Yeah.
>> Right. That's the goal because that's
what a human can do. And so if we're
overly specialized, then we're still
stuck in specialists. So a generalist
allows you to go, you know, between the
field and that's that's where we want to
get to. And so understanding your mix of
tasks, the environment, that helps
inform us on the hardware. Um, yeah, I
think it'll go the direction that you're
describing, which would be super awesome
because there's going to be things like
assembly where okay, what would be the
best way to assemble this? Um and how do
you work backwards from there? I think
that that could definitely be a path
that we look forward to.
>> Spencer, you've
clearly seen this whole industry evolve
very rapidly over the last couple years.
Is there something you're most excited
about or looking forward to? Like what's
the next thing that you're super pumped
about in
>> I am super excited for neural
simulation. You'll hear this a lot.
Uh Cosmos is a world model and so it's a
neural simulator. It's been trained on
the dynamics of the world around it. Um,
these world models today, uh, are
improving and they're starting to show
extreme utility as part of our
policies. So when you look at Alpamayo
for autonomous vehicles that Jensen was
talking about, we can use these
reasoning models, these world models um
to actually be on board the car to
actually help us navigate through the
world. So the same thing could happen
for robots. You know, robotics
is, um, the oldest and newest industry in
the world in a lot of ways. The
end-to-end autonomy models are things where
the journey there is going to take us
quite a while and so we're going to see
quite a few evolutions of model
architectures between here and there.
>> Yeah.
>> And so we start with VLMs made a lot of
sense. Let's give robots the ability to
semantically understand their world and
reason about the world. The next step is
how do we make sure that the world
models are actually trained and
conditioned off of the types of inputs
that a robot would have. Meaning that as
a human I have perceptive input and
non-visual perceptive input. I have
contact and I have action and all
these. So it's not trained on language.
We're not trained on language. We're
trained on all five senses. And so
language, visual, we need all the other
bits. And so as we start getting world
models that are able to um either be
conditioned off of these or output
these, we're going to start seeing an
influx of totally new models and
capabilities. And so I'm super
excited for neural simulation for that.
Excited for neural simulation for data
generation and policy evaluation. It's
it's definitely a game changer for us.
I'm very excited.
>> What an exciting time to be alive.
>> Cosmos is going to be, it's an awesome
technology. I'm super excited. If you
guys haven't learned about it, you guys
go read about it. It's awesome.
>> I'm going to do just that. Thanks so
much for your time.
>> Wonderful meeting you, Alex. Thank you
for having me.
>> A huge thank you to Spencer Hang for
walking us through Nvidia's robotics
ecosystem, their three-computer approach
to training, simulation, and inference,
and the biggest opportunities and
challenges in physical AI today. And if
you really want to understand robotics,
join me for NVIDIA GTC. You can register
for free with my links below and jump
into as many online sessions on robotics
and AI as you like. I'll announce the
winner of the RTX 5090 giveaway a few
days after the conference. So, make sure
to enter. Another huge thank you to
Nvidia for sponsoring my travel and my
media access to cover GTC Live and to
you for supporting the channel. Thanks
for watching and until next time, this
is Ticker Symbol: YOU. My name is Alex,
reminding you that the best investment
you can make is in you.
This video features an interview with Spencer Hang, product lead for robotic software at NVIDIA, discussing the future of physical AI. Hang explains NVIDIA's 'three-computer' stack approach: a computer for training models, one for simulating environments (Omniverse), and one for real-world deployment (Jetson/IGX/AGX). The conversation covers the challenges of robotics data, the importance of simulation for training and validation, and the transition from specialized robots to generalized ones that can learn new skills, mirroring human development.