E21: NVIDIA'S HUGE AI Chip Breakthroughs Change Everything
I'm really excited to show you some big
insights I just learned about Nvidia.
Most investors think Nvidia builds chips
for AI training, but I'm going to give
you an exclusive look at a very
different side of the story. I'm joined
by Dion Harris, Nvidia's senior director
of high performance computing, cloud,
and AI infrastructure go-to-market. Dion
has been with Nvidia for 9 years
deploying the hardware and software
infrastructure powering some of the
biggest AI models on the planet, not
just for training, but also for
inference. I asked Dion as many in-depth
questions as I could, and he had a few
surprising things to say about where the
AI market could be headed next. Your
time is valuable, so let's get right
into it. I'm going to jump right into
the hard questions if you don't mind.
You know, when I think about Nvidia, I
think Nvidia is really widely known for
its leadership in AI training. But for
investors who are newer to AI, could you
kind of explain how Nvidia's GPUs and
their infrastructure at large support
all the phases of AI?
>> Yeah. So that's a very key
observation. When AI really
first popped on the scene, there was
really a race to create, you know, the
most capable foundational
models, right? And so that's
where you had models like ChatGPT, you
had Claude, you had a number of other
foundational models that were really
being built and trained at scale. And so
training is exactly like it sounds. It's
teaching the model foundational
knowledge about the world. That's where
you hear models being trained on the
entire content of the internet, for
example. And it's really just helping
the models learn and understand, you
know, semantic language, learn and
understand different meanings across
different modalities. And so that's
really foundational knowledge. And then
the next part of providing intelligence
is what we call post-training. And so
that's once you've taken a foundational
model that has very base foundational
knowledge, just like the name
describes, but then you inject
additional specialized knowledge. So it
might be post-training a foundational
model to understand your particular
industry. For example, healthcare has
very specific terminology, very specific
uses of certain words. So when you say
"cell" in healthcare, it means
something different than when you say
"cell" in the legal profession, for
example. So understanding the nuances of,
you know, a specific industry or specific
company, that's really where you're
injecting more specialized intelligence
within your post-training phase.
And then inference is really where you
put AI to use. In other words, you've
trained a model, you've fine-tuned a
model, and now you want to actually use
it with end users, with customers, with
partners, and really deploy and extract
the value out of AI. And so that's
really where we are right now in terms
of all the investment in AI
infrastructure. A lot of it in the early
days was to build training clusters. Now
that AI has reached sort of a critical
tipping point in terms of intelligence
and utility, we're seeing it deployed
for inference at scale.
>> And I think the
real big insight you just said there is
it's not even just two steps, right?
It's not just training and inference.
The post-training step is incredibly
important as well. You know, I've been
diving deep into AI for a while now. And
one thing I've realized is that
inference is more than one step as well.
So it's not just training, post-
training, and inference. For example,
there's prefill and decode, each of
which has very different computing
requirements. Can you sort of explain
what these phases are and how Nvidia
addresses each one
individually?
>> Yeah, sure. So, like you described,
inference itself can be decoupled into a
couple of different workloads. There's
prefill which is really where you're
taking the context of the query. You're
processing and understanding what the
user wants. So it might be just the
question or the prompt they enter. It
might be the document that they upload
so that you can draw upon that document
as source information to build a
response. Or it may be previous
responses that the user has
generated or created over the life of
engaging with that model. So the prefill
is really about understanding the
context for the question or for the
prompt being asked. Once it does that,
it can generate a single token and then
you move into the decode phase. The
decode phase is actually doing the
autoregressive prediction of each token
that comes as a result of the AI
generation model. And so the
reason why that's really interesting is
when you think about those specific
workloads, they have slightly different
infrastructure requirements if you will.
So when you think about the prefill, or
the context-heavy phase, it's really compute
heavy. It's really focused on
understanding all the different
tokens that are being put into the
system to formulate contextual
awareness. However, when you look at
decode, it's much more memory latency
bound. And so it gives a lot more
credence to having things like HBM, or
high-bandwidth memory, to really generate
those tokens very quickly. Now the other
thing to really think about is there's
no one-size-fits-all, right? Each model,
each user profile might have a different
balance or mix of prefill and decode. And
so part of what makes
inference particularly challenging is
that you don't have a, you know,
sort of prescriptive way of
understanding exactly what that mix
should be, because it's going to vary
depending on users, depending on the type
of requests that they're
asking. So for example, if they're doing
a deep research project, it's going to
be a lot more prefill heavy than decode
because you need to go through and sift
through all those PDFs that you might
upload to go and understand all the
contextual information. However, if
you're going to ask the AI to produce,
you know, a long, you know, in-depth
code base, it might be more decode heavy
because it really needs to understand,
you know, just all the interdependencies
and run through lots of reasoning chains
to create high quality code. So again,
it's going to vary depending on the
use case and the actual user sort of the
user objective and intention.
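A rough way to see why prefill tends to be compute heavy while decode leans on memory is to compare arithmetic intensity, meaning floating-point operations per byte of model weights read. The sketch below is illustrative only. It assumes about 2 FLOPs per parameter per token, a single request, weights read from memory once per forward pass, and it ignores attention and KV-cache traffic. The function name, the 70-billion-parameter model, and the 1-byte weights are hypothetical choices, not figures from the conversation.

```python
def arithmetic_intensity(tokens_per_pass: int,
                         params: float = 70e9,
                         bytes_per_param: float = 1.0) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    Assumptions (not from the conversation): ~2 FLOPs per parameter per
    token, weights read once per pass, attention and KV cache ignored.
    """
    flops = 2 * params * tokens_per_pass
    bytes_read = params * bytes_per_param
    return flops / bytes_read


# Prefill: a long prompt is processed in a single pass.
prefill_intensity = arithmetic_intensity(tokens_per_pass=100_000)

# Decode: each step produces one token but still reads all the weights.
decode_intensity = arithmetic_intensity(tokens_per_pass=1)

print(f"prefill: {prefill_intensity:,.0f} FLOPs per byte")
print(f"decode:  {decode_intensity:,.0f} FLOPs per byte")
```

Under these simplifying assumptions the prefill pass does far more math per byte moved, which is one way to read the "compute heavy versus memory bound" split described above. Real serving systems batch many requests together, which changes the decode number considerably.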
>> Yeah. So let me make sure I
understand this just for myself too. So
prefill is really about understanding
all of the context at once, right? So
all the PDFs I upload, if I have a very
detailed prompt, if I give an AI tons and
tons of links, you know, so the bigger
the context window and the more I fill
that up, the more compute intensive the
prefill step is as opposed to the decode
which is more about, you know,
understanding the tokens in sequence,
adding the next token, then reanalyzing
that whole sequence to add the next
token, which is very different from
understanding all the data at once. Is
that like a fair high-level overview?
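The two phases in that summary can be sketched as a toy loop. This is a minimal illustration, not how any real model or Nvidia software works: `toy_next_token` is a hypothetical stand-in for a model, and the plain list named `cache` stands in for the key/value cache a real transformer keeps.

```python
from typing import List, Tuple


def toy_next_token(cache: List[int]) -> int:
    """Hypothetical stand-in for a model: the next token depends on everything cached."""
    return (sum(cache) + len(cache)) % 50


def prefill(prompt: List[int]) -> Tuple[List[int], int]:
    """Process the whole prompt in one pass and produce the first output token."""
    cache = list(prompt)  # a real model would store keys/values for every prompt token
    first_token = toy_next_token(cache)
    return cache, first_token


def decode(cache: List[int], first_token: int, max_new_tokens: int) -> List[int]:
    """Generate tokens one at a time, each step consuming the previous output."""
    output = [first_token]
    while len(output) < max_new_tokens:
        cache.append(output[-1])  # only the newest token is added to the cache
        output.append(toy_next_token(cache))
    return output


def generate(prompt: List[int], max_new_tokens: int) -> List[int]:
    cache, first_token = prefill(prompt)
    return decode(cache, first_token, max_new_tokens)


print(generate([3, 1, 4], max_new_tokens=3))
```

Prefill touches every prompt token in one pass, while decode is a sequential loop whose step count grows with the length of the answer, which is why the two phases stress hardware differently.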
>> Great summary. So I appreciate
that. Yes, you always do a great job of
taking my long-winded examples and
making them very understandable and
distilled. So I appreciate that.
>> No, no, no. And sorry, I want to just
make sure like we're really clear about
what we're talking about because, you
know, with those very different compute
requirements comes the need for very
different hardware, right? So I know
that Nvidia recently announced their
Rubin CPX GPU, which is a GPU
specifically for the prefill step. So
can you tell us a little bit about the
CPX?
>> When we think about CPX, it's
really purpose-built for what we call
the million context workloads. So that's
for things like advanced code
generation, right? Where you have lots
of data that you need to input. You
might have an entire application
codebase, and if you want to understand
all the interdependencies, how they
actually deliver, you know, sort of the
end-to-end optimization, you would want to
understand the full million tokens when
you're generating the prompt or
generating the code. There's also things
like video generation right where you
need to have contextual consistency and
awareness. So, if you have a two-hour
video, if you have a million tokens, you
can make sure that as you move
throughout the scenes, you can, you
know, keep the scene integrity and make
sure the characters have consistency
and different elements of the story play
through because you have this large
context window. So, again, these are
just some cutting edge use cases that
we're starting to see that will rely on
what we call the million context
workload.
>> Yeah. So what I'm really taking away
from this is prefill is very compute
intensive but not very memory intensive.
So the CPX, the goal of the CPX is to
provide all that compute but lower the
cost of the memory since you don't need
it at that step yet anyway. So then my
next question is will we see a separate
chip with the opposite specs for the
decode phase, right? Something that's
very memory heavy but compute light.
>> So what I would say is when you look at
sort of how we've designed our Rubin
platform, you know, most, I would say 80%
of the workloads will do great on a
classic Vera Rubin platform but we think
for these large or what we call extreme
massive context workloads like I
mentioned like some of the code
generation video generation and others
those are ones where you want to
actually specialize where you can really
get some significant bang for the buck
by actually having a specific
prefill-built processor.
>> Got it. So Rubin is already great
at decode, and it was the prefill step
that needed a separate phase-optimized
chip, for lack of a better
word.
>> Exactly. Exactly.
>> That's awesome. That really helps me
understand like the, you know, the whole
ecosystem. And I guess that's a really
good lead-in to another question I had.
You know, when most investors think
about AI, they think about AI
performance really being driven by the
GPU, right? So when we're talking about
the Rubin versus the Rubin CPX, we're
talking about prefill and decode being
driven primarily by these two GPUs,
right? But you know, Jensen just gave a
great keynote at GTC DC, and he talked
about how Nvidia co-designs at least
five other chips every generation,
right? A CPU, a DPU, NVLink. Can you
help explain how all these other chips
fit together?
>> Yeah, no, and that's
a great question, because when you
look at sort of the generational
leaps that we're describing in a lot of
our platforms, you can't get there
just by adding more transistors to a
single processor. You know, Moore's
law has sort of tapered out
decades ago, and so we recognize in order
to create these sort of massive leaps in
performance, it requires what we call
extreme co-design, and that involves
not just looking at the GPU itself but,
like you mentioned, the CPU, making sure
that you have tightly coupled access
to memory and processing power across
the CPU and GPU. It actually leverages
the BlueField, or the data
processing unit, as you're moving
information not just within the GPUs but
moving it to and from storage or getting
access to the node itself. So having
integrations with the software to make
sure it takes advantage of not just the
CPU and GPU but also the DPU. And then
of course when we think about what's
happening with the scale of AI, it's no
longer fitting into a single GPU or even
a single node. So the scale up
architecture is critically important
now. In other words, being able to have
several GPUs, CPUs, and processors
behave as one. And so that's where our
NVLink Switch technology comes in. We
build a specific chip around the
switching technology that allows for
seamless scale up. And then, of course,
it's not just scale up, you have to be
able to scale out. And so, we've
developed our Spectrum-X Ethernet
switching technology to scale out, you
know, to hundreds of thousands and
millions of GPUs. And then we also have
our InfiniBand network that also allows
you to scale out. So again, it's really
just depending on how customers
choose to scale, but giving both options
is really core to our platform. And I
think when you look at all of these
elements together, you know, that's
really the core of how we think about
building systems, not just on a single
chip, but looking at the full system, from
compute to networking. And then on top
of that, you also have to think about
how do you integrate with the models?
How do you, you know, work with the open
source community to get them to build
models and software that takes advantage
of all the underlying hardware? So, I'll
give an example. We released
Blackwell with a precision called
NVFP4. NVFP4 is useless unless you teach
the software to understand, you know,
how to use that lower precision smartly
or intelligently. And so, a lot of the
work that we do in terms of this extreme
co-design doesn't just stop at the
hardware layer. It actually reaches into
the model developers and builders to
make sure that we're helping them
leverage all of the different hardware
innovations that we're making available
through our architecture. So, like I
said, there's co-design happening at
the CPU, the GPU, the DPU, the
networking scale up as well as the
networking scale out, in addition to the
models and applications themselves. And
that's why we've described this as an
annual rhythm. And that is sort of, you
know, for two key reasons. models are
evolving so quickly. Therefore, we have
to evolve our platform just as quickly
to make sure that we're keeping pace and
unlocking sort of the next wave of use
cases. So for example, like I said, we
talked about Blackwell last year, and
we've already announced Blackwell Ultra,
and of course we've rolled out Vera
Rubin, and we've already rolled out Vera
Rubin CPX. And so again, it's really
that sort of yearly cadence that
gives us the ability to keep
leapfrogging ourselves and providing
more performance and more value. So
that's really the core
focus of our strategy and how we
want to, you know, deliver this to the
market.
>> It's an insane pace and it's so
cool to see the evolution of the
hardware year after year after year. I'm
really excited for next March when I
hopefully get to touch the Rubin chips
for the first time, you know, and see
like the Rubin version of the stack
we've been going through ever since
Hopper. So, you know, you're describing
some massive technical wins here, but as
an investor, I think what we all really
want to understand is how AI relates to
real businesses, right? So, can we take
a step back and talk about why
businesses should care about these kinds
of inference speeds and compute
efficiencies we've been talking about
this whole time?
>> You know, inference is
how you extract the value from AI. In
other words, building a model doesn't
create value unless you can use it and
apply it to solve business problems,
right? And so, inference is really where
the rubber hits the road in terms of
getting AI to solve a business problem.
And so once you take a step back and
say, okay, if you're leveraging AI to go
and solve a business problem and if
you're doing that at scale, this is what
we call an AI factory. And so, you know,
going back to, you know, the turn of the
century, the factories were about, you
know, putting raw materials in and
getting some finished product out.
Today, when we say factories, it's about
putting energy and electricity in and
systems and components in and getting
intelligence out. And so when we
describe some of these performance
improvements, think of it as, you know,
how much more intelligence can I produce
per dollar or per watt. And once you
think about it in those terms, you really
quickly begin to see that efficiency is
really the biggest driver on how you're
going to get a return on your AI
investment. So to the extent that you
can improve your overall inference
performance per watt, for example, if
you're a power-limited data center, which
most data centers are today, you
know, you're trying to think, okay, how can
I get more intelligence out of that
power envelope? And so a lot of these
sort of improvements that we describe,
where we're describing the X factors,
these really
translate into actual dollars and cents
in terms of how many more tokens can be
generated and therefore how much value
can be extracted out of that AI factory.
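The "intelligence per watt" framing can be turned into back-of-the-envelope arithmetic. The sketch below is a hypothetical illustration: the 1-megawatt power budget, the tokens-per-second-per-watt figures, and the price per million tokens are made-up inputs chosen only to show how an efficiency multiple flows through to output under a fixed power envelope.

```python
SECONDS_PER_DAY = 86_400


def tokens_per_day(power_budget_watts: float,
                   tokens_per_second_per_watt: float) -> float:
    """Tokens a power-limited facility could produce in a day at full utilization."""
    return power_budget_watts * tokens_per_second_per_watt * SECONDS_PER_DAY


def daily_revenue(power_budget_watts: float,
                  tokens_per_second_per_watt: float,
                  price_per_million_tokens: float) -> float:
    """Revenue if every generated token is sold at a flat price (an assumption)."""
    tokens = tokens_per_day(power_budget_watts, tokens_per_second_per_watt)
    return tokens / 1e6 * price_per_million_tokens


# Hypothetical inputs: same 1 MW power budget, old versus 10x more efficient hardware.
baseline = daily_revenue(1_000_000, 1.0, price_per_million_tokens=2.0)
improved = daily_revenue(1_000_000, 10.0, price_per_million_tokens=2.0)

print(f"baseline: ${baseline:,.0f} per day")
print(f"improved: ${improved:,.0f} per day")
print(f"ratio:    {improved / baseline:.1f}x")
```

With power held constant, the efficiency multiple passes straight through to token output, and to revenue if the price per token holds. In practice prices and utilization move as well, so this is an upper-bound style illustration rather than a forecast.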
And if you happen to be an AI factory
that's producing tokens and receiving
money for tokens, it is a direct
correlation in terms of how much
throughput can you generate for a given,
you know, power envelope or a given sort
of investment value. And that has a
direct correlation with how much revenue
and therefore profit you can generate
from that AI factory.
>> That's really
interesting. So we should see that come
up in companies' revenues and profit
margins as they start leveraging more
and more AI for more and more use cases
as the cost comes down. But profits and
margins are really something you see in
the rear view mirror, right? So one of
the questions I have is like I try to
look at forward-looking indicators as an
investor. What benchmarks or metrics can
we focus on to better understand like
the real business value for inference in
real time?
>> As you drive more
performance, more throughput per dollar,
per watt, that actually reduces the cost
per token. And when you reduce the cost
per token, you can actually embed that
AI into even more services, even more
use cases, and therefore deliver more
value to your end users. When you think
about AI, it's a lot more than LLMs. So
it includes image classification. It
includes video generation or diffusion
models. It includes, you know, lots of
different types of recommender systems
that are being used to serve ads and
content. And so when you think about, you
know, where we are today, we're in a
fairly, you know, demand-driven economy,
meaning there's a huge demand for a lot of
these AI capabilities. But again you
have to be able to do it intelligently
and smartly. If you can drive the cost
down to zero, now you can
literally embed these AI APIs into every
application that you're running. And
therefore, that's when you really start
to see this ubiquitous use of AI. And so
that's really why when we think about
how we want to drive more performance
and more efficiency,
the cost per token going down by 10x
will actually increase the overall
utilization by 20x because now you have
a lot more use cases where you can
afford to embed these AI capabilities.
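The 10x and 20x figures in that answer imply a simple relationship worth spelling out. The sketch below takes those two numbers at face value as an illustration, not as measured data, and the function names are hypothetical. It computes what happens to total spending on tokens and what price elasticity of demand the example implies.

```python
import math


def spend_multiplier(cost_drop: float, usage_gain: float) -> float:
    """Change in total spend when unit cost falls by cost_drop and usage rises by usage_gain."""
    return usage_gain / cost_drop


def implied_elasticity(cost_drop: float, usage_gain: float) -> float:
    """Constant-elasticity reading: usage_gain = cost_drop ** elasticity."""
    return math.log(usage_gain) / math.log(cost_drop)


# Figures quoted in the conversation: cost per token down 10x, utilization up 20x.
print(f"total spend changes by {spend_multiplier(10, 20):.1f}x")
print(f"implied elasticity is about {implied_elasticity(10, 20):.2f}")
```

An elasticity above 1 is the case where cheaper tokens lead to more total spending on tokens rather than less, which is the point being made about new use cases becoming affordable.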
>> Yeah. Right. For every, you know,
dollar the cost goes down, the demand
goes up by more than that same amount,
right? Because now
>> Exactly. Exactly.
>> maybe use cases that couldn't afford it at all
can now jump in, and so you're increasing
like the whole surface area of AI
overall.
>> I think that's a really
important point to understand about
inference is there's a lot of levers,
there's a lot of complexity and so it
really is in some ways harder than
training. I think oftentimes people
assume, you know, that Nvidia has an
advantage because we've been doing
training for so long, but we think our
true advantage lies in our ecosystem and
our software maturity, which is
really going to go and tackle inference
and make, you know, make our platform even
more valuable in inference than it is in
training in a lot of ways.
>> Yeah. And I mean, just sort of two
points there, right? One, inference means
something very different than it did
three years ago when ChatGPT first came
out, right? One shot
back then versus reasoning today. Now
inference is a much more compute
intensive workload. And two, something
that we hear Jensen say all the time is
inference and training are actually
going to one day be one process, not two
separate processes. What do you think
about that statement?
>> I think it's a pretty fair statement and
I would even correct it a little bit.
I would think that one day is
actually today. And in fact, if you look
at most reasoning models, and you
hit on this earlier, the way that you
create a reasoning model, you train it
with lots of inference. And so
it's giving you this iterative feedback
loop by teaching it
how to reason and rationalize by
leveraging inference outputs and then,
you know, feeding that back in through the
backprop that happens during training. So it
really does leverage inference while
you're delivering the training
capabilities as well. So those workloads
or processes are already starting to
merge to the point where they're
indistinguishable, quite frankly.
>> You know, I find it so crazy how fast AI
is moving. Jensen was on a recent
podcast where he said that the demand
for inference will rise by more than a
billionx over the next few years. So,
you know, we talked a lot about Nvidia's
platforms today, but what is Nvidia
doing to keep up with the insane growth
in demand? Like, how can we expect
Nvidia to keep up with next year and the
year after that and the year after that?
>> Well, I mean, if I had to put it into,
you know, two words, it would be extreme
co-design. And one thing I just wanted
to highlight is, you know, when you
describe the benefits of this extreme
co-design approach, with
the InferenceMAX
results, we demonstrated a 10x perf
per watt over our previous-gen Hopper
platform. So 10x, Blackwell versus Hopper,
in one generation. There's no way you
can get 10x out of just, you know,
delivering more transistors in a
GPU. It really took the entire extreme
co-design approach in terms of
leveraging scale-up NVL72, which
allowed us to do lots of different
parallelization techniques. It also
created, you know, an opportunity for us
to leverage our Dynamo software by
leveraging disaggregated serving, and then
of course, you know, the more
transistors and the better perf with FP4,
while also maintaining the accuracy
within the inference model. So all these
things together are what translated into
that 10x delivered performance. And so,
you know, just highlighting that is
really sort of unthinkable, like you
never would have thought of getting 10x
in a generation-over-generation, you
know, sort of improvement. But this
is really what this approach
brings. And like I said, we're not
focusing on that single GPU. We're
looking at the entire system and now
we're looking at the entire, you know,
infrastructure supply chain and pipeline
to drive even more efficiency. So, you
know, just a case in point, but, you
know, I thought that was a key point to
highlight.
>> No, I think that makes a lot
of sense. You know, what I'm really
hearing you say is Nvidia doesn't just
rely on Moore's law, right? Like chips
aren't just getting whatever it is now
1.5 times better. Let's just call it two
times better every two years. It's
really about optimizing across the whole
stack. And when you take a step back and
do that, not even at the tray level or
even the rack level, but the data
center level, and you focus on
everything all at once, which is why you
guys co-design so many chips. That's how
you achieve 10x performance every year
as opposed to 2x performance every two.
>> Absolutely. Absolutely.
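To make the gap between those two cadences concrete, here is a small compounding sketch. It takes the host's framing at face value: "10x every year" and "2x every two years" are the numbers from the summary above, while the four-year horizon and the function name are hypothetical. The 10x figure cited in the conversation was Blackwell versus Hopper performance per watt in one generation, so extending it forward is an illustration rather than a projection.

```python
def cumulative_gain(gain_per_step: float, years_per_step: float, years: float) -> float:
    """Total improvement after `years` if each step multiplies performance by gain_per_step."""
    steps = years / years_per_step
    return gain_per_step ** steps


horizon_years = 4  # hypothetical horizon chosen for illustration

full_stack = cumulative_gain(10.0, years_per_step=1, years=horizon_years)
transistor_only = cumulative_gain(2.0, years_per_step=2, years=horizon_years)

print(f"10x per year over {horizon_years} years:     {full_stack:,.0f}x")
print(f"2x per two years over {horizon_years} years: {transistor_only:,.0f}x")
```

Because the gains compound, even a short horizon produces a difference of several orders of magnitude between the two rates.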
>> Makes a ton of sense and it really puts
I think the whole conversation in one
unified context, right? Like how do we
drive performance at the whole data
center scale
>> And then what makes it even more complex
is, as if that isn't hard enough, we then
take it and say, how can you disaggregate
it completely? How can you run our GPUs
with other networking? How can you
take your GPUs and include them with our
new NVLink Fusion, which allows you to
use our scale-up architecture? So we're
not only making this fully integrated
co-design stack. We're also making it,
you know, sort of modular enough and
disaggregated enough such that we can, you
know, plug in wherever the user is
and make sure that we can build a
solution that's right for them. Even
though we deliver with speed of light
in terms of the full stack, we want
to recognize that every user is
different and every user has different
business objectives. And so, you know,
as if it wasn't hard enough to build a
fully integrated stack, we're also
building it, you know, disaggregated so
it can be consumed in so many different
ways. So that's just another layer
of complexity that we're kind of
imposing on ourselves, because it
makes perfect sense from a data center
builder perspective. They're already
deploying NVL72 scale-up architecture for
the majority of their data center. Why
not look at standardizing as a
way to scale up and scale out their
architecture? So it's a lot of
excitement, but again I think it's just
another dimension by which we are, you
know, trying to create value: not just
giving you exactly what we build but
giving you the pieces that actually add
value for your deployments. In
fact, at GTC, we announced something
called DSX, which is basically a
mixture of digital twin capabilities
along with gigascale AI factory
reference blueprints that helps the
entire ecosystem build to a common
design of figuring out how do we build
that gigawatt scale AI factory but as
efficiently as possible by leveraging a
lot of our core best practices as well
as building within the digital
environment, the digital twins,
to really make sure we can build,
operate and design these systems much
more effectively. So all of that is
going to really power the next wave of
AI, quite frankly.
>> The way everything fits together,
both like, you know, the co-designed chips,
then how that scales up to the data
center level and even across multiple
data centers then all the software and
control systems that sit on top of it.
Is there any one thing out of everything
we've talked about so far that really
excites you the most as you look to the
future?
>> Well, I think it's really this
notion that we are working as a
collective, right? We are literally
working across every partner ecosystem,
every developer ecosystem to bring sort
of these solutions together that will
hopefully power the next wave of AI. And
so Nvidia recognizes we can't do it
by ourselves. And that's why we've
always had this sort of approach of
developer first. You know, from the very
early days of accelerated computing
becoming a new programming model, it
resided in identifying applications and
developers that could extract the value
out of that platform. So we take that
same approach today as we look at the
next wave of AI. How do we create the
conditions, you know, with new
processors like CPX? How do we create
the conditions with new software like
Dynamo or new you know sort of
architectures that can you know unlock a
whole new set of use cases for all of
the developers and so once we look at
that um you know through all the
customers we're talking talking to all
the feedback we're getting it's really
exciting to see how the light bulbs are
going off in their heads thinking what
would I be able to do if I could have
these different capabilities and so as
we sort of you know position our
products and platforms It's really
exciting to see how the light bulbs are
going off in the developer community's
heads of how they're going to leverage
that to build things that we haven't
even thought of yet. So, pretty
exciting stuff. Like you said, tiring
because it's a relentless pace, but, you
know, again, these are definitely
exciting times, and I think Nvidia
is honored to be kind of at the
epicenter of this whole
incredible transformation.
>> And you know
what I've really learned and taken away
from this conversation is you know when
we think about AI and Nvidia, it's about more
than just training, right? There's
so much compute and there's so much
consideration that goes into the
inference side of things that Nvidia has
to build specialized chips for inference
and co-design many chips to make
inference work at large scales. I
have a whole new appreciation for
everything that Nvidia does across the
entire stack when it comes to inference
specifically. So, I can't thank you
enough for your time, Dion. I'm super
excited about what's to come, and
thank you very much.
A huge thank you to
Dion Harris for walking us through
Nvidia's hardware ecosystem and how it
powers every phase of AI from pre and
post training to prefill and decode for
inference and not just for large
language models, but for everything from
image and video generation to recommender
systems, robotics, and beyond. And of
course, thank you for watching and
supporting the channel. Until next time,
this is Ticker Symbol: YOU. My name is Alex,
reminding you that the best investment
you can make is in you.
This video features an in-depth conversation with Dion Harris from Nvidia regarding the company's critical role in AI infrastructure. The discussion highlights that while Nvidia is renowned for AI training, a significant focus is now on inference, which involves complex tasks like prefill and decode. The video explores how Nvidia uses extreme co-design across GPUs, CPUs, DPUs, and networking to maintain an annual release cadence and achieve 10x performance gains per generation. It emphasizes that efficiency is the core driver for business returns in what is increasingly being called 'AI factories,' where throughput and power consumption are key metrics.