E22: NVIDIA'S HUGE AI Announcements Will Change Everything
I'm excited to share this exclusive
interview with the investing community.
Most people think of Nvidia as a
hardware company that builds chips to
train massive AI models, but you're
about to get an inside look at a very
different side of the story. I'm joined
by Joe Dalaire, product lead of AI
infrastructure at Nvidia. Joe spent the
last 4 years deploying the hardware and
software behind some of the most
powerful AI models on the planet, and he
shared a few surprising insights about
where AI is headed next. But, that's
just one of the many technologies that
I'll be covering live at GTC in a few
weeks. GTC is Nvidia's massive AI
conference, showcasing the biggest
breakthroughs in robotics and
self-driving cars, AI agents and the
chips that power them, and a whole lot
more. And anyone who signs up for a free
online session at GTC with my link can
win an Nvidia RTX 5090 graphics card.
Just attend any session, take a
screenshot as proof, and send it to me
after the conference using the links
below. GTC should be on every investor's
radar, and so should Nvidia's ecosystem
for AI inference. Your time is valuable,
so let's get right into it.
I'm so happy to be here with you. Thanks
for taking the time, by the way.
>> Okay. Jensen talked about a lot of
awesome things at the keynote, and one
of the things that he talked about in
detail is that Nvidia actually
co-designed six different chips for the
Vera Rubin generation.
That's a lot to go through, so I'd love
to go through all of it with you,
starting from the GPU itself and working
all the way up to the rack-scale system
level, if that's okay. So, let's
just start with Rubin itself. What's
the difference between Blackwell and
Rubin?
Oh, so there's several different things
about Rubin that are different than
Blackwell. So, we have the six chips
that you talked about, all of them
co-designed together. So, what we did is
we looked at the data center
requirements,
and we worked our way backwards and
said, "What do we need in all these six
different chips to make sure that we get
the best performance, the best energy
efficiency, the lowest cost?" Yeah. So,
the fundamental thing about
Rubin is this extreme co-design. All
these chips
manufactured together, designed
together, working in concert for the
best performance.
>> And when you say you looked at the data
center requirements, are those being
driven by AI models today? Or what like
what's driving those requirements?
>> Absolutely. Models are definitely the
thing that are driving this compute
demand. And MoE models in particular,
mixture of experts, where they're
generating many, many tokens, factors
more tokens, because of the reasoning
that they do.
Also, the model sizes are growing as
well. So, they're getting more
intelligence from model size, from
reasoning.
So, that is just generating a tremendous
amount of
compute demand. And Rubin is designed to
address that. Got it. So, talk to me
about the difference between Blackwell
and Rubin, the GPUs specifically in
terms of power and performance.
So, in terms of power and performance,
for inference workloads, we will see up
to 10x better performance on Rubin
versus Blackwell.
Wow. 10x performance per watt. So, that
means that So, at given
fixed latency, you can see with those
parado charts that we've shown in
Jensen's keynotes, at a particular
latency, a very, you know, low latency,
that's very
good for users of the model. So, yeah,
the 10x performance is across the rack
scale. Is it at the rack scale?
>> Rack scale architecture. So, here we
have the Blackwell Ultra generation
compute tray. And I can show you what
what we have here in terms of the
components and their breakdown.
So, we have two superchips. Two
superchips, okay.
>> Each superchip has
two Blackwell Ultra GPUs
and one Grace CPU,
and there's two of them
together, so four GPUs and two CPUs.
We also have ConnectX-8 SuperNICs
that are also part of this superchip,
and that's going to be an important
distinction when we talk about Vera
Rubin later and how those have been
moved. Um but yeah, you can see that
this is hybrid cooled. Okay.
>> So, these are cold plates doing the
liquid cooling on the superchips and all
their components. And then on the bottom
half of the tray, or the front half,
I should say, this is air cooled. So,
these are all What I'm actually looking
at is the tops of all fans, right?
>> Eight fans here.
>> Got it. So, eight fans and then we have
a BlueField DPU that is part of
this tray as well. That is for the
north-south traffic, connecting the
storage, getting the data in to
the compute rack so that it feeds
the GPUs. Got it. So, yeah, the DPU
brings data in and out. Yes. And then
all the magic happens in
the superchips themselves. Got it. So,
there's two kinds of network traffic.
North-south is inside the same rack.
East-west is connecting multiple racks.
Is that how we should think about it?
>> That's That's the proper way to think of
it, yes. Yes.
I thought NVIDIA was just a GPU
designer, but Grace is a CPU, right? So,
what does the CPU do?
So, the CPU handles a lot of the
management. So, for example, when
you're
trying to use inference and
you want your model to write some
code for you, maybe
it makes a little application, a
Python application. It needs to run
that. The Grace CPU can actually run that
application. The GPU wouldn't run an
application that's generated by a
model.
Um but it is also doing uh other kinds
of things like database analytics and
those types of functions that are more
CPU-friendly.
Uh it's able to accelerate those types
of Oh, so really the whole idea is kind
of like you have the GPUs do what
they're the best at, then you have the
CPU to do things obviously that CPUs are
much better at than GPUs, so that you can
sort of spread out the work over the
right chip for the job, right?
>> You also mentioned something called a
DPU. Can you walk us through what a DPU is?
>> So, DPU, BlueField DPU, data processing
unit. That's going to handle some of the
north-south traffic. North-south
traffic, yep. And when you're connected
to storage that's on a different rack,
there's going to be compression,
encryption,
that's all going to be managed by the
DPU that we have in BlueField-3.
And the goal for that is just to make
sure the CPU and the GPU aren't doing
those things. That's correct.
Offloading all those functions from the
CPU and the GPU, accelerating those
functions in hardware,
so that you get the fastest data access
to feed the GPUs. That makes a lot of
sense. Okay, so those are three of the
six chips so far, right? The CPU, the
GPU, and the DPU. And the And the
ConnectX.
>> Yeah, talk to me a little more about
that.
>> ConnectX-8, this is your east-west
connectivity. So, this is your SuperNIC
for connecting east-west. It also has
in-line encryption, those types of
functions for the east-west traffic
that's going to be connecting
from rack to rack of GPU racks. Got it. So,
we have the GPU, the CPU, the DPU, and
the ConnectX on this board. That's
correct. Where are the other two chips?
So, the NVLink switch is the other
chip. And there are two here on
this switch tray.
This is NVLink 5 or the fifth generation
of NVLink.
And these are these are communicating to
the NVLink network at 1,800 GB per
second. 1,800 GB
>> 1.8 TB per second. So,
very high speed,
and that's really going to be
the central nervous system of a
Blackwell GB200 NVL72. Got it. So,
these are two completely different
trays, right? So, this compute
tray, that's where the magic happens in
terms of crunching the numbers. And then
this is the switch tray, which I think
you mentioned earlier is all about just
connecting all the GPUs together. So, it
connects all the GPUs together.
There's several of these trays within a
rack. Yeah. There are 72 GPUs
in a rack. And it's
all-to-all connectivity. So, every GPU
has to be able to talk to every other
GPU at full bandwidth. And that's what
the switches achieve. So, 1.8 terabytes
per second, any GPU talking to any other
GPU. Is that why it's called a compute
fabric? Like when I think when I draw a
network diagram of That's Okay, got it.
So. So, yeah. They call it a compute
fabric not just because it's connecting
all the GPUs to each other. There's also
some compute functions in our NVLink
switch chips. So, we call that
all-reduce or collective operations, where in
training, when certain operations need to
be shared across the network, instead of
sending it to all the GPUs, it will do
some of those operations within the
switch. Oh, wow. Okay, so the switch
isn't just connecting things, it's
actually also doing some
computation as well. That's awesome.
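The in-switch reduction described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not NVIDIA's implementation and not an NVLink or NCCL API: the function names, the transfer-counting scheme, and the naive all-to-all baseline are hypothetical and exist only for comparison. Real systems typically use ring or tree algorithms rather than this naive baseline.

```python
# Toy model of an all-reduce (elementwise sum) that counts transfers.
# Hypothetical sketch: names and transfer accounting are illustrative
# assumptions, not a real NVLink or NCCL interface.

def allreduce_all_to_all(gpu_vectors):
    """Naive baseline: every GPU sends its vector to every other GPU,
    then each GPU sums all the vectors locally."""
    n = len(gpu_vectors)
    summed = [sum(column) for column in zip(*gpu_vectors)]
    results = [list(summed) for _ in range(n)]
    transfers = n * (n - 1)  # each GPU sends to the n - 1 others
    return results, transfers


def allreduce_in_switch(gpu_vectors):
    """Switch-side reduction: every GPU sends its vector once to the
    switch, the switch sums them and returns the result to each GPU."""
    n = len(gpu_vectors)
    summed = [sum(column) for column in zip(*gpu_vectors)]
    results = [list(summed) for _ in range(n)]
    transfers = 2 * n  # one upload and one download per GPU
    return results, transfers


if __name__ == "__main__":
    # 72 GPUs, matching the GPU count quoted for an NVL72 rack.
    vectors = [[gpu, gpu * 2] for gpu in range(72)]
    _, naive_transfers = allreduce_all_to_all(vectors)
    _, switch_transfers = allreduce_in_switch(vectors)
    print("all-to-all transfers:", naive_transfers)
    print("in-switch transfers:", switch_transfers)
```

Both functions return the same summed vector for every GPU; the difference is only in how many transfers the model counts, which is the point the speakers make about doing part of the work inside the switch.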
Okay, so I think we've covered five of
the chips now, right? Is that correct?
That's correct. What's the sixth
chip? The sixth is the, uh, Spectrum-X.
Can we try to take a look at
those racks? Yeah, let's go take a look.
There's 10 trays up top. Those are the
compute trays. Nine networking trays.
Nine NVLink switch trays, I should say.
And their job is to connect all the GPUs
in the 10 above and the eight below
compute trays together, right? That's
correct. So, what's up there then?
So, that is the top-of-rack
1 gigabit switch for telemetry. That's
telemetry? That's just telemetry. It's
just system management, managing
functions. It's low-speed Ethernet. It's
just a management
system for the rack itself.
It's not processing the compute data for
AI.
>> It's managing if a GPU goes down. It's
like Help me understand what telemetry
means and what that
>> Telemetry means like I'm just looking at
the the functions of the rack itself.
I'm looking at its uptime. I'm looking
at
>> Health and status, I guess.
>> Health and status checking, yes.
Diagnostics would also
>> And you mentioned that there's another
kind of rack that would sit next to
this.
So yeah, you will have your group
of compute racks, GB200 compute racks,
and then you would also have racks
dedicated to Spectrum-X east-west
network switches.
We don't have that here,
though that's how it would function.
We call it a pod. You have
maybe eight GB200 racks, and then you'll
have a few
switch racks with Spectrum-X.
>> Yeah. So that's a great overview of the
Blackwell system, right?
>> That's right. Now, I want to understand
what things changed from Blackwell
to Rubin. Okay. Can we
go over there and look at that?
>> At the trays.
So, looking at the components up
here on the wall,
we talked about, in the compute tray, the
BlueField DPU, BlueField-4.
There you can see it on the wall.
That board is part of the module
system that slides in and out of the
compute tray for serviceability.
And then, likewise, the ConnectX-9 is
there in the middle.
And there's two ConnectX-9s that are on
that board
for a total of eight in every compute
tray. So every GPU is fed 1.6 terabits
per second from the ConnectX-9s.
And then we have the
the Spectrum-X photonics co-packaged
optics. This is really, really cool.
Yeah, what is that? So instead of having
SFP pluggable modules for the
optics, they're actually built onto the
chip itself. Okay.
So, this has a huge gain in
energy efficiency and reliability,
factors more in terms of
those two factors.
>> So, before we would have fiber optic
transceivers. The optical
transceivers.
>> Yeah. So, the fiber optic cables on
either end, and those transceivers have
lasers in them. That's correct.
>> That need power, right? And that's
what you're getting rid of. And we're
packaging them with the
chip.
>> What does that actually mean in terms of
like performance or power gains?
So, in terms of performance, the
performance would be the same. But, it's
going to be the power reduction
and the reliability improvement. Cuz
those pluggable lasers can be,
you know, sometimes very unreliable.
They have to be swapped out very
frequently. But, if it's co-packaged
here on the chip, the reliability
goes up, I think, 10x better
reliability.
>> So, it's a huge difference. And where in
the rack does that live?
So, that would be in its own switch tray
uh or a switch server. And that's a
separate rack.
>> That's the side rack, right?
>> That's the separate rack that's separate
from the NVL72. So, that's the
east-west traffic switch rack.
Awesome. So, there's also Quantum-X
for InfiniBand, which is
an alternative to Ethernet. There's also
co-packaged optics for Quantum
InfiniBand as well.
>> So, those two chips are equivalent. One
is for Spectrum-X Ethernet, one is for
Quantum InfiniBand.
>> That's correct. And then you also have a
Spectrum-X Ethernet photonics switch.
So, the co-packaged
optics chip is in there, in the Ethernet
photonics switch. So, that's where the
photonics part is, the co-packaged
optics. Got it. But, these go in
the side car. These go into switch
racks. Yeah. Got it.
As well as that one, right? If you're
doing Quantum InfiniBand as your
east-west traffic protocol, then you
would use InfiniBand as a side rack.
>> So, these are, sorry, these are
equivalents. One for InfiniBand, one for
Ethernet, right? Correct. Got it. Yeah,
that's right. So, what we kind of just
talked about is what I would say is the
current state of the art for data
centers, right? Blackwell Ultra is the
one that's sort of the best in class in
data centers right now. And then Jensen
announced Vera Rubin, the six chips we
just talked about. We talked about the
Blackwell versions. This is a
substantially different compute tray
than the one we just saw. Can you walk
us through all the differences? Oh,
yeah, there's plenty. So,
what we've done is, overall, it's a
modular design. Okay. So, that means
that there's bays here and these
can just slide out and slide in and just
lock and latch. So, there's not a bunch
of wires and cabling to do all the
connectivity between all the components
that are on the tray.
Also, the hosing as well, that's been
streamlined.
>> Yeah. So, there's a manifold in the
middle,
and it manages a lot of the
distribution of liquid. So, overall on
the GB300, there were 43 hoses. There was
a bay of fans here cuz it was hybrid
cooled. The bottom half of GB300
was fan cooled.
We have eliminated that
because we're 100% liquid cooled now.
So, eight fans goes to zero fans, zero
hoses.
And then there's a bunch of cables that
have been removed as well. Uh so, it's
cable free. So,
this
I'm trying to even piece together what
I'm looking at. So, these would be where
the two super chips were in the last
generation.
>> the super chips. They slide in and out.
They latch in.
Uh so, you have the two Rubins, uh one
Vera on them. So, uh one other important
point is because it's modular now and we
have all these bays that slide in and
out and it's all connectivity with
connectors instead of cabling,
putting this together and doing assembly
on it is like 20 times faster.
>> Sure. So, something that would take 2
hours to assemble the GB300 rack, now
you can do in 5 minutes on this
particular rack.
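As a quick check on the figures quoted above, the two assembly times imply a speedup of about 24x, which is consistent with "like 20 times faster" as a rounded figure. The sketch below only restates that arithmetic; the function and variable names are hypothetical, and the two durations are taken as spoken, not from any published specification.

```python
# Speedup implied by the assembly times quoted in the interview.
# Names are illustrative; the durations are the figures as spoken.

def speedup(before_minutes, after_minutes):
    """Return how many times faster the new process is."""
    return before_minutes / after_minutes


if __name__ == "__main__":
    gb300_assembly_minutes = 2 * 60   # "2 hours"
    rubin_assembly_minutes = 5        # "5 minutes"
    print(speedup(gb300_assembly_minutes, rubin_assembly_minutes))
```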
>> And that's And that's just assembly,
right? Like if I have a maintenance
issue and I need to
>> For maintenance, right?
The speed at which you can do
servicing increases manyfold
as well.
No, it makes a ton of sense, right? If I
don't have all these wires and hoses, I
can just snap things out, fix fix
whatever the issue is, snap it back in.
And it's modular, so we'll talk
about some of the other pieces down
here. So, two super chips, Rubin, Vera.
We also have the CX9s, ConnectX-9, the
next generation of that SuperNIC,
over on these boards, in modules. So,
before they were connected to the bottom
of the super chip on GB300, but now
they're their own module, and cards slide
in and out. So, you can service
different components now separately.
Yeah.
>> And then BlueField 4, the new generation
of the DPU, is also a module here that
slides in and out. Got it. So, this is
not just about performance, it's also
about more uptime, right? So, that's
another multiplier on the overall output
of an AI factory is how much uptime you have.
>> We call that goodput. Like, you want
the amount of time that you're
actually producing tokens, you want to
maximize that.
>> Yeah.
That makes sense. So, okay, this is the
equivalent compute tray. That's right.
And then there's also an equivalent
switch tray, right? That's correct. And
this looks a lot more streamlined, too.
So, walk me through the changes here.
So, in terms of the changes here,
you know, we have the switches at
the top, 100% liquid cooled. There's
four switch chips. This is NVLink 6.
Okay.
>> Sixth generation NVLink, twice the speed
of what we had in Blackwell.
>> Twice the speed. Wow. So, now it's 3.6
terabytes per second. And that's just
going to help us with that
performance I talked about, 10x
performance per watt, or per megawatt, or per
gigawatt, whatever value you want.
The increase in NVLink speed is
part of what contributes to that, along
with some other GPU features that we can
talk about as well. And is
it the same number of total GPUs in a
Blackwell rack versus a Rubin rack?
>> It is. So, it's NVL72. The 72
signifies the GPU count. So, GB200 NVL72,
and now we have Vera Rubin NVL72.
Same GPU count. And it also makes it
very compatible for our
customers to move from one to the
other.
Uh, and that's part of the goal of
having the same GPU count, same kind of
MGX rack architecture.
So, that just makes it easier
for our customers. The ecosystem has, you
know, been working with these racks for
two generations now. Now we have a third
generation. They're just going to be
able to work very fast and deploy uh, at
a very high rate with our end customers.
>> No, it makes total sense. Okay, can we
go look at a Vera Rubin rack now?
>> Yes.
So, this is
the Vera Rubin NVL72 rack. You can see
that, you know, it's very
similar in form and in look to the
GB200. The biggest
difference is on the compute trays
you'll see there's no vents. So, there
were vents on the GB200 cuz the bottom
half of the compute tray still had fans.
Okay, yeah. And then we got rid of those
fans. It's all 100% liquid cooled on the
compute trays. That's why you see in the
face plate you don't see those vents
anymore. Got it. Uh, but overall still,
you know, still the nine uh, switch
trays,
still 10 compute trays on top and the
eight on the bottom. Same kind of still
telemetry on top. Still top of rack
telemetry with the one gig switch on
top.
>> Now here's the big question, right? From
Blackwell to Rubin
at the rack level, talk to me about the
performance gains at the rack level.
>> Performance gain at rack level is the
10x. 10x?
>> The 10x more tokens per second per
megawatt or per watt,
and that's going to be a rack-level
kind of performance metric. And
that's with a mixture-of-experts model,
something like Kimi K2 Thinking,
which is a very large model, over a
trillion parameters. And that is
going to fit and be optimized in a
single rack. Thanks to the
NVLink switch, the experts in a
mixture-of-experts model are distributed across
the 72 GPUs. And that can deliver factors
more performance in tokens per second.
So, here we have the Kyber rack. So,
this would be for the Rubin Ultra
generation. Subsequent to Rubin, which
is a 2026 product, in 2027 we'll have uh
Rubin Ultra.
So, that's going to be a different rack
architecture than we've had for the
previous three generations. Uh we're
putting much more compute. Yeah, I'm
noticing a lot more trays in this one.
So, we have
18 compute trays in each of these
canisters. So, there's four canisters,
up to 72 GPUs in each of the canisters.
So, you would have 288. 288 So, moving
from 144 to 288 or is that 72?
>> 72
>> to 288. Okay, so it's a 4x increase in
GPUs.
>> So, each of these canisters, the four I
talked about, is equivalent to the whole
rack over here.
>> So, there's four racks worth of GPUs
>> Racks of NVL72s worth of compute in
here. So, very high compute density.
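The GPU counts mentioned in this walkthrough can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in the interview (two superchips per tray, two GPUs per superchip, 10 + 8 compute trays, four canisters of up to 72 GPUs, 18 blades per canister). The constant and function names are hypothetical, and the GPUs-per-blade figure is an inference from the quoted totals, not something the speakers state.

```python
# Cross-check of the GPU counts quoted in the interview.
# All names are illustrative; the numbers are as spoken.

GPUS_PER_SUPERCHIP = 2           # two GPUs plus one CPU per superchip
SUPERCHIPS_PER_TRAY = 2          # two superchips per compute tray
COMPUTE_TRAYS_PER_NVL72 = 10 + 8 # 10 trays on top, 8 on the bottom

CANISTERS_PER_KYBER_RACK = 4     # "there's four canisters"
GPUS_PER_CANISTER = 72           # "up to 72 GPUs in each"
BLADES_PER_CANISTER = 18         # "18 compute blades in each"


def nvl72_gpu_count():
    """GPUs in one NVL72 rack, built up from trays and superchips."""
    return COMPUTE_TRAYS_PER_NVL72 * SUPERCHIPS_PER_TRAY * GPUS_PER_SUPERCHIP


def kyber_gpu_count():
    """GPUs in one Kyber rack, using the quoted per-canister maximum."""
    return CANISTERS_PER_KYBER_RACK * GPUS_PER_CANISTER


def gpus_per_blade():
    """Inferred, not stated: GPUs per compute blade in a canister."""
    return GPUS_PER_CANISTER // BLADES_PER_CANISTER


if __name__ == "__main__":
    print("NVL72 GPUs:", nvl72_gpu_count())
    print("Kyber GPUs:", kyber_gpu_count())
    print("Kyber / NVL72:", kyber_gpu_count() // nvl72_gpu_count())
    print("GPUs per blade (inferred):", gpus_per_blade())
```

Under these figures, one canister matches one NVL72 rack, which is the equivalence the speakers describe.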
>> Yeah. Um and that's why the architecture
is different. It's a blade type of
architecture rather than a tray
architecture. Uh so, we have 18 uh
compute blades uh in each of the
canisters. Excuse me, sorry. These are
all compute then? This is all compute on
the front.
>> Yeah. On the back is where the switch
blades are, for the NVLink
connectivity. Got it. And so, that's
What is the performance leap that you
guys are expecting from Rubin to Rubin
Ultra in the Kyber rack?
So, we haven't given any of the
performance yet on Rubin Ultra, but it's
going to be
factors more performance, as usual
between our generations. Just because
you're going to have performance
increases at the chip level, at the
superchip level, at the rack level, and
you're going to have four times as many
>> Extreme co-design all over again, right?
Extreme co-design, all the chips being
designed for greater performance,
working in concert,
being designed from scratch together.
Are we expecting extreme co-design of
all six chips for every generation from
now on? We should expect to see six new
chips? So, for every generation, there's
going to be a new generation of GPU
every year.
Now, whether all six are going to be
co-designed every year, that's that's
probably not going to be the case, but
you're going to see, for the
flagship start of each generation
like Rubin, six new chips,
some other new chips that go with Rubin
Ultra, but not the entirety of all six.
>> we might see the Vera CPU, but the Rubin
Ultra GPU.
>> Exactly.
>> Got it. Exactly. Got it. Yeah. I'm super
excited for this. I can't wait to see
what this looks like. When can we
expect to learn a little more about
this? Is this something that we'll learn
about this year, next year?
So, yeah, it'll be something that Jensen
talks about, you know, in the
coming year. Uh,
I don't have a specific date, but yeah.
I'm super excited for it, man. What are
you looking forward to the most? Like,
what excites you the most as you see
like this rapid evolution year over year
and generation over generation? So,
the amount of innovation with
the extreme co-design, that's what's
most impressive. Yeah. Right. So,
there's only so much, and Jensen talked
about this, that you can do moving from
one GPU generation to a next. Process
technology can only improve so much.
You know, it's not factors more
improvement in the number of
transistors that you get going from one
generation to the next. So, for example,
between Vera Rubin and Blackwell, it's
about 70% more
transistors. Yeah. In terms of all the
different chips that we co-design.
But, we're getting the 10x more
performance per watt. So, if you were
just relying on Moore's law, it would only be a 70%
jump, not a 1,000% jump. Yeah, not
a 10x, yeah. So, this kind of
all these different chips being designed
together, working together to maximize
that performance, that's the most
amazing thing about this generation and
the future generations. Yeah. That's
really exciting.
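The comparison between transistor growth and performance growth above can be made concrete. The sketch below converts between percent increases and multipliers and estimates how much of the quoted 10x would have to come from something other than transistor count. It assumes, as a deliberate simplification, that performance per watt would scale linearly with transistor count; the function names are hypothetical. One small clarification falls out of the arithmetic: a 10x gain is a 900% increase, so "1,000%" in the conversation is a round figure.

```python
# Percent-increase vs. multiplier arithmetic for the figures quoted above.
# Assumption (simplification): performance scales linearly with
# transistor count. Function names are illustrative.

def multiplier_from_percent_increase(percent):
    """A 70% increase corresponds to a 1.7x multiplier."""
    return 1 + percent / 100


def percent_increase_from_multiplier(multiplier):
    """A 10x multiplier corresponds to a 900% increase."""
    return (multiplier - 1) * 100


def gain_beyond_transistors(perf_multiplier, transistor_percent_increase):
    """Factor left over after accounting for transistor growth alone."""
    return perf_multiplier / multiplier_from_percent_increase(
        transistor_percent_increase
    )


if __name__ == "__main__":
    print(percent_increase_from_multiplier(10))
    print(round(gain_beyond_transistors(10, 70), 2))
```

Under that simplification, roughly a 5.9x factor is left to be explained by design choices rather than by transistor count, which is the argument being made for co-design.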
Thanks so much for your time.
A huge thank you to Joe Dalaire for
breaking down Nvidia's Blackwell
ecosystem, giving us an inside look at
Rubin, and explaining how it will all
make AI models faster, smarter, and more
efficient. Not just language models, but
everything from image and video models
to medicine, robotics, and so much more.
And if you want to really understand the
science behind this stock, join me at
Nvidia GTC. You can register for free
with my links below, and jump into as
many online sessions as you like. I'll
announce the winner of that RTX 5090
giveaway a few days after the
conference, so make sure to enter.
Another huge thank you to Nvidia for
sponsoring my travel and my media access
to cover GTC live, and to you for
supporting the channel. Thanks for
watching, and until next time, this is
Ticker Symbol: YOU. My name is Alex,
reminding you that the best investment
you can make
is in you.
This video features an exclusive look at Nvidia's AI infrastructure, led by Joe Dalaire, who explores the transition from the Blackwell to the Vera Rubin chip generations. The discussion highlights Nvidia's philosophy of 'extreme co-design,' where six distinct chips (GPU, CPU, DPU, ConnectX SuperNIC, NVLink switch, and Spectrum-X Ethernet switch) are developed in unison to enhance performance, energy efficiency, and data center scalability for modern AI models. The interview also covers technical improvements like 100% liquid cooling, modular design for faster serviceability, and future roadmap developments like the Rubin Ultra architecture.