E21: NVIDIA'S HUGE AI Chip Breakthroughs Change Everything

Transcript

0:00

I'm really excited to show you some big

0:02

insights I just learned about Nvidia.

0:04

Most investors think Nvidia builds chips

0:06

for AI training, but I'm going to give

0:08

you an exclusive look at a very

0:10

different side of the story. I'm joined

0:12

by Dion Harris, Nvidia's senior director

0:14

of high performance computing, cloud,

0:17

and AI infrastructure go to market. Dion

0:19

has been with Nvidia for 9 years

0:22

deploying the hardware and software

0:23

infrastructure powering some of the

0:25

biggest AI models on the planet, not

0:28

just for training, but also for

0:29

inference. I asked Dion as many in-depth

0:32

questions as I could, and he had a few

0:34

surprising things to say about where the

0:36

AI market could be headed next. Your

0:39

time is valuable, so let's get right

0:41

into it. I'm going to jump right into

0:43

the hard questions if you don't mind.

0:44

You know, when I think about Nvidia, I

0:46

think Nvidia is really widely known for

0:49

its leadership in AI training. But for

0:51

investors who are newer to AI, could you

0:53

kind of explain how Nvidia's GPUs and

0:56

their infrastructure at large support

0:58

all the phases of AI?

1:00

>> Yeah. So that's a very, very sort of

1:03

key observation is like when AI really

1:05

first popped on the scene, there was

1:07

really a race to create, you know, the

1:09

sort of most capable foundational

1:11

models, right? And so that's

1:12

where you had models like ChatGPT, you

1:13

had Claude, you had a number of other

1:16

foundational models that were really

1:17

being built and trained at scale. And so

1:20

training is exactly like it sounds. It's

1:22

teaching the model foundational

1:24

knowledge about the world. That's where

1:26

you hear models being trained on the

1:28

entire content of the internet, for

1:29

example. And it's really just helping

1:31

the models learn and understand, you

1:34

know, semantic language, learn and

1:35

understand different meanings across

1:37

different modalities. And so that's

1:39

really foundational knowledge. And then

1:41

the next part of providing intelligence

1:44

is what we call post-training. And so

1:46

that's once you've taken a foundational

1:48

model that has very base foundational

1:50

knowledge. So just like the name

1:52

describes, but then you inject

1:54

additional specialized knowledge. So it

1:56

might be post-training a foundational

1:58

model to understand your particular

2:00

industry. For example, healthcare has

2:02

very specific terminology, very specific

2:04

uses of certain words. So when you say

2:06

"cell" in healthcare, it means

2:08

something different than when you say

2:10

"cell" in the legal profession, for

2:12

example. So understanding the nuances of

2:16

you know a specific industry or specific

2:18

company that's really where you're

2:20

injecting more specialized intelligence

2:23

within your post-training phase.

2:25

And then inference is really where you

2:28

put AI to use. In other words, you've

2:30

trained a model, you've fine-tuned a

2:32

model, and now you want to actually use

2:34

it with end users, with customers, with

2:36

partners, and really deploy and extract

2:38

the value out of AI. And so that's

2:40

really where we are right now in terms

2:42

of all the investment in AI

2:44

infrastructure. A lot of it in the early

2:46

days was to build training clusters. Now

2:48

that AI has reached sort of a critical

2:50

tipping point in terms of intelligence

2:52

and utility, we're seeing it deployed

2:55

for inference at scale. And I think the

2:58

real big insight you just said there is

3:00

it's not even just two steps, right?

3:02

It's not just training and inference.

3:04

The post-training step is incredibly

3:06

important as well. You know, I've been

3:07

diving deep into AI for a while now. And

3:09

one thing I've realized is that

3:11

inference is more than one step as well.

3:13

So it's not just training, post-

3:14

training, and inference. For example,

3:16

there's prefill and decode, each of

3:18

which has very different computing

3:20

requirements. Can you sort of explain

3:21

what these phases are, how Nvidia

3:23

addresses each one

3:24

individually?

3:25

>> Yeah, sure. So, like you described,

3:27

inference itself can be decoupled into a

3:30

couple of different workloads. There's

3:32

prefill which is really where you're

3:34

taking the context of the query. You're

3:37

processing and understanding what the

3:39

user wants. So it might be just the

3:41

question or the prompt they enter. It

3:42

might be the document that they upload

3:44

so that you can draw upon that document

3:47

as source information to build a

3:50

response. Or it may be previous

3:52

responses that the user has

3:54

generated or created over the life of

3:56

engaging with that model. So the prefill

3:58

is really about understanding the

4:00

context for the question or for the

4:02

prompt being asked. Once it does that,

4:05

it can generate a single token and then

4:07

you move into the decode phase. The

4:09

decode phase is actually doing the

4:12

autoregressive prediction of each token

4:14

that comes as a result of the AI

4:17

generation model. And so the

4:19

reason why that's really interesting is

4:21

when you think about those specific

4:23

workloads, they have slightly different

4:26

infrastructure requirements if you will.

4:27

So when you think about the prefill or

4:29

the context-heavy phase, it's really compute

4:31

heavy. It's really focused on

4:33

understanding all the different um

4:35

tokens that are being put into the

4:37

system to formulate contextual

4:38

awareness. However, when you look at

4:40

decode, it's much more memory latency

4:44

bound. And so it gives a lot more um

4:47

credence to having things like HBM or

4:49

high bandwidth memory to really generate

4:51

those tokens very quickly. Now the other

4:54

thing to really think about is there's

4:56

no one-size-fits-all, right? Each model,

4:59

each user profile might have a different

5:01

balance or mix of prefill and decode. And

5:04

so that's part of what makes

5:06

inference particularly challenging is

5:08

that you don't have a pre um you know

5:10

sort of prescriptive way of

5:12

understanding exactly what that mix

5:13

should be because it's going to vary

5:15

depending on users depending on the type

5:17

of requests that they're

5:19

asking. So for example, if they're doing

5:21

a deep research project, it's going to

5:23

be a lot more prefill heavy than decode

5:26

because you need to go through and sift

5:27

through all those PDFs that you might

5:29

upload to go and understand all the

5:31

contextual information. However, if

5:34

you're going to ask the AI to produce,

5:36

you know, a long, you know, in-depth

5:38

code base, it might be more decode heavy

5:40

because it really needs to understand,

5:42

you know, just all the interdependencies

5:44

and run through lots of reasoning chains

5:46

to create high quality code. So again,

5:49

it's going to vary depending on on the

5:51

use case and the actual user sort of the

5:53

user objective and intention.
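
A back-of-the-envelope sketch of why prefill tends to be compute heavy while decode leans on memory, as described above. It assumes the common rule of thumb of roughly 2 FLOPs per parameter per token for a forward pass, and it ignores attention cost, KV-cache traffic, and batching. The model size, weight precision, and prompt length are hypothetical numbers chosen only for illustration; they are not Nvidia figures.

```python
def forward_flops(params: float, tokens: int) -> float:
    """Rough forward-pass cost: about 2 FLOPs per parameter per token (assumption)."""
    return 2.0 * params * tokens


def weight_bytes(params: float, bytes_per_param: float) -> float:
    """Bytes moved to stream every weight through the processor once."""
    return params * bytes_per_param


def arithmetic_intensity(params: float, tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in a single forward pass."""
    return forward_flops(params, tokens_per_pass) / weight_bytes(params, bytes_per_param)


PARAMS = 70e9           # hypothetical 70B-parameter model
BYTES_PER_PARAM = 2.0   # assuming 16-bit weights
PROMPT_TOKENS = 8192    # hypothetical prompt length

# Prefill: the whole prompt goes through the model in one pass.
prefill_ai = arithmetic_intensity(PARAMS, PROMPT_TOKENS, BYTES_PER_PARAM)

# Decode: one new token per pass (single request, no batching).
decode_ai = arithmetic_intensity(PARAMS, 1, BYTES_PER_PARAM)

print(f"prefill: {prefill_ai:.0f} FLOPs per weight byte")
print(f"decode:  {decode_ai:.0f} FLOPs per weight byte")
```

Under these assumptions, prefill performs thousands of FLOPs for every byte of weights it reads, while single-stream decode performs about one, which is why decode gains more from fast memory such as HBM.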

5:55

>> Yeah. So let me let me make sure I

5:57

understand this just for myself too. So

5:59

prefill is really about understanding

6:02

all of the context at once, right? So

6:06

all the PDFs I upload, if I have a very

6:08

detailed prompt, if I give an AI tons and

6:11

tons of links, you know, so the bigger

6:13

the context window and the more I fill

6:14

that up, the more compute intensive the

6:16

prefill step is as opposed to the decode

6:20

which is more about, you know,

6:22

understanding the tokens in sequence,

6:24

adding the next token, then reanalyzing

6:27

that whole sequence to add the next

6:28

token, which is very different from

6:30

understanding all the data at once. Is

6:32

that like a fair high-level overview?

6:35

>> Great, great summary. So I appreciate

6:37

that. Yes, you always do a great job of

6:38

taking my long-winded examples and

6:41

making them very, very understandable and

6:44

distilled. So I appreciate that.
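
The prefill/decode split summarized above can be sketched as a toy loop. This is only an illustration of the control flow: `toy_next_token` is a hypothetical stand-in for a real model's forward pass, and the plain Python list stands in for the cached context a real serving stack would keep.

```python
def toy_next_token(cache):
    """Hypothetical stand-in for a model forward pass over the cached context."""
    return (sum(cache) * 31 + len(cache)) % 100


def prefill(prompt):
    """Process the whole prompt in one pass and emit the first output token."""
    cache = list(prompt)            # stand-in for the cached context
    first_token = toy_next_token(cache)
    return cache, first_token


def decode(cache, first_token, max_new_tokens):
    """Autoregressive loop: each new token is appended and conditions the next."""
    output = [first_token]
    cache.append(first_token)
    while len(output) < max_new_tokens:
        next_token = toy_next_token(cache)
        output.append(next_token)
        cache.append(next_token)
    return output


def generate(prompt, max_new_tokens):
    """Run prefill once, then decode one token at a time."""
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    cache, first_token = prefill(prompt)
    return decode(cache, first_token, max_new_tokens)


print(generate([1, 2, 3], 3))
```

Prefill touches every prompt token once, in a single pass; decode runs once per generated token, so a long answer means many sequential passes even when the prompt is short.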

6:46

>> No, no, no. And sorry, I I want to just

6:48

make sure like we're really clear about

6:49

what we're talking about because, you

6:51

know, with those very different compute

6:53

requirements comes the need for very

6:55

different hardware, right? So I know

6:57

that Nvidia recently announced their

6:58

Rubin CPX GPU, which is a GPU

7:02

specifically for the prefill step. So

7:05

can you tell us a little bit about the

7:06

CPX? >> When we think about CPX, it's

7:10

really purpose-built for what we call

7:12

the million context workloads. So that's

7:15

for things like advanced code

7:17

generation, right? Where you have lots

7:20

of data that you need to input. You

7:22

might have an entire application

7:24

codebase and if you want to understand

7:25

all the interdependencies how they

7:28

actually deliver you know sort of the

7:30

end-to-end optimization, you would want to

7:33

understand the full million tokens when

7:36

you're generating the prompt or

7:37

generating the code. There's also things

7:38

like video generation right where you

7:41

need to have contextual consistency and

7:43

awareness. So, if you have a two-hour

7:45

video, if you have a million tokens, you

7:47

can make sure that as you move

7:49

throughout the scenes, you can, you

7:50

know, keep the scene integrity and make

7:52

sure the characters have consistency

7:54

and different elements of the story play

7:56

through because you have this large

7:58

context window. So, again, these are

8:00

just some cutting edge use cases that

8:03

we're starting to see that will rely on

8:06

what we call the million context

8:08

workload.

8:09

>> Yeah. So, so what I'm really taking away

8:12

from this is prefill is very compute

8:15

intensive but not very memory intensive.

8:17

So the CPX, the goal of the CPX is to

8:20

provide all that compute but lower the

8:22

cost of the memory since you don't need

8:24

it at that step yet anyway. So then my

8:26

next question is will we see a separate

8:29

chip with the opposite specs for the

8:31

decode phase? Right? Something that's

8:33

very memory heavy but compute light. >> So

8:37

what I would say is when you look at

8:39

sort of how we've designed um our Rubin

8:42

platform you know most I would say 80%

8:44

of the workloads will do great on a

8:47

classic Vera Rubin platform but we think

8:49

for these large or what we call extreme

8:51

massive context workloads like I

8:53

mentioned like some of the code

8:54

generation video generation and others

8:57

those are ones where you want to

8:58

actually specialize where you can really

9:00

get some significant bang for the buck

9:02

by actually having a specific prefill

9:05

built processor.

9:07

>> Got it. So Rubin is already great

9:09

at decode and it was the prefill step

9:12

that needed a separate phase

9:15

optimized chip, for lack of a better

9:16

word.

9:17

>> Exactly. Exactly.

9:18

>> That's awesome. That really helps me

9:19

understand like the, you know, the whole

9:21

ecosystem. And I guess that's a really

9:23

good lead-in to another question I had.

9:26

You know, when most investors think

9:28

about AI, they think about uh AI

9:30

performance really being driven by the

9:32

GPU, right? So when we're talking about

9:33

the Rubin versus the Rubin CPX, we're

9:36

talking about prefill and decode being

9:39

driven by primarily these two GPUs,

9:41

right? But you know, Jensen just gave a

9:44

great keynote at GTC DC. And he talked

9:46

about how Nvidia co-designs at least

9:49

five other chips every generation,

9:51

right? A CPU, a DPU, NVLink. Can you

9:54

help explain how all these other chips

9:56

fit together? >> Yeah, no, and

9:58

that's a great question because when you

10:00

look at sort of the generational

10:02

leaps that we're describing in a lot of

10:04

our platforms um you can't get there

10:07

just by adding more transistors to a

10:08

single processor. You know, Moore's

10:11

law has sort of tapered out

10:13

decades ago and so we recognize in order

10:16

to create these sort of massive leaps in

10:19

performance it requires what we call

10:21

extreme co-design, and that involves

10:23

not just looking at the GPU itself but

10:26

like you mentioned the CPU making sure

10:28

that you have tightly coupled um access

10:31

to memory and processing power across

10:32

the CPU and GPU it actually leverages

10:35

the BlueField, or the data

10:36

processing unit as you're moving

10:38

information not just within the GPUs but

10:41

moving it to and from storage or getting

10:44

access to the node itself. So having

10:46

integrations with the software to make

10:48

sure it takes advantage of not just the

10:49

CPU and GPU but also the DPU. And then

10:52

of course when we think about what's

10:54

happening with the scale of AI, it's no

10:57

longer fitting into a single GPU or even

10:59

a single node. So the scale up

11:02

architecture is critically important

11:04

now. In other words, being able to have

11:06

several GPUs, CPUs, and processors

11:09

behave as one. And so that's where our

11:12

NVLink Switch technology comes in. We

11:14

build a specific chip around the switch

11:17

technology that allows for

11:18

seamless scale up. And then, of course,

11:20

it's not just scale up, you have to be

11:22

able to scale out. And so, we've

11:24

developed our Spectrum-X Ethernet

11:26

switching technology to scale out, you

11:28

know, to hundreds of thousands and

11:30

millions of GPUs. And then we also have

11:32

our InfiniBand network that also allows

11:34

you to scale out. So again, it's really

11:35

just depending on how customers

11:37

choose to scale, but giving both options

11:39

is really core to our platform. And I

11:42

think when you look at all of these

11:43

elements together, you know, that's

11:45

really the core of how we think about

11:46

building systems, not just on a single

11:48

chip, but looking at the full system, from

11:51

compute to networking. And then on top

11:53

of that, you also have to think about

11:55

how do you integrate with the models?

11:57

How do you, you know, work with the open

11:59

source community to get them to build

12:01

models and software that takes advantage

12:04

of all the underlying hardware? So, I'll

12:06

give an example. We released

12:08

Blackwell with a precision called

12:11

NVFP4. NVFP4 is useless unless you teach

12:15

the software to understand, you know,

12:17

how to use that lower precision smartly

12:19

or intelligently. And so, a lot of the

12:21

work that we do in terms of this extreme

12:23

co-design doesn't just stop at the

12:25

hardware layer. It actually reaches into

12:27

the model developers and builders to

12:29

make sure that we're helping them

12:31

leverage all of the different hardware

12:33

innovations that we're making available

12:34

through our architecture. So, like I

12:36

said, there's co-design happening at

12:38

the CPU, the GPU, the DPU, the

12:41

networking scale up as well as the

12:43

networking scale out in addition to the

12:45

models and applications themselves. And

12:47

that's why we've described this as an

12:49

annual rhythm. And that is sort of, you

12:51

know, for two key reasons. Models are

12:54

evolving so quickly. Therefore, we have

12:57

to evolve our platform just as quickly

12:59

to make sure that we're keeping pace and

13:01

unlocking sort of the next wave of use

13:03

cases. So for example, like I said, we

13:05

talked about Blackwell last year and

13:08

we've already announced Blackwell Ultra

13:10

and of course we've rolled out Vera

13:12

Rubin and we've already rolled out Vera

13:14

Rubin CPX. And so again, it's really

13:17

that sort of yearly cadence that

13:19

gives us the ability to keep

13:21

leapfrogging ourselves and providing

13:23

more performance and more value. So

13:25

that's really the core sort of

13:28

focus of our strategy and how we

13:29

want to, you know, deliver this to the

13:31

market. >> It's an insane pace and it's so

13:33

cool to see the evolution of the

13:34

hardware year after year after year. I'm

13:36

really excited uh for next March when I

13:39

hopefully get to touch the Rubin chips

13:42

for the first time, you know, and see

13:43

like the Rubin version of the stack

13:44

we've been going through ever since

13:46

Hopper. So, you know, you're describing

13:48

some massive technical wins here, but as

13:51

an investor, I think what we all really

13:53

want to understand is how AI relates to

13:56

real businesses, right? So, can we take

13:58

a step back and talk about why

14:00

businesses should care about these kinds

14:02

of inference speeds and compute

14:04

efficiencies we've been talking about

14:05

this whole time? >> You know, inference is

14:07

how you extract the value from AI. In

14:10

other words, um building a model doesn't

14:14

create value unless you can use it and

14:16

apply it to solve business problems,

14:18

right? And so, inference is really where

14:21

the rubber hits the road in terms of

14:22

getting AI to solve a business problem.

14:25

And so once you take a step back and

14:26

say, okay, if you're leveraging AI to go

14:29

and solve a business problem and if

14:31

you're doing that at scale, this is what

14:33

we call an AI factory. And so, you know,

14:37

going back to, you know, the turn of the

14:39

century, the factories were about, you

14:41

know, putting raw materials in and

14:43

getting some finished product out.

14:46

Today, when we say factories, it's about

14:48

putting energy and electricity in and

14:50

systems and components in and getting

14:52

intelligence out. And so when we

14:54

describe some of these performance

14:55

improvements, think of it as, you know,

14:58

how much more intelligence can I produce

15:00

per dollar or per watt. And once you

15:04

think about it in those terms, you really

15:06

quickly begin to see that efficiency is

15:08

really the biggest driver on how you're

15:11

going to get a return on your AI

15:13

investment. So to the extent that you

15:15

can improve your overall inference your

15:17

performance per watt for example if

15:19

you're a power limited data center which

15:21

most data centers are today, you

15:23

know you're trying to think okay how can

15:25

I get more intelligence out of that

15:28

power envelope and so a lot of these

15:30

sort of improvements that we describe

15:31

where we're describing the X factors

15:33

these really

15:35

translate into actual dollars and cents

15:37

in terms of how many more tokens can be

15:38

generated and therefore how much value

15:40

can be extracted out of that AI factory.
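
A rough sketch of the performance-per-watt arithmetic described above, for a data center with a fixed power envelope. Every number here (power budget, tokens per second per megawatt, price per million tokens) is a hypothetical placeholder, not a figure from Nvidia or from this conversation; only the shape of the calculation is the point.

```python
SECONDS_PER_DAY = 86_400


def tokens_per_day(power_mw: float, tokens_per_sec_per_mw: float) -> float:
    """Tokens a power-limited site can generate in one day."""
    return power_mw * tokens_per_sec_per_mw * SECONDS_PER_DAY


def revenue_per_day(tokens: float, price_per_million_tokens: float) -> float:
    """Revenue if every generated token is sold at a flat price."""
    return tokens / 1_000_000 * price_per_million_tokens


POWER_MW = 100.0                # hypothetical fixed power envelope
BASE_TOKENS_PER_SEC_PER_MW = 1_000_000.0   # hypothetical baseline efficiency
PRICE_PER_MILLION = 2.0         # hypothetical price in dollars
PERF_PER_WATT_GAIN = 10.0       # hypothetical generational improvement

base_tokens = tokens_per_day(POWER_MW, BASE_TOKENS_PER_SEC_PER_MW)
new_tokens = tokens_per_day(POWER_MW, BASE_TOKENS_PER_SEC_PER_MW * PERF_PER_WATT_GAIN)

base_revenue = revenue_per_day(base_tokens, PRICE_PER_MILLION)
new_revenue = revenue_per_day(new_tokens, PRICE_PER_MILLION)

print(f"baseline: {base_tokens:.3e} tokens/day, ${base_revenue:,.0f}/day")
print(f"upgraded: {new_tokens:.3e} tokens/day, ${new_revenue:,.0f}/day")
```

Because the power budget is held constant, any gain in tokens per watt flows straight through to tokens per day, and at a flat price to revenue per day.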

15:43

And if you happen to be an AI factory

15:46

that's producing tokens and receiving

15:48

money for tokens, it is a direct

15:50

correlation in terms of how much

15:52

throughput can you generate for a given,

15:54

you know, power envelope or a given sort

15:56

of investment value. And that has a

15:59

direct correlation with how much revenue

16:01

and therefore profit you can generate

16:03

from that AI factory. >> That's really

16:05

interesting. So we should see that come

16:07

up in companies' revenues and profit

16:10

margins as they start leveraging more

16:11

and more AI for more and more use cases

16:14

as the cost comes down. But profits and

16:17

margins are really something you see in

16:19

the rear view mirror, right? So one of

16:21

the questions I have is like I try to

16:23

look at forward-looking indicators as an

16:25

investor. What benchmarks or metrics can

16:27

we focus on to better understand like

16:29

the real business value for inference in

16:31

real time? >> As you drive more

16:33

performance, more throughput per dollar,

16:36

per watt, that actually reduces the cost

16:39

per token. And when you reduce the cost

16:42

per token, you can actually embed that

16:46

AI into even more services, even more

16:49

use cases, and therefore deliver more

16:52

value to your end users. When you think

16:54

about AI, it's a lot more than LLMs. So

16:57

it includes image classification. It

16:59

includes video generation or diffusion

17:01

models. It includes um you know lots of

17:04

different types of recommender systems

17:06

that are being used to serve ads and

17:09

content. And so when you think about you

17:11

know today where we are we're in a

17:13

fairly you know demand driven economy

17:16

means there's a huge demand for a lot of

17:18

these AI capabilities. But again you

17:21

have to be able to do it intelligently

17:22

and smartly. If you can drive the cost

17:25

down to zero, now you can

17:27

literally embed these AI APIs into every

17:30

application that you're running. And

17:32

therefore, that's when you really start

17:33

to see this ubiquitous use of AI. And so

17:36

that's really why when we think about

17:38

how we want to drive more performance

17:40

and more efficiency,

17:42

the cost per token going down by 10x

17:45

will actually increase the overall

17:47

utilization by 20x because now you have

17:50

a lot more use cases where you can

17:52

afford to embed these AI capabilities.
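
A quick worked version of the arithmetic above, taking the speaker's illustrative figures (cost per token down 10x, utilization up 20x) at face value. The elasticity line is an added interpretation, not something stated in the conversation.

```python
import math

COST_DROP = 10.0    # cost per token falls by this factor (speaker's illustrative figure)
USAGE_GAIN = 20.0   # token usage rises by this factor (speaker's illustrative figure)

# Total spend = cost per token * tokens used, so the ratio is usage gain over cost drop.
spend_ratio = USAGE_GAIN / COST_DROP

# Implied constant price elasticity of demand (added interpretation):
# usage scales as cost ** (-elasticity).
implied_elasticity = math.log(USAGE_GAIN) / math.log(COST_DROP)

print(f"total spend changes by {spend_ratio:.1f}x")
print(f"implied elasticity of demand: {implied_elasticity:.2f}")
```

If usage really grows faster than cost falls, total spending on tokens goes up rather than down, which matches the claim that cheaper inference expands the market instead of shrinking it.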

17:55

>> Yeah. Right. For every, you know,

17:57

dollar the cost goes down, the demand

18:00

goes up by more than that same amount.

18:02

Right? Because now >> Exactly, exactly. >> maybe

18:04

use cases that couldn't afford it at all

18:05

can now jump in and so you're increasing

18:08

like the whole surface area of AI

18:09

overall. >> I think that's a really

18:11

important point to understand about

18:13

inference is there's a lot of levers,

18:15

there's a lot of complexity and so it

18:18

really is in some ways harder than

18:21

training. I think oftentimes people um

18:23

assume, you know, that Nvidia has an

18:25

advantage because we've been doing

18:26

training for so long but we think our

18:29

true advantage lies in our ecosystem and

18:31

our software maturity that really was

18:34

going to really go and tackle inference

18:35

and make you know make our platform even

18:37

more valuable in inference than it is in

18:40

training in a lot of ways.

18:41

>> Yeah. And I mean, just sort of two

18:44

points there right? One, inference means

18:46

something very different than it did

18:47

three years ago when ChatGPT first came

18:50

out, right? One shot

18:52

back then versus reasoning today. Now

18:54

inference is a much more compute

18:55

intensive workload. And two, something

18:58

that we hear Jensen say all the time is

19:00

inference and training are actually

19:02

going to one day be one process, not two

19:05

separate processes. What do you think

19:07

about that statement?

19:08

>> I think it's a pretty fair statement and

19:10

I would even correct it a little bit.

19:12

I would think that "one day" is

19:14

actually today. And in fact, if you look

19:16

at how most reasoning models, and you

19:18

hit on this earlier, the way that you

19:19

create a reasoning model, you train it

19:22

with lots of inference. And so

19:24

it's giving you this iterative feedback

19:26

loop by giving it and teaching it

19:29

how to reason and rationalize by

19:31

leveraging inference outputs and then,

19:34

you know, feeding that back in, that

19:35

backprop that happens during training. So it

19:37

really does leverage inference while

19:39

you're delivering the training

19:41

capabilities as well. So those workloads

19:44

or processes are already starting to

19:46

merge to the point where they're

19:48

indistinguishable quite frankly.

19:50

>> You know, I find it so crazy how fast AI

19:54

is moving. Jensen was on a recent

19:56

podcast where he said that the demand

19:58

for inference will rise by more than a

19:59

billionx over the next few years. So,

20:02

you know, we talked a lot about Nvidia's

20:04

platforms today, but what is Nvidia

20:06

doing to keep up with the insane growth

20:08

in demand? Like, how can we expect

20:10

Nvidia to keep up with next year and the

20:13

year after that and the year after that?

20:15

>> Well, I mean, if I had to put it into,

20:17

you know, two words, it would be extreme

20:19

co-design. And one thing I just wanted

20:21

to highlight is, you know, when you

20:22

describe the benefits of this extreme

20:24

co-design approach, the fact that

20:26

we demonstrated the InferenceMAX

20:28

results, we demonstrated a 10x perf per

20:31

watt over our previous-gen Hopper

20:33

platform. So 10x Blackwell versus Hopper

20:37

in one generation. There's no way you

20:40

can get 10x out of just, you know,

20:42

delivering more transistors in a

20:44

GPU. It really took the entire extreme

20:47

co-design approach in terms of

20:49

leveraging scale-up NVL72, which

20:51

allowed us to do lots of different

20:52

parallelization techniques. It also

20:55

created you know an opportunity for us

20:56

to leverage our Dynamo software by

20:58

leveraging disaggregated serving and then

21:00

of course, you know, the more

21:02

transistors and the better perf with FP4

21:06

while also maintaining the accuracy um

21:09

within the inference model. So all these

21:11

things together is what translated into

21:13

that 10x delivered performance. And so,

21:16

you know, just highlighting that is is

21:18

really sort of um unthinkable like you

21:21

never would have thought of getting 10x

21:23

in a generation over generation, you

21:25

know, sort of improvement. But this is

21:27

this is really what this um approach

21:29

brings. And like I said, we're not

21:31

focusing on that single GPU. We're

21:33

looking at the entire system and now

21:35

we're looking at the entire, you know,

21:36

infrastructure supply chain and pipeline

21:38

to drive even more efficiency. So, you

21:41

know, just a case in point, but um you

21:42

know, I thought that was a key point to

21:44

highlight. >> No, I think that makes a lot

21:46

of sense. You know, what I'm really

21:47

hearing you say is Nvidia doesn't just

21:48

rely on Moore's law, right? Like chips

21:50

aren't just getting whatever it is now

21:52

1.5 times better. Let's just call it two

21:55

times better every two years. It's

21:56

really about optimizing across the whole

21:58

stack. And when you take a step back and

22:00

do that, not even at the tray level or

22:03

or even the rack level, but the data

22:05

center level, and you focus on

22:07

everything all at once, which is why you

22:09

guys co-design so many chips. That's how

22:12

you achieve 10x performance every year

22:15

as opposed to 2x performance every two.
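
To put that comparison in numbers, here is a small compounding sketch that takes the host's round figures at face value (10x every year versus 2x every two years). These are the host's paraphrase, not measured results, and the four-year horizon is an arbitrary choice for illustration.

```python
def cumulative_gain(factor: float, period_years: float, years: float) -> float:
    """Total improvement after `years` if performance multiplies by `factor` every `period_years`."""
    return factor ** (years / period_years)


HORIZON_YEARS = 4  # arbitrary horizon for illustration

full_stack = cumulative_gain(10.0, 1.0, HORIZON_YEARS)    # host's "10x every year"
transistor_only = cumulative_gain(2.0, 2.0, HORIZON_YEARS)  # host's "2x every two years"

print(f"10x per year over {HORIZON_YEARS} years: {full_stack:,.0f}x")
print(f"2x per two years over {HORIZON_YEARS} years: {transistor_only:,.0f}x")
```

The gap widens exponentially with the horizon, which is the point of contrasting whole-stack optimization with transistor scaling alone.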

22:18

>> Absolutely. Absolutely.

22:20

>> Makes a ton of sense and it really puts

22:21

I think the whole conversation in one

22:23

unified context, right? Like how do we

22:25

drive performance at the whole data

22:27

center scale

22:28

>> and then what makes it even more complex

22:30

is as if that isn't hard enough, we then

22:32

take it and say how can you disaggregate

22:34

it completely? How can you run our GPUs

22:37

with another networking? How can you

22:40

take your GPUs and include it with our

22:42

new NVLink Fusion, which allows you to

22:44

use our scale-up architecture. So we're

22:46

not only making this fully integrated

22:48

co-design stack. We're also making it

22:51

you know sort of modular enough and

22:54

disaggregated enough such that we can, you

22:57

know plug in wherever the user is and

22:59

so make sure that we can build a

23:01

solution that's right for them. Even

23:02

though we deliver with speed of light

23:04

in terms of the full stack, but we want

23:06

to recognize that every user is

23:08

different and every user has different

23:10

business objectives. And so, you know,

23:12

as if it wasn't hard enough to build a

23:13

fully integrated stack, we're also

23:16

building it, you know, disaggregated so

23:18

it can be consumed in so many different

23:19

ways. So, that's just another layer

23:21

of complexity that we're kind of

23:23

imposing on ourselves because it

23:25

makes perfect sense from a data center

23:27

builder perspective. They're already

23:29

deploying NVL72 scale-up architecture for

23:32

the majority of their data center. Why

23:34

not look at standardizing as a

23:36

way to scale up and scale out their

23:38

architecture. So it's a lot of

23:40

excitement but again I think it's just

23:41

another dimension by which we are you

23:44

know trying to create value is not just

23:46

giving you exactly what we build but

23:48

giving you the pieces that actually add

23:49

value for your deployments. In

23:52

fact, at GTC, we announced something

23:54

called DSX, which is basically um it's a

23:58

mixture of digital twin capabilities

24:00

along with gigascale AI factory

24:03

reference blueprints that helps the

24:05

entire ecosystem build to a common

24:08

design of figuring out how do we build

24:10

that gigawatt scale AI factory but as

24:13

efficiently as possible by leveraging a

24:15

lot of our core best practices as well

24:18

as building within the digital

24:20

environment the digital twins.

24:21

to really make sure we can build,

24:23

operate and design these systems much

24:25

more effectively. So all of that is

24:27

going to really power the next wave of

24:29

AI quite frankly

24:31

>> The way everything fits together,

24:33

both, like, you know, the co-designed chips,

24:35

then how that scales up to the data

24:38

center level and even across multiple

24:40

data centers then all the software and

24:42

control systems that sit on top of it.

24:44

Is there any one thing out of everything

24:47

we've talked about so far that really

24:49

excites you the most as you look to the

24:50

future?

24:52

>> Well, I think it's it's really this

24:54

notion that we are working as a

24:57

collective, right? We are literally

24:59

working across every partner ecosystem,

25:03

every developer ecosystem to bring sort

25:06

of these solutions together that will

25:07

hopefully power the next wave of AI. And

25:10

so Nvidia recognizing we can't do it

25:12

by ourselves. Um and that's why we've

25:14

always had this sort of approach of

25:16

developer first. You know, from the very

25:18

early days of accelerated computing

25:20

becoming a new programming model, it

25:23

resided in identifying applications and

25:26

developers that could extract the value

25:28

out of that platform. So we take that

25:30

same approach today as we look at the

25:31

next wave of AI. How do we create the

25:33

conditions, you know, with new

25:35

processors like CPX? How do we create

25:37

the conditions with new software like

25:39

Dynamo or new you know sort of

25:42

architectures that can you know unlock a

25:44

whole new set of use cases for all of

25:46

the developers and so once we look at

25:49

that um you know through all the

25:51

customers we're talking to, all

25:53

the feedback we're getting it's really

25:55

exciting to see how the light bulbs are

25:57

going off in their heads thinking what

25:58

would I be able to do if I could have

26:00

these different capabilities and so as

26:03

we sort of you know position our

26:05

products and platforms It's really

26:07

exciting to see how the light bulbs are

26:09

going off in the developer community's

26:11

heads of how they're going to leverage

26:12

that to build things that we haven't

26:14

even thought of yet. So, so pretty

26:16

exciting stuff. Like you said, tiring

26:19

because it's a relentless pace, but you

26:21

know, again, this is definitely um

26:24

exciting times, and I think Nvidia is

26:26

honored to be kind of at the

26:28

epicenter of this whole

26:30

incredible transformation. >> And you know

26:32

what I've really learned and taken away

26:34

from this conversation is you know when

26:36

we think about AI and NVIDIA it's more

26:38

than just about training right there's

26:40

so much compute and there's so much

26:41

consideration that goes into the

26:43

inference side of things that Nvidia has

26:45

to build specialized chips for inference

26:48

and co-design many chips to make

26:50

inference work at large scales. I

26:52

have a whole new appreciation for uh

26:55

everything that Nvidia does across the

26:56

entire stack when it comes to inference

26:59

specifically. So, I can't thank you

27:00

enough for your time, Dion. I'm super

27:02

excited about what's to come. Uh, and

27:05

thank you very much. A huge thank you to

27:07

Dion Harris for walking us through

27:09

Nvidia's hardware ecosystem and how it

27:12

powers every phase of AI from pre and

27:14

post training to prefill and decode for

27:17

inference and not just for large

27:19

language models, but for everything from

27:21

image and video generation to recommender

27:23

systems, robotics, and beyond. And of

27:26

course, thank you for watching and

27:28

supporting the channel. Until next time,

27:30

this is Ticker Symbol: YOU. My name is Alex,

27:33

reminding you that the best investment

27:35

you can make is in you.

Interactive Summary

This video features an in-depth conversation with Dion Harris from Nvidia regarding the company's critical role in AI infrastructure. The discussion highlights that while Nvidia is renowned for AI training, a significant focus is now on inference, which involves complex tasks like prefill and decode. The video explores how Nvidia uses extreme co-design across GPUs, CPUs, DPUs, and networking to maintain an annual release cadence and achieve 10x performance gains per generation. It emphasizes that efficiency is the core driver for business returns in what is increasingly being called 'AI factories,' where throughput and power consumption are key metrics.