Chip design from the bottom up

Chip design from the bottom up – Reiner Pope

Watch on YouTube

Now Playing

Chip design from the bottom up – Reiner Pope

Transcript

756 segments

0:00

I'm back with Reiner Pope, CEO of MatX, a new AI chip company.

0:04

Last time we were talking about what happens inside a data center.

0:07

Now I want to understand what happens inside an AI chip.

0:10

How does a chip actually work? Full disclosure, by the way:

0:13

I am an angel investor in MatX. So hopefully you have designed a good chip.

0:19

Hope so. I'll start with the smallest fundamental unit of chip design, and we'll

0:25

build up to what an actual production chip is and what its components are.

0:30

At the very bottom level of a chip, the primitives we work with are logic gates,

0:37

very simple things like AND, OR, and NOT. These are connected together by wires that have to

0:42

be laid out physically as metal traces on a chip. The main function that AI chips want to compute

0:50

is the multiplication of matrices. Inside that, the fundamental primitive

0:54

is a multiply-accumulate of pairs of numbers. We're going to demonstrate what that calculation

1:01

looks like by hand, and then infer what a circuit would look like for that.

1:08

It'll be easiest if I do a multiply-accumulate of a four-bit number with another four-bit number.

1:20

The clearest primitive is actually multiply-accumulate.

1:24

So there's a multiply of these two terms, and then we're going to add in an eight-bit number.

1:40

Can I ask a clarifying question? Why is this the natural primitive for

1:49

whatever computation happens inside a computer? There are a few reasons.

1:55

It's a little bit more efficient, but the reason it's natural for AI chips is that if

2:00

you look at what's happening during a matrix multiply… What is a matrix multiply in short?

2:09

There's a for-loop over i, over j, and over k, of output [i,

2:21

k] += input [i, j] x other input [j, k]. A multiply-accumulate happens at every

2:36

single step of a matrix multiply. The other observation is that the

2:40

precision will almost always be higher in the accumulation step than in the multiplication step.

2:45

This is specific to AI chips. You're multiplying low-precision numbers,

2:49

and then when you accumulate, errors accumulate quickly, so you need more precision there.

2:54

This is why we've chosen to do a four-bit multiplication and an eight-bit addition.

2:58

Let me make sure I understood that. There are two ways to understand that.

3:01

One is that the value will be larger than the inputs.

3:06

The other is that if it was a floating-point number it would be…

3:10

Maybe that part is less intuitive to me. But maybe it's the same principle?

3:14

It really is the same principle. The separate principle is that as

3:20

you're summing up this number, you're summing up a whole bunch of numbers, so you've got a

3:23

lot of rounding errors accumulating. Whereas in this case, there's only

3:28

one multiplication in the chain, so there aren't a lot of rounding

3:30

errors accumulating in the multiplication. Why are you summing up a whole bunch of numbers?

3:33

There’s just two numbers there. This summation is repeated j many times.

3:39

Any errors accumulate. I see. So how would we perform this calculation by hand?

3:45

As a human, we would probably separate it into two, but we can do

3:48

it all in one using long multiplication. For the multiplication term first, we're

3:55

going to multiply this four-bit number by every single bit position in the other four-bit number.

4:03

We write that out. First, 1001 multiplied by this bit position.

4:10

That is the number itself. Then shifted across by one,

4:14

we're multiplying by 0. That gives us an all-0 number.

4:20

Shifted across one more to multiply by this one, we get 1001.

4:28

Finally, for this last bit position, we get an all-0 number again.

4:37

This gives us a bunch of terms that we have to add for the multiplication.

4:43

While we're doing that summation, we might as well add in the actual accumulator term as well.

4:50

So we just copy that directly across. So this is the sum.

5:00

It's a five-way sum that we want to compute.

5:07

What logic gates did it take us to get to this intermediate step?

5:12

We needed to produce all 16 of these partial products.

5:17

How do I produce one of these partial products? Let's take this number 1, for example here.

5:26

We produce it by multiplying this number by this one over here.

5:31

We can produce that with an AND gate. This number is 1 if both this bit

5:35

is 1 and this bit is 1. If either of them is 0,

5:37

then the multiplication of 0 times anything is 0. To produce all of this, we ended up

5:45

consuming 16 AND gates. In the general case,

5:50

if I were doing a p bit multiply times a q bit multiply, this will be p times q many ANDs.

6:10

Finally, I sum them. Most of the work is going to happen in the summing.

6:15

Let me describe the other logic gate that we use here.

6:20

AND is almost the simplest logic gate that exists on a chip.

6:23

It's almost the smallest. At the other extreme, the very largest logic gate you'll typically use

6:30

is something called a full adder. Coming from software, you might think

6:39

that a full adder adds 32-bit numbers together. In this case, it just adds three single-bit

6:43

numbers together, so you can think of it as adding 0, 1, and 1 together.

6:50

When I add these together, the result can be 0, 1, 2, or 3, so I can express that

6:54

in binary using just two bits. As input, it has three bits.

6:58

As output, it has two bits. The number 2 in binary is 10.

7:06

This is also known as a 3→2 compressor because it takes three bits of input

7:17

and produces two bits of output. Just to make sure I understood: the

7:19

two inputs are an X and a Y value and then some carry that came in…

7:27

The three inputs are all bits in the same bit position, like three bits in a column here.

7:35

The two outputs, I've drawn them vertically here and horizontally here to match this

7:41

vertical versus horizontal layout. This expresses that things in the same

7:45

column are in the same bit position, whereas things in adjacent columns are different.

7:50

This is a carry out, whereas this was the sum. So if the inputs in the full adder were, say,

7:56

101, then the output would be 10. If it were 111, it'd be 11.

7:59

If it were 000, it'd be 00. If it were 010, it'd still be 01.

8:06

Got it. Yeah. It's essentially just counting the

8:09

number of things and expressing that in binary. This circuit captures what we as humans naturally

8:17

do when we're summing along a column. I'll show one iteration of using

8:24

the full adder to sum. The way I sum here is going

8:27

to be a little unnatural for humans. We would sum along the column and then

8:31

remember the carry, but instead of remembering the carry, we'll explicitly write it out.

8:39

We proceed from the rightmost column toward the left.

8:42

On the rightmost column, we sum the 1 and the 1, and that produces a zero here and a carry of one.

8:51

We've used this full adder circuit on this pair of bits and produced a pair of bits as output.

8:59

Now we can do the same thing with this column. We have a column of four numbers, so we'll take

9:04

the first three of them, run a full adder on them, and that gives us a 0 and a 0 as output.

9:12

The sum of these is 00. That's the full adder applied to all these bits.

9:18

As I've used up bits, I'll cross them out to indicate that I've handled them.

9:25

Let's keep going a little bit more. I take these three numbers, I add

9:31

them, and that gives me a 1 and a 0. I've dealt with these three numbers.

9:36

Now I take these three numbers and add them, and that gives me a 1 and a 0,

9:44

and I've dealt with these numbers. The way to view this is that I have

9:51

this whole grid of numbers that need to be added. I'm going to keep applying full adders to all the

9:57

bits here, constantly removing three numbers from a column and writing out two numbers as output.

10:03

Keep going over and over again until I eventually get just one single number coming out.

10:16

This approach is called a Dadda multiplier. This is the standard for how you do

10:31

area-efficient multipliers using full adders. Let's try to quantify the circuit size of

10:38

this so we have a sense of how big things are and can compare them later.

10:44

How many full adders did I use? How many numbers did I start with?

10:50

I have the 16 partial products, which is a product of all of these terms with all of these terms,

10:57

plus the eight terms that I'm adding here. I started off with 24 bits.

11:06

Eventually I produced eight bits on the output. In every step, I was crossing off three numbers

11:14

and writing two numbers out as a result. Every single use of a full adder eliminates

11:20

one of the bits here. So how many full adders?

11:23

It must be 24 minus 8, so there were 16 full adders in this circuit.

11:30

This is true in the general case as well. There will be p times q many

11:34

full adders in this circuit. Let me make sure I understand the logic of that.

11:39

The input bits, 24, is p x q, plus p + q. The output bits are just p + q.

11:42

So p x q plus p + q, minus p + q equals p x q. That's right. I think this explains,

12:00

or at least hints at, the second reason we chose to do a multiply-accumulate.

12:06

The first reason is that it's what shows up in matrix multiplication.

12:10

The second is that it gave us this very slick, simple p x q, very simple algebra.

12:17

We've described this whole procedure. Every single atomic step that I took

12:22

here becomes a logic gate, and then the wires are connected together.

12:26

When I had these three inputs that I used to produce these two outputs, if I think of mapping

12:31

this to a physical device, there would be a wire connecting all three of these things together

12:38

into a logic gate that produced this output. This is the main primitive, at different bit

12:46

widths, that's inside an AI chip. We're going to build up from here

12:54

to how you would use it to run all the other operations you might want.

12:58

This might be the wrong time to ask, but whenever Nvidia reports that this chip

13:02

can do X many FP4 or half as many FP8, it seems to imply those circuits are fungible,

13:11

that there's not a dedicated FP4 versus FP8. But the way you're mapping it out here, it seems

13:17

like if it has to be mapped out in the logic, you would need a dedicated FP4 multiply-accumulate and

13:27

then a dedicated FP8 accumulate. Can you "funge" them?

13:33

As drawn, they're not particularly fungible. This is actually one of the main choices you

13:37

have to make when designing a chip: how much of FP4 and how much of FP8 do I have?

13:42

Sometimes I'll make that consideration from the point of view of the customer requirement.

13:47

Another angle is to equalize the power budget between FP4 and FP8.

13:55

When they report those numbers and it just happens to be the case that it does 2x as many FP4 as FP8,

14:02

they're just choosing to give equivalent die areas to all the floating points,

14:08

and as a result it ends up being…? Why is the ratio exactly 2x?

14:13

Part of it is that surely it won't be exactly equivalent to die area.

14:20

There's a data movement reason. We'll maybe come back to this when we

14:24

look at how it goes into and out of memories. There's something really nice from a software

14:28

level about the fact that I can pack two four-bit numbers into the same

14:32

storage as an eight-bit number. When I store that to memory,

14:36

the sizing of the buses that I wire out within the chip makes that work out really nicely.

14:44

Come to think of it, it's not just 2x. The amount of area it takes sounds like

14:54

it's quadratic with the bit length. That's why smaller precision is even

15:00

more favorable than you might think. This is a really big reason.

15:04

In fact, Nvidia made a change. Historically, up until B100 or

15:10

B200, every time you halved the bit precision, you doubled the FLOP count.

15:18

For the reason you said, because of this quadratic scaling, that ratio is actually slightly wrong.

15:21

You should get an even bigger speedup than you might otherwise think.

15:28

Nvidia’s product specs have started acknowledging that in B300 and beyond, where the FP4 is three

15:34

times faster than the FP8. Though it should be 4x.

15:38

Yeah. What I've shown here is the simplest case of integer multiply.

15:42

When you're dealing with floating point, as you do in FP4 and FP8, there's this other term,

15:49

the exponent, that complicates the calculation. What can we see already from this?

15:54

I think the big observation you've made is that there's this quadratic scaling with bit width,

16:00

which is very effective and is the single reason low-precision

16:04

arithmetic has worked so well for neural nets. The other thing we're going to do now is compare

16:11

the area spent on the multiplication itself with all the circuitry around it.

16:20

We'll walk back in time a little bit and see how GPUs prior to Tensor Cores worked,

16:28

which is in fact the same way CPUs worked. Where do we stick this multiply-accumulate unit?

16:37

Generically, I'll describe a CUDA core or a CPU. You'll have some register file which stores some

16:47

number of entries, maybe eight entries of, in this case, 4-bit numbers, but typically 32-bit numbers.

17:03

Inside the CUDA core, I'll have some register file of some depth,

17:06

and then I'll have my multiply-accumulate circuit. What it's going to do is take three arbitrary

17:18

registers from this register file, perform the multiply-accumulate,

17:21

and then write back to the register file. It's going to write to this one,

17:25

but it was able to read from this one, this one, and another random one.

17:31

It will take three inputs like this. This is the core data path of many processors.

17:39

Most processors look like this. You've got some set of registers, and

17:42

then you've got some set of logic units, or ALUs. We want to analyze the cost of the data movement

17:50

from the register file to the ALU and back. Ultimately, there's going to be some

17:58

circuit that says, "Well, I don't always have to select this guy.

18:01

I might select any of the registers at any point in time."

18:04

The first question is: how can I build a circuit? The circuit I'm going to look for is a mux.

18:13

In this case, it's going to have eight inputs, one from each entry of the register file,

18:17

and it's going to have one output, which is actually producing this output.

18:26

What is the cost of this thing? All we have to build it out of is AND and OR.

18:32

How do we build it? We do the dumbest thing possible.

18:37

We form a mask. When we want to read the third entry, we're going to AND every

18:44

single entry with either 1 or 0 based on whether that's what we want to read,

18:48

and then we're going to OR all of them together. Just to make sure I understand the basics.

18:52

What the mux is doing is just selecting an input? Just selecting, invisible to software.

18:59

You say "I want input number three," and that means there's a mux here.

19:02

So what is the cost of this mux? An n-input

19:10

mux operating on p bits. I have n rows. That's just eight

19:21

rows, and each row is p bits wide. I have to AND every single bit,

19:25

so I get n x p many AND gates. For every single input I have to decide

19:32

whether I'm going to mask it out or not. Then I'm going to OR them all together.

19:38

There's going to be n – 1 times p many OR gates. I've got all of these different things,

19:47

almost all of them are 0s, but I need to collapse them from my eight options down into one option.

19:54

Every step, I need to OR one row into an existing row.

19:59

It's actually funny that you don't think at the level of hardware.

20:04

You just think, "Oh, I'll just select element three," and something as simple as that is

20:11

in and of itself quite a complicated circuit. This is the first step of all of the hidden

20:15

data movement costs that show up. We're just going to compare.

20:23

I have to pay this cost. I've got one mux here, and in fact

20:26

I have two more copies of that for each of the three inputs to my multiply-accumulate operation.

20:30

I have this cost, which is 3 x n x p AND gates over here, compared to p x q gates in the actual

20:42

circuit that is doing the thing I care about. If we plug in actual numbers, with n being

20:49

eight, I get 24 x p gates just in the data movement, compared to—if q is four—4 x p

21:00

gates just in the multiply-adder. Where is the three coming from?

21:05

Three different inputs here. What I'm hinting at is that all of this work,

21:14

which scales as the size of the register file—and this is a very small register

21:18

file—all of this work just moving the data from the register file to the logic unit is many,

21:26

many times more expensive than the logic unit. In the most recent ClusterMAX report, SemiAnalysis

21:31

ranked almost 100 different GPU clouds. Crusoe was one of only five that made the gold tier.

21:36

SemiAnalysis found that gold-tier providers like Crusoe had a total cost of ownership that

21:41

was 5% to 15% lower than silver-tier ones, even when they had identical GPU pricing.

21:46

This makes sense because total cost of ownership is downstream of a bunch of different things that

21:50

don't necessarily show up in the sticker price, but that Crusoe has optimized.

21:54

Things like how well you detect faults and how quickly you replace failed nodes.

21:58

For example, Crusoe was one of the first clouds to adopt NVSentinel, NVIDIA's own GPU monitoring

22:04

and self-healing software for enhanced GPU uptime, utilization, and reliability.

22:09

This lets Crusoe make use of everything that NVIDIA has learned about why chips fail across all

22:14

their different fleets and deployments, so that Crusoe can catch faults earlier in the process.

22:18

And once they identify a failure, Crusoe can swap in a healthy node in less than 10 minutes.

22:23

Because they’re not running bare metal, Crusoe doesn't have to spend time installing an

22:27

operating system or configuring drivers. They can just spin up a new VM on an

22:31

already running and pre-qualified host. If you want to learn more about this or

22:34

the other reasons that Crusoe made gold tier, go to crusoe.ai/dwarkesh.

22:41

It may be helpful to just see what a mux looks like,

22:44

maybe a two-bit or a four-bit mux. We’ll do a two-way mux.

22:53

We've got two different numbers, we’ve got these two inputs.

23:03

These are the inputs that are being selected between, and the selector can either be "I want

23:16

this one" or "I want the other one." This is a one-hot encoding.

23:24

This is what we start with. Let's focus on this case.

23:34

This is the actual input we got, and we want to produce this guy as the result.

23:42

Very laboriously, we AND this bit with all of these.

23:50

That produces ANDing this bit with this row. Likewise, we AND this bit with this row.

23:55

That produces all 0s. There are four ANDs here. Finally,

24:07

we OR these two together, and this gives a 1. We OR these two together, this gives a 1.

24:09

We OR these two together, it gives a 0. We OR these two together and it gives a 1.

24:15

Those are the four ORs. This actually ends up

24:22

looking a little bit like addition. We did exactly the same set of ANDs.

24:28

We've ANDed all of these things together, but then instead of collapsing it by using

24:34

full adder circuits, we just get a very simple collapsing with OR gates.

24:38

But that doesn't look like n times p. This was with n=2 inputs.

24:48

In the general case, we will have n rows, and we'll have p bits per row.

25:03

That gives us the n times p many AND gates. In this circuit I've described, almost all of

25:11

the cost, seven-eighths of the cost, is in reading and writing the register file, and only a tiny

25:19

fraction of the cost is in the logic unit itself. This is the problem to solve.

25:24

This essentially was the state of play prior to the Volta generation of Nvidia GPUs.

25:31

This kind of thing is what was inside the CUDA cores.

25:37

This problem statement is what motivated the introduction of Tensor Cores,

25:41

which are more generically called systolic arrays. Think about how we're going to solve this problem.

25:46

We're spending almost all of our circuit area on something that we

25:48

really don't care about and is hidden to the software programmer, and the thing that we

25:52

actually care about is not much of the area. Make this one bigger somehow while keeping

25:57

this at the same size. That's the goal.

26:01

The evolution was that we had baked this much into hardware at this stage.

26:06

This single line is a multiply-accumulate, and this single thing was baked into hardware.

26:11

The idea of a systolic array is to go two levels of loops up and bake this

26:17

entire loop out here into hardware. The idea is that if we have a much

26:22

bigger granularity fixed-function piece of logic, maybe the taxes we

26:27

pay on the input and output are much smaller. Interesting. It sounds like you're suggesting

26:30

that if you go up one step in the matrix multiply loop, you can tilt the balance

26:41

more towards compute than communication. That's right. There are two effects we're

26:46

going to take advantage of here. One is that we can do more stuff

26:49

per every trip through our register file. The other is that in some of this loop, we can

26:58

take advantage of certain things staying fixed. Visually, we're going to look at this

27:07

matrix multiplication. This portion of the loop

27:09

corresponds to a matrix-vector multiplication. We'll take a matrix and multiply it by a vector.

27:21

How do we do this? Every column gets

27:23

multiplied by the vector and then summed. We're going to sum along columns.

27:28

This 0 and 3 gets multiplied by the 3 and 7 and gets summed, and then the 1 and 2

27:32

gets multiplied by the 3 and 7 and gets summed. There is a multiply-accumulate associated with

27:38

every single one of these entries in the matrix. We'll draw out these four multiply-accumulates.

27:53

Just to make sure I understand why there are four multiply-accumulates: each entry in the

28:01

column that corresponds to the output vector is a dot product, and in this

28:09

case it will be two multiplications and then the addition of those two multiplications.

28:16

You're accumulating... Really there's only one addition per

28:20

dot product, but we like to start with zero. But it includes the initialization of zero.

28:23

Yeah. We want to have quadratically more compute.

28:30

We have x times y as much compute as we had before.

28:41

But we want to aim for having only x times as much communication.

28:52

The intention is to get this advantage term going as y.

28:57

We've laid down the multiplications. We want to bring in a vector of size two, and

29:06

that is already in line with our columns target. That's fine. However, we need to manage the

29:11

communication of this matrix, which exceeds our budget of x.

29:18

The idea is that in an AI context, this matrix is going to stay fixed for a long period of time.

29:29

We've got some register files sitting over here. The amount of stuff coming out of this register

29:36

file… this is the term that we want to go as x, in some sense.

29:43

We don't want to bring this full matrix in from the register file every cycle,

29:47

because that would cost too much in terms of wiring from the register file.

29:55

Our key trick is that this matrix can be stored locally to the systolic array.

30:04

We'll store these numbers 0, 1, 2, and 3 in a gate called a register that physically

30:15

stores these numbers, and we're going to reuse these numbers over and over again

30:19

for a large number of different vectors. The optimization here is that the nature of

30:23

matrix multiplication is that you can store this square quadratic thing directly where the logic is

30:35

happening, which has an extra dimension compared to the inputs that you keep swapping in and out.

30:45

That’s right. This is the nature of

30:46

what a matrix multiplication is. You do a lot of multiplication

30:52

to get one value out. A dot product is the

30:55

result of a lot of multiplications. So that optimization means you can

30:59

stuff a lot of multiplication in before you get some value out of it.

31:03

That's right. Just to complete the picture of concretely how that looks:

31:09

I swapped the 3 and the 2 here. Just like this 0 and 3 is going

31:17

to multiply by the 3 and the 7, we're going to form a dot product along columns here.

31:26

We're going to feed a 3 and a 7 in here. This feeds into this multiplication and also

31:35

feeds into this multiplication. Likewise, the 3 feeds

31:39

into here and also into here. Then we're going to sum along here.

31:49

Starting at the top of a column, we feed in 0s, and then coming out the bottom we get results.

31:55

Visually, there's a dot product performed along columns in a matrix, and that maps exactly to

32:03

what is done spatially in the systolic array. This is one dot product summed vertically,

32:09

and this is a second dot product also summed vertically.

32:13

What is the data that needs to go into and out of the register file?

32:17

We have x amount of data coming out on the output, and we also

32:23

have x amount of data coming from the input. With respect to the input and output vectors

32:33

at least, we've met our goal of having only x as much data going in and out of the register file.

32:44

This leaves open the question: I said the weight matrix is stored locally in the systolic

32:51

array, so how did it get there in the first place? At some point, you need to boot your chip and

32:56

populate this data, so where did that come from? The trick is that we just do it very slowly.

33:02

We very slowly trickle-feed it into the systolic array.

33:07

The simplest strategy is that we run this daisy chain: feed a number in here,

33:13

and on the next clock cycle it will move down to the next entry of the systolic array.

33:18

We can do that in every column in parallel, this is also going to come from here,

33:26

and that gives us another factor of approximately x units of bandwidth coming in.

33:32

Would you mind repeating that sentence one more time?

33:36

We know that we're going to be bringing numbers only rarely into the matrix.

33:43

We just want to come up with any construction at all such that the amount of wiring that

33:48

crosses the boundary of the systolic array is bounded to x and not go as xy.

33:56

A particularly simple strategy is that we bring a number into the top row of

34:03

the systolic array in one clock cycle. Then, for y consecutive clock cycles,

34:11

we bring in the top row every time and shift all the other rows down by one.

34:16

This keeps the wiring that needs to come from this expensive register file

34:22

only down to a factor of x rather than xy. I see. There are two questions in terms

34:27

of communication: communication time and communication bandwidth.

34:33

You're saying that since we're only going to be loading this in once, let's minimize bandwidth,

34:39

because bandwidth equals die area. We load it in slowly over smaller

34:45

lanes because we're just going to keep this value in there for a while.

34:48

Exactly. It's interesting to me that when we were talking

34:52

last time about inference across many chips, the big high-level thing we're trying to optimize for

34:58

is increasing the amount of compute per memory bandwidth, that is to say, per communication.

35:03

Here also, we're trying to increase the amount of actual multiplies

35:08

or additions relative to transporting information from registers to the logic.

35:16

In both cases, you're trying to maximize compute relative to communication.

35:20

This shows up all the way up and down the stack. This is close to the bottom, to the gates.

35:28

There's a version that's maybe even closer to the gates in the precision of the number

35:34

format that you choose to use. We saw that same effect.

35:39

There's a squared versus linear term going on both purely in the precision of the ALU,

35:47

but also in the size of the matrix. This unit is the next bigger unit.

35:55

We had the multiplication circuit, and on top of that we have a pretty large systolic array.

36:01

I drew it as 2x2, but older TPUs were described as 128x128 of this circuit shown here.

36:12

This ends up being the most efficient known circuit for implementing a matrix multiply.

36:21

We've talked about how it seems obvious that you should try to

36:26

maximize compute relative to communication. What are non-obvious trade-offs that

36:35

keep you up at night, about whether you should do X or Y and it's not obvious what the answer is?

36:41

Most of the decisions in chip design are sizing decisions.

36:48

Already in what we've drawn so far… AI chips all have this circuit in them.

36:56

They have a systolic array, and somewhere near it a register file providing inputs and outputs.

37:06

Even within this scope, the sizing questions you have are: how big should I make my systolic array,

37:11

and how big should I make the register file? These two questions are coupled.

37:25

One way to think of it is to set a budget for what percentage of chip

37:32

area you want to spend on data movement. Maybe I say that I want this to be 10%

37:36

and the systolic array to be 90%. Then I can size my register file.

37:42

Bigger register files are more flexible. They allow me to get more application-level

37:47

performance out, but they take away from the area spent on the systolic array.

37:55

I recently ran an essay contest where I asked people to write about what I consider to be some

37:59

of the biggest open questions about AI. The submission window closed last week,

38:03

so I used Cursor to create a couple of different interfaces to help me review the entries.

38:07

One interface anonymizes submissions and hides the necessary information.

38:10

It lets me group responses by question, add notes, and record my scores.

38:14

The other interface helps me review entrants who also want to be considered

38:17

for the researcher role that I'm hiring for. The UI puts the applicant's essay right next

38:21

to their resume and their personal website so that I can see everything at once.

38:25

Cursor's harness is really good at helping these models see and improve their UIs.

38:29

I watched it render these interfaces in the built-in browser, take screenshots,

38:33

click through sections, and keep iterating. At this point,

38:36

Cursor is where I do most of my work. Whether I'm reading and visualizing a bunch

38:40

of research papers, or coding up an interface to review applications, or making flashcards for my

38:45

blackboard lectures, Cursor just makes it very easy for an AI to look at whatever I'm looking at

38:50

and help me understand it and work with me on it. So whatever you're working on,

38:54

you should do it in Cursor. Go to cursor.com/dwarkesh.

39:00

Where does the clock cycle of a chip come in? What determines what that is?

39:08

And what is a clock cycle of a chip? At baseline, it's worth observing

39:12

that chips are incredibly parallel. You've got 100 billion transistors in a chip.

39:16

A key thing you need to do whenever you have massive parallelism is synchronize

39:21

between the different parallel units. In software, typically you have these very

39:27

expensive synchronization methods like a mutex. One thread will finish what it's doing,

39:34

grab a lock stored somewhere in memory, and notify the other thread that it's done.

39:39

On chips, we take a very different approach. Every nanosecond or so, all circuitry in the chip

39:50

will pause for a moment and synchronize. That is the clock cycle.

39:58

The entire chip typically goes in lockstep to the next operation in one fell swoop.

40:06

What this looks like in circuitry is that the clock is mediated by registers, which

40:17

are these storage devices we've drawn elsewhere. The way to think of it is: I have some storage

40:24

holding a bit, which might be 0 or 1. Then I have some cloud of logic,

40:29

which maybe is this systolic array or multiplier. I have a bunch of inputs feeding into this cloud

40:42

of logic, and eventually there's going to be some output register that it writes to.

40:48

There is a global clock signal driving all these registers.

40:57

At a certain instance in time, when the clock strikes, whatever value happens to be on that

41:05

wire at that instant is what gets stored. The challenge is that I would like to have

41:15

my clock speed run as fast as possible. If I run at two gigahertz, I get twice as

41:19

many operations done per second as if I run at one gigahertz.

41:24

But what that ends up meaning is that I'm very sensitive to the delay through this cloud of

41:30

logic, because any computation happening in there needs to finish before the next clock cycle hits.

41:40

A major point of optimization on any chip is to make this delay as short as possible.

41:53

Interesting. The constraint here seems to be that if you add too much logic,

42:02

you might risk missing the clock cycle. But if you don't add enough, you're

42:07

leaving potential compute on the table. Is there ever a situation where you take

42:15

a probabilistic chance that a computation finishes, or is it strictly that it either

42:23

finishes by the clock cycle or it doesn't? In standard chip design, you margin it such

42:28

that there is a probability, but it's many standard deviations out.

42:34

For all intents and purposes, it is a reliable part and will always meet the clock.

42:40

There are some weird exceptions, like clock domain crossings where you go from one clock to another.

42:45

Then you actually do have to reason about this probability.

42:47

But in the main path, you margin it such that you'll get there 25% of the clock cycle in

42:55

advance, making it very unlikely that it misses. Where the clocks synchronize, where the

43:07

registers are, is this something you determine as a chip designer?

43:11

Or is it an artifact where you want a certain sequence of logic, and the software you use to

43:18

convert your Verilog into what you send to TSMC just determines that to make it work,

43:25

you have to put a register here, here, and here, making sure no single step makes the

43:33

whole chip's clock cycle longer than it has to be? Inserting them is actually a huge part of the

43:38

work of designing a chip. It's done by a combination

43:42

of manual and automatic methods. To show the very dumb version of

43:48

what you can do here, you can take this logic and split it in half.

43:53

Instead of just one cloud of logic, I can have two smaller clouds of logic that do the same thing,

44:02

but split them up by a register. If you split it in the middle,

44:13

you can hit twice the clock frequency. That's great, you get twice the performance,

44:17

but at the cost of an extra register, which means more storage.

44:22

Stepping back, why do we need to synchronize the whole chip?

44:25

If you imagine playing Factorio or something, there's no global clock cycle.

44:28

Things are just done when they're done. There's iron on the plate, and you can

44:31

take it if you want. Taking that analogy,

44:36

the thing you need to be mindful of is if I have two different paths through some logic.

44:41

Say I have to do computation f here and computation g here,

44:47

and they're going to meet for computation h. There's going to be manufacturing variance.

44:59

In some chips f will take a little longer; in some chips g will take a little longer.

45:03

If I have a signal propagating through, and the results from f and g have to meet up at h,

45:12

what can go wrong is that f gets there early and it meets the previous

45:16

value of g, or the next value of g. Ah. And h needs to know when to start,

45:21

when the next iteration has… Exactly.

45:24

This explains why different chips made at the same process node, the same TSMC

45:32

technology, can have different clock cycles. Two chips made at 3 nm might have different clock

45:38

cycles based on whether they were able to optimize to ensure no single critical path is so long that

45:48

it slows down the whole chip's clock cycle. That's right. This optimization I showed here

45:53

is called pipeline register insertion. We've inserted a register in the

45:59

middle of the pipeline. This is a pure trade-off

46:02

between clock speed and area. That is the easy case.

46:08

There is a harder case too. I drew out a pipeline of logic,

46:14

but in other cases you may have some calculation which actually feeds back in on itself.

46:22

It runs some function f and then writes back to itself.

46:27

For example, this might be an addition where you're adding a number every clock cycle.

46:43

This little circuit essentially sums all the numbers presented on different clock cycles.

46:49

The challenge is, if this plus takes too long, what can I do?

46:54

If I try to put a pipeline register right in the middle of it, it

47:05

changes the computation that's done. Instead of forming a running sum

47:08

of everything that comes in, I will actually have two different running sums.

47:12

I'll end up with a running sum of the even numbers and a running sum of the odd numbers.

47:18

This constraint—where I have a loop in my logic, which all chips have somewhere—is the hardest

47:25

thing to address and sets the clock cycle. I don't understand why it would be

47:29

a problem to have that. I'm not even sure what it

47:30

would mean to have a register there. Is it a sort of atomic operation?

47:36

Well, plus is not really atomic. As you just demonstrated.

47:41

It took a whole lot of work to do a summation. You can take the early parts of that work,

47:45

stick a register in the middle, and then take the late parts of that work.

47:48

Okay. TSMC offers a PDK which specifies the primitives of

47:58

logic they can grant you in the chip. It's up to them to determine that no

48:03

primitive is bigger than the clock cycle they're hoping a process node targets.

48:10

But other than that, can't you just take all the primitives from TSMC and keep adding

48:19

registers between them as much as needed until you get to your desired clock cycle?

48:24

As a logic designer, the chip architect sets the clock cycle.

48:28

For example, the primitives you get from TSMC are on the order of AND gates or full adders.

48:36

It depends a lot on voltage and which library you choose, but generally you can have about 10, 20,

48:46

or 30 of these sequentially in a clock cycle. These primitives are very fast,

48:51

maybe 10 picoseconds. As a logic designer, in principle,

48:57

if you just had a register and an AND gate in a loop, you could get an insanely fast clock speed,

49:07

more than four, five, or six gigahertz. But if you take this really simple circuit

49:16

and look at the area you're spending here… This is called one gate equivalent

49:23

in size, so unit of one in area. This thing is maybe a unit of eight in area.

49:31

Again, almost all your cost becomes synchronization or

49:34

communication cost compared to the actual logic. This would be a case where you've gone too far.

49:40

You've made your clock speed really fast at the cost of spending almost

49:43

all of your area on pipeline registers. Interesting. So you're hinting at a dynamic

49:50

where you can have a really fast clock speed but you're not getting that much work done.

49:56

You can have low latency but low throughput. It hurts your throughput, in fact, because

50:05

the throughput of your chip is the product of how much you get done per clock cycle—which

50:11

is based on area efficiency—times how many clocks you get per second.

50:15

This is actually so similar to the thing we were discussing last time about batch size,

50:21

where if you have a low batch size, any one user can receive their next token really fast,

50:28

but the total number of tokens processed in, say, an hour will be lower than it could otherwise be.

50:34

Exactly. You get less parallelism out if you drive your clock speed up really high.

50:39

Language models are starting to compete against the best human forecasters.

50:42

I sat down with two senior Jane Streeters, Ron Minsky and Dan Pontecorvo, and asked: at some

50:48

point, does AI just do what Jane Street does? There's a world that we should take seriously

50:52

where, you know, we're going to build large language models or some other AI systems that

50:56

are like strictly smarter than all humans on the planet and more capable at all cognitive tasks.

51:02

Trading in particular feels to me as like kind of AGI complete, sort of like NP complete,

51:08

because at the end of the day trading involves figuring out what things are worth, which means

51:15

making predictions about the future. Jane Street isn't betting against AI.

51:18

They just signed a six billion dollar compute deal.

51:20

But Ron's view is that the edge keeps moving. I have never been more desperate to hire more

51:25

engineers and more traders than I am today. You know, you have the usual thing of like

51:29

the other hard parts that we don't yet know how to automate.

51:31

Well, that ends up being where the competitive edge lies.

51:34

You can find these open positions and watch the full interview at janestreet.com/dwarkesh.

51:41

I remember talking to an FPGA engineer at Jane Street, Clark, who helped me prep for

51:48

the previous interview we did together. He was explaining why they use FPGAs.

51:55

I imagine that for high-frequency trading, throughput is less important than latency,

52:01

so having very specific control over the clock

52:03

cycle in a deterministic way is the most important thing.

52:08

Maybe it'd be interesting to talk about why you can't just achieve that with an ASIC,

52:14

or why you might use an FPGA to have deterministic clock cycles for high-frequency trading.

52:25

Let's consider the business case for an FPGA versus an ASIC.

52:31

FPGAs and ASICs use largely the same conceptual model.

52:36

You have a series of gates built from small primitives—ANDs, ORs, XORs—connected together

52:48

with wires running in a fixed clock cycle. Anything you can express in an FPGA you

52:53

can express in an ASIC too. It will be about an order of

52:55

magnitude cheaper and have better energy efficiency on an ASIC than an FPGA.

53:01

The trade-off is that the first FPGA costs you $10,000, whereas the first ASIC you make costs $30

53:08

million because it requires an entire tape-out. The business use case for an FPGA is when you want

53:16

something that has very deterministic latency, fast runtime, and high parallelism,

53:25

but you are going to change the workload frequently, maybe every month.

53:29

You don't want to pay that tape-out cost every time.

53:34

How does an FPGA actually emulate the ASIC programming model in a fixed piece of hardware?

53:42

At its core, it has the two components we just talked about.

53:54

It has registers as storage devices, and it has lookup tables

54:03

(LUTs) which provide all of the gates. Then there's a third component.

54:11

We have a swarm of these registers and LUTs, and they are connected by a big set of muxes.

54:30

In front of every single one of these, we have a mux which selects an input from everywhere else.

54:44

We have a whole bunch of different options feeding into all of these things.

54:54

What this allows is essentially that when I program my FPGA, I can take all of these

55:01

components and superimpose a particular wiring which goes through this LUT, feed it into another

55:10

LUT, send it to this register, and then feed it into another LUT, or something like that.

55:16

What I've drawn in orange is how you… FPGA means Field-Programmable Gate Array.

55:22

The orange is what has been programmed in the field, whereas the white is all the

55:26

wires that must exist in the FPGA in order to actually make the device in the first place.

55:30

What does it mean to be programmed in the field? Programmed in the field means the device is

55:34

being deployed in a data center. It's sitting out in the world,

55:36

and then you can come and program it. Ah, not field as in like electric field.

55:39

Field as in like out there in the world, ok. If I look at how the field programming comes

55:48

out of the first lookup table and goes into a second one, how does that work?

55:57

Where are the wires that make that happen? I got a little bit lazy in drawing all of these.

56:03

Every single device here has a mux sitting in front of it, which can select from all of the

56:11

nearby circuits that are available. The actual configuration of the

56:19

FPGA amounts to the mux control. In this mux, we have the data inputs,

56:28

and we have the control that selects. There's a little storage device sitting next

56:34

to every single one of these muxes saying, "This is where you're going to source your input from."

56:43

Programming it consists of configuring every single one of these muxes.

56:47

That makes sense. What is happening inside of the lookup table?

56:51

The lookup table is also going to have a little bit of control telling it what to do.

56:56

Its purpose is to configurably take the role of an AND gate,

57:03

OR gate, XOR, or any of those different things. There are many ways you could consider doing that.

57:10

The way it's done in traditional FPGAs… A lookup table has four bits

57:25

of input and one bit of output. How many different functions are

57:31

there from four bits to one bit? There are 16 different functions.

57:37

You can tabulate this as 16 different numbers. You've got a table of 0111001, 16 entries.

57:52

This table is stored in this blue configuration bit.

58:00

It views these four bits as binary, looks up the relevant row of the table, and emits that bit.

58:08

This is essentially a truth-table view of lookup tables.

58:11

Okay, so if you think about an AND gate, OR gate, NOR gate, XOR gate, these all take as input…

58:18

Those are two-input functions. Sometimes we have a three-input function, like a three-way XOR,

58:25

or a four-way XOR. In this case,

58:27

does it just depend on how big it is? The typical size for LUTs is four inputs.

58:32

It's sort of just a sweet spot. There’s another compute vs. communication

58:37

trade-off here. If it has too

58:39

few inputs, you need to use more LUTs. Basically the lookup table is a truth table.

58:47

With a truth table, you can program in any gate you want.

58:50

So instead of a lookup table, you can just think of it as a programmable gate.

58:53

That's right. One of the things you can do here is you can see where the rule of thumb

58:59

that an FPGA is an order of magnitude more expensive than an ASIC comes from.

59:04

You count how many gates would be inside this lookup table.

59:10

We can view this lookup table essentially as one of these muxes.

59:14

It has to select between 16 different values,

59:19

so it's a mux with n=16 options and p=1 bits. As we saw earlier, this circuit

59:37

costs n times p many gates. So it costs np, which is 16,

59:45

AND gates, and also 16 ORs. This circuit being the mux?

59:52

Exactly, the mux. The mux that goes into the lookup table?

59:56

The lookup table itself you can think of as being a big mux that selects from all 16

60:02

rows down to one output. That's the lookup table.

60:05

But the way you've drawn it here, there's a mux and then a lookup table.

60:08

It's muxes all the way down. There is a second mux that is inside here.

60:12

This mux is this mux. And the other mux is just saying…

60:18

where it came from in this mess of gates. Right, and the second mux is, "Okay,

60:23

now you have one value, but that value is still a four-bit value."

60:29

Yeah, I've selected four bits from the soup. Then I use those four bits to select which

60:35

entry in the lookup table I'm going to use. Suppose in the first mux you're pulling

60:45

from eight nearby registers as input. That's a total of 32 bits going in.

60:53

Out of that, four bits come out. Those four bits go into the second mux,

60:58

which is inside the lookup table. In this case, these registers

61:02

are single-bit registers. If there are eight nearby

61:06

registers and lookup tables, then I have eight bits total coming in nearby.

61:11

I select from eight down to four different values. There are actually four different muxes, a little

61:21

mux associated with each of these input bits. Each of them is selecting one out of eight.

61:26

Where are those eight coming from? Nearby registers and other LUTs.

61:30

Each register is one bit. Yes.

61:33

I guess AMD or whoever makes these FPGAs still has to be opinionated about which

61:42

registers are connected to which registers. You can program in the actual gates,

61:47

but they add a wire in the connect… the communication topology, right?

61:53

You get flexibility at a local grain. There's a nearby neighborhood you can select from,

62:00

but for more coarse, long-distance connections, they form an opinion on that.

62:06

And the reason it's 10x slower is why? If you look at the cost of building this

62:11

lookup table, it's 32 gates. It can give me the equivalent

62:17

of—what’s an interesting thing I can do here—a four-way AND gate.

62:26

A four-way AND means AND, AND, and then an AND of an AND.

62:36

This is a circuit I could implement in an ASIC directly using three AND gates.

62:41

Using a LUT, I can also implement it, but it's going to take 32 gates instead of three.

62:47

So the overhead is really coming from the fact that there's a more concise

62:58

way to describe a truth table than listing out every single possible combination of inputs,

63:06

which is just to write out the gate. Yes, to place down the polysilicon

63:11

and the wires and so on. Interesting. One important point you

63:15

made to me is that the reason they prefer FPGAs to CPUs is that they get deterministic clock cycles.

63:21

They know when a packet will come in and go out. Why isn't that a guarantee in CPUs?

63:28

You can actually design a CPU that has deterministic latency as well.

63:34

In fact, the processors inside a lot of AI chips also have deterministic latency.

63:40

Groq has advertised this. TPUs have that in the core as well.

63:46

The challenge is getting deterministic latency and high speed at the same time.

63:53

Non-deterministic latency comes from specific design choices in a CPU.

64:03

It's actually possible to remove those design choices and make a CPU with deterministic latency,

64:08

but those are not very attractive in the market, so people don't make those CPUs anymore.

64:16

In some sense, deterministic latency is a simpler starting point, and some chip designers have

64:24

added things in to make it non-deterministic. To take a concrete example, probably the most

64:33

important source of non-determinism on a CPU is the CPU cache itself.

64:43

In a CPU, you have the CPU die itself, and then DDR memory off on the side.

64:54

You have a cache system inside that remembers recent accesses to DDR and stores them.

65:09

When I'm running through my CPU instructions, every time

65:12

I have an instruction that accesses memory, it first checks if the data was stored in the cache.

65:17

If not, it fetches it from DDR. This is a huge optimization.

65:23

The cache is two orders of magnitude faster than the DDR.

65:27

If you never used the cache, basically all programs would run a hundred times slower.

65:34

The presence of a cache is absolutely necessary for a CPU to run at a reasonable speed.

65:39

But whether or not you get a cache hit depends on the ambient environment of the CPU:

65:45

what other programs are running, what has run recently, and what the random number

65:48

generator inside the cache system is doing. That is a big source of non-determinism

65:53

in the runtime of a CPU. That is the memory system for a CPU.

66:01

The big thing you can do differently is, instead of having the hardware say,

66:06

"I’m going to read memory" and then the hardware decides whether or not it comes

66:11

from the cache, you can bake this decision into software, a different design philosophy.

66:22

You see this in TPUs, for example. I'll draw the same diagram,

66:28

but I'll call it a scratchpad. The main difference is… This would be a TPU,

66:36

and you have HBM in this case rather than DDR, but it's still an off-chip memory.

66:43

Instead of the software saying "first access memory" and letting the hardware decide,

66:48

you have one kind of instruction that goes to the scratchpad and a totally different kind

66:53

of instruction that goes to HBM. This style is generically known

66:58

as scratchpad instead of cache. The key distinction is that you have

67:04

one kind of instruction that says "read or write scratchpad," and a totally different instruction

67:08

that says "read or write HBM." So scratchpad being the cache.

67:12

Yeah, this thing here is the scratchpad. Stepping way back: people say computers have the

67:22

"von Neumann architecture", where there's this serial processing of information.

67:28

Maybe it's just because we've been talking about parallel

67:30

accelerators, but the FPGA is super parallel. The AI accelerators, the TPUs, are super parallel.

67:41

Even CPUs are super parallel if you think about all the cores they have.

67:46

In what sense is modern hardware actually the von Neumann architecture?

67:50

Is that actually a fair way to describe modern hardware?

67:52

I think it's a fair way to describe CPUs. The amount of parallelism you get on a CPU is

67:58

about 100 cores times maybe 16-way vector units, so about 1,000-way parallelism on a CPU.

68:06

One question: there is a die being used for the CPU, and if there are fewer threads,

68:15

just as a matter of transistor voltages switching on and off, is it that there's

68:22

literally one control flow—a small part of the die—where voltages are switching on and off?

68:28

How do you actually occupy the die area of a CPU… If there are so few cores, what are

68:35

you spending all of the die on? Yeah, what is happening there?

68:37

The cores are just much bigger and more complicated.

68:44

We should compare a CPU core, which takes up one one-hundredth of the die, to a LUT.

68:50

A LUT is only 16 gates. It's clear why there are so many

68:54

more LUTs in an FPGA than cores in a CPU. But why are there more CUDA cores,

69:04

for example, than CPU cores? What's the difference between a CPU and a GPU?

69:13

Inside the CPU, one big use of the area is the cache.

69:23

Very little is actually the ALUs. Mostly it's these register

69:28

files rather than the logic units. Both of those have equivalents in a GPU,

69:36

so that's not a big difference. But the thing that does not have

69:39

an equivalent in a GPU is the branch predictor. There is a whole big area in the CPU which is

69:49

just a bunch of predictors saying when the next branch will be and where the branch target is.

69:58

Stripping a lot of that out, as well as making these register files tighter,

70:06

drives a lot of the GPU gains over the CPU. What is the purpose of the branch predictor?

70:10

To execute both branches at once, or what does it do?

70:13

The issue is that when I've got a series of instructions, if I have a branch,

70:27

the actual step of processing an instruction takes a really long time.

70:32

It takes maybe five nanoseconds. The time to notice that I've got a branch,

70:43

evaluate whether the Boolean is true, update the program counter to the new target,

70:49

and then read from the instruction memory could take five nanoseconds to finish.

70:56

So in reality, this may finish somewhere down here.

71:03

I want to run a clock speed that is much faster than what five nanoseconds allows.

71:07

Five nanoseconds is a 200 MHz clock speed. I would like to run at one or two gigahertz.

71:13

So I need to run other instructions while the branch is being evaluated.

71:21

I just want to keep running the instructions that happen after me.

71:27

But that might have been wrong. If the branch ended up being taken, then

71:30

I need to know that instead of evaluating these instructions, I actually need to jump to wherever

71:34

the target is and run those instructions instead. The purpose of the branch predictor is to predict,

71:42

five cycles earlier, that a branch is going to happen, before you even get to that instruction.

71:49

If I think about how the brain works versus what you're describing here,

71:52

at a high level the differences might be that while you can do structured sparsity in these

71:59

accelerators and save yourself some area that you would have otherwise had to dedicate to gates,

72:07

in the brain there's unstructured sparsity. Any neuron can connect to any other neuron, and

72:12

not in ways where they have the column aligned. Then there's the fact that memory

72:18

and compute are co-located. Although I guess you could say in a way the memory

72:22

and compute are co-located on these dies too. This is exactly the co-location,

72:25

in some sense, of the memory and compute. So maybe that isn't a big difference.

72:30

Another big difference is that the clock cycle on the brain is much slower than on computers.

72:40

Partly that's to preserve energy, because the faster the clock cycle, the bigger the voltage

72:47

needs to be in order for the signal to settle and to identify what state a transistor is in.

72:54

That’s right. I don't know if

72:56

you have any commentary on what the brain might be doing versus how these chips work.

73:06

Let's take the clock speed one first. The clock speed is quite high on a

73:16

chip because that drives higher throughput. When we compare a GPU running some workload,

73:26

it's running batch size 1,000. Whereas the brain is not running

73:29

batch size 1,000, there's only one of me. You could imagine saying, "Take a GPU and

73:35

instead of running at a gigahertz, run it at a megahertz," and that would start

73:39

to look a little more like the equivalent things you're talking about in the brain.

73:45

But in the way silicon works, that does not give you a 1,000x advantage in energy efficiency.

73:58

What it ends up looking like is you just run this circuit once to stabilization,

74:07

and then it will sit idle for a long period of time.

74:10

It doesn't consume a lot of energy while it's sitting

74:12

idle because most of the energy is consumed in toggling bits from zero to one and back.

74:22

Let's talk about the energy consumption of a circuit like this.

74:26

The way to think of a bit being stored is that you've deposited some charge in a capacitor

74:32

sitting somewhere in the chip implicitly. It becomes charged when the bit becomes a one,

74:38

and then it becomes discharged when it next goes to a zero.

74:41

That cycle of charging the capacitor and then dumping that charge out

74:45

to ground is where the energy is consumed. This is called the dynamic or switching power,

74:49

and it's most of the energy consumption of a chip. There is some other energy consumption just

74:54

coming from the fact that insulators aren't perfect, but we'll discard that.

75:00

Most of the energy consumption comes from toggling from zero to one and back to zero.

75:08

If you run a chip much slower and you only clock it once every thousand clock cycles,

75:12

you will have 1,000 times fewer transitions. It will be about 1,000 times

75:16

less energy consumption. But it's not a substantial

75:20

advantage in energy efficiency. Okay, so you described how a TPU

75:25

works at a high level. What is the difference

75:29

at a high level between how a GPU and a TPU work? There is a high-level organization principle that

75:37

is different, and then inside the cores things are different.

75:41

Looking at the high level, we'll take a GPU and a TPU and see what

75:50

the top-level block structure looks like. If you think of this as the whole chip

75:55

in each case, the organization of the GPU is mostly a bunch of

76:03

almost-identical units, which are the SMs. They've got an L2 memory in the middle,

76:12

and then a bunch more of these SMs on the bottom. So there is a fairly regular grid of cores.

76:24

If we look at a TPU in comparison, you end up with much coarser-grained units of logic.

76:34

You end up with just a few matrix units, which are the big systolic arrays.

76:48

In the middle you've got some vector unit, and then you've got your matrix units at the bottom.

77:00

These matrix units with a vector unit in the middle make up the whole TPU chip.

77:06

You can think of scaling this thing down into a really tiny unit with a

77:10

smaller matrix unit and a smaller vector unit, and that is sort of what an SM is.

77:16

From a very high-level point of view, the GPU has a lot of tiny TPUs tiled across the whole chip.

77:24

Oh, interesting. You're suggesting the tensor core within a streaming SM is analogous to an MXU?

77:31

Yeah, it's all very similar. I see. If you had more lack of structure,

77:38

having a bunch of tiny TPUs makes a lot of sense. Whereas if you just have huge matrix

77:44

multiplications, you might want to avoid the cost of having individual SMs with their own

77:54

registers and warp schedulers. Why not just make a huge thing

77:58

and amortize those costs across the whole thing? This shows up in how large you can grow things.

78:04

We've seen this theme, especially with the systolic array, where a larger systolic

78:08

array amortizes the register file costs better. This design allows you to have larger systolic

78:14

arrays, whereas the GPU design constrains you to having small units of everything.

78:21

There is a trade-off, however. Because of this coarse-grained

78:29

separation of things, you need to move a lot of data from the vector unit to the matrix units,

78:37

through just two lines of perimeter here. If you look at the equivalent thing

78:46

in a GPU, you've got vector units everywhere, and you can move data through many different lines.

78:55

The amount of data you can move between a vector unit and a matrix unit is actually

79:00

much higher in a GPU than in a TPU. Instead of having to move all the data

79:08

through just two lines, you're moving it through 16 lines of wiring in a GPU.

79:16

Right. But also you might have to move across less area.

79:21

Which is an energy saving as well. So if you can operate entirely within

79:27

an SM, the data movement is much smaller. But the moment you want to operate across SMs,

79:33

it becomes more complicated and expensive. So you don’t have to comment, but one might

79:37

expect that a thing MatX might try to do is get the GPU-like smaller structure of

79:45

systolic arrays surrounded by SRAM, but at the same time make it so that the things you need in

79:55

an SM to support the CUDA architecture—which take a bunch of space—you might discard.

80:05

We've talked publicly about something we call a splittable systolic array, which

80:09

in some sense you can think of as big systolic arrays that can be small systolic arrays too.

80:13

Cool. Okay, I think that's a good note to close on.

80:16

Reiner, thank you so much. Thanks, Dwarkesh.

Interactive Summary

Ask follow-up questions or revisit key timestamps.

In this technical discussion, Reiner Pope, CEO of MatX, breaks down the architecture of AI chips, explaining the fundamental role of multiply-accumulate operations in matrix multiplication and how systolic arrays improve computational efficiency. The conversation covers the trade-offs between compute and communication in hardware design, the function of clock cycles and registers, and the differences between CPU, GPU, TPU, and FPGA architectures. Additionally, the dialogue touches on the importance of optimizing memory bandwidth and how modern AI accelerators aim to maximize compute performance.