Dwarkesh Goes Inside Jane Street's Latest AI Data Center

Transcript

0:00

Today I'm getting a tour of one of

0:02

Jane Street's training data centers here

0:04

in Texas from Ron Minsky who

0:06

co-heads a technology group and Daniel

0:09

Pontecorvo who heads the physical

0:10

engineering team. Thanks for coming out

0:12

here to do this. Thank you. Well, let's

0:14

get started.

0:17

So here's one of the places where we have our

0:19

training cluster of GB300 NVL72s.

0:22

>> Are you allowed to say what is currently

0:24

happening on this cluster?

0:26

>> What actual training jobs are running

0:27

right now?

0:28

>> High-level. We do a bunch of different

0:30

kinds of models. Some of the models are

0:33

for training LLMs. Some of the models

0:35

are for training all sorts of custom

0:37

architectures that are more adapted to

0:39

the trading problems in the trading data

0:41

sets. Earlier you were explaining to me

0:44

that this is originally not a facility

0:46

that was built to handle 200 kilowatt

0:49

racks. And so you had to retrofit it to

0:51

be liquid cooled and everything. You

0:53

know, these cabinets, these GB300

0:54

cabinets, consume at peak about 140 kW

0:57

each. Compare that to traditional

0:59

air-cooled you're talking about 10 to 40

1:01

kW. It's a lot more. So on the perimeter

1:03

here you can see some of this air

1:04

cooling equipment here that would have

1:06

fed traditional air cooling. Some

1:08

of it remains, we use some of it for the

1:11

air-cooled load that we have. About

1:12

15% of these cabinets are air-cooled. In

1:15

here inside the GPU you can see how the

1:18

cooling kind of flows from these

1:20

quick disconnects on the back

1:22

taking a fluid in, routing it to these

1:24

cold plates that are sitting on top of

1:25

the GPU, and then coming back out at a

1:27

warmer temperature.

1:29

So when you slide this sled in, it

1:30

automatically connects to liquid supply

1:33

and return and 54-volt power. Right? So

1:37

these just slide in,

1:38

power and cooling all within the sled.

1:41

There are still some components in here

1:42

that are air-cooled, but about 85 to 90%

1:45

of the heat load is cooled via those

1:47

cold plates. To what degree have we had

1:50

concerns about leaks, right? I feel like

1:52

we spent, you know, decades worrying

1:54

about not having water in our data

1:56

centers and now we're like putting it in

1:58

on purpose. Like

1:59

How big of a deal is that? So, inside

2:02

they have something there's underneath

2:04

there's

2:05

these ropes set to detect leaks. So,

2:07

there's a management side of the switch

2:09

of this server that will send out an

2:11

alert if they sense a leak in

2:12

there. Furthermore, there's leak

2:14

detection underneath the floor. If it

2:15

drips and falls under the floor, we'll

2:17

be able to sense it there and isolate

2:19

via valve. But, it is true that if

2:21

something here fails, you are at risk of

2:24

destroying the server. How often is

2:26

there a leak? Not often, but this stuff

2:28

is new. You know, it's yet to be seen

2:30

over time how this works out.

2:32

So, I guess it's surprising to me that

2:33

like it was not a problem to get this

2:37

liquid cooling going in a facility

2:39

that was originally built for air

2:40

cooling and like lower power densities.

2:42

I don't know how Why did it work?

2:43

>> How do you define problem? I mean, it

2:44

was an engineering challenge.

2:46

>> I feel like you're hearing the version

2:47

of the story after all the problems have

2:49

been worked out. These guys have spent a

2:51

huge amount of effort figuring out like

2:54

this place would feel in a kind of

2:55

intermediate point where we knew we had

2:58

to scale up a lot, but we didn't know

3:00

what the shape of the coming compute

3:02

was. And so, I think one of the things

3:03

that the guys here did really well was

3:05

like take to heart the importance of

3:07

optionality of like, "Oh, yeah, there's

3:09

a lot of different futures and we need

3:10

to build some stuff that will work for

3:12

multiple different ones." And I think

3:14

that works really well, but required a

3:15

lot of hard thinking and planning to

3:17

make it land well.

3:22

You know, one thing I'll say is there's

3:24

a couple of ways to do liquid cooling.

3:25

So, we have fluid coming from the roof

3:28

from the chillers on the roof down here,

3:30

maybe about 18° C.

3:32

>> [music]

3:32

>> We use the same fluid for the air

3:34

cooling, so it's fungible within the

3:35

data center. We can move the fluid

3:36

around. Oh, that's nice.

3:38

So, what we do is we send this in. This

3:40

device here makes sure that every single

3:43

cabinet has the right amount of flow.

3:45

You don't want the ones at the beginning

3:46

of the row to receive too much flow and

3:48

the ones at the end to be starved. So,

3:50

what these devices, these valves, they

3:52

control how much fluid goes to each

3:53

cabinet so that they're balanced.

3:56

How do they measure whether they

3:57

have the right amount of fluid?

3:58

>> Ultrasonic flow meter here. So,

4:00

ultrasonically it's measuring the fluid

4:02

and measuring how much flow in liters

4:04

per minute and capping that at some rate

4:07

that we predetermine based on the heat

4:09

load that it's rejecting.
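
Note on the flow cap described above: sizing flow to a heat load follows from the heat balance Q = m_dot * c_p * delta_T. The following is a minimal sketch, not the site's actual control logic. The 140 kW peak and the roughly 85% cold-plate share come from the tour; the water-like fluid properties and the 10 K temperature rise are assumptions, and the function name is hypothetical.

```python
# Assumed fluid properties and temperature rise; none of these are stated in the tour.
FLUID_DENSITY_KG_PER_L = 1.0     # assumption: roughly water
FLUID_CP_J_PER_KG_K = 4186.0     # assumption: specific heat of water
ASSUMED_DELTA_T_K = 10.0         # assumption: supply-to-return temperature rise


def required_flow_lpm(heat_load_w, delta_t_k=ASSUMED_DELTA_T_K,
                      cp=FLUID_CP_J_PER_KG_K, density=FLUID_DENSITY_KG_PER_L):
    """Liters per minute needed to carry heat_load_w at the given temperature rise."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_s = heat_load_w / (cp * delta_t_k)   # Q = m_dot * c_p * delta_T
    return mass_flow_kg_s / density * 60.0            # kg/s -> L/min


rack_peak_w = 140_000       # from the tour: about 140 kW per cabinet at peak
liquid_fraction = 0.85      # from the tour: about 85 to 90% goes to cold plates
flow = required_flow_lpm(rack_peak_w * liquid_fraction)
print(f"{flow:.0f} L/min per cabinet under these assumptions")
```

A smaller temperature rise means proportionally more flow, which is why the cap is set per cabinet from the load it rejects.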

4:10

Um that liquid comes in, it goes to this

4:13

heat exchanger here. So, there's inside

4:14

that heat ex- inside that CDU is a heat

4:17

exchanger that transfers heat between

4:19

this building loop, building cooling

4:21

loop, and a what we call a technical

4:23

water loop inside there which needs

4:25

to be very, very clean and filtered down

4:27

to 25 microns so you don't plug the cold

4:29

plates on the GPUs. You want very good

4:32

efficient heat transfer there.

4:34

Um it's filled with a liquid uh a mix of

4:36

distilled or deionized water and uh

4:39

propylene glycol, 25% of propylene

4:41

glycol. Um that's to inhibit any

4:43

bacteria or algae growth. Um and that

4:46

bacteria could grow in there if you

4:47

don't have the right ratios and plug the

4:50

cold plates and plug with the

4:52

uh the heat exchange between the GPU and

4:54

the cold plate. I don't love the world

4:56

where we have to worry about bacteria

4:58

growing in our servers.

4:59

>> So, you have you have leaks, you have

5:00

bacteria, all these different new things

5:02

to worry about. Um making sure you have

5:04

proper flow between all the devices. Uh

5:07

air-cooled data centers, you know,

5:08

you put the cabinet in and just flood

5:10

the room with air and and and heat and

5:12

and transfer that heat. So. Was there

5:14

just an area underneath that you could

5:16

have used? Raised floors traditionally

5:18

were used to supply air. Uh so,

5:20

there's ways you could supply air. That

5:21

air would come out here. A lot of the

5:23

new solutions are going to overhead

5:25

piping cuz it takes time to build these

5:27

raised floors and it slows down

5:29

projects. So, a lot of the piping

5:31

systems are going overhead now for

5:33

speed of deployment. But, we like

5:35

this here because you see that blue uh

5:37

that blue wire there, that'll sense a

5:39

leak. If one of these connections is

5:41

dripping, it'll touch that rope

5:43

there and send a signal to say that

5:46

there's a leak. So, we're able to

5:47

contain a leak under the floor and

5:49

measure it, whereas overhead it's kind

5:51

of right into your data center.

5:53

So, we have

5:55

4,032 GPUs here in 56 racks. What we try

5:59

to do on the power side is make

6:02

sure we balance our power. You can't

6:04

overload certain areas, so you kind of

6:06

you can see how we have this bus way

6:08

here distributing power and we're very

6:10

conscious about how many racks are on

6:12

each bus.
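
Note on the numbers above: 4,032 GPUs in 56 racks works out to 72 GPUs per rack, and at the roughly 140 kW peak quoted earlier the racks alone approach 8 MW. The sketch below shows that arithmetic plus a simple per-bus load check. The bus count, the bus rating, and the round-robin assignment are illustrative assumptions, not the site's actual layout, and real racks are typically fed redundantly from more than one bus.

```python
GPUS = 4032            # from the tour
RACKS = 56             # from the tour
RACK_PEAK_KW = 140     # from the tour: peak draw per cabinet

gpus_per_rack = GPUS // RACKS
total_peak_mw = RACKS * RACK_PEAK_KW / 1000


def bus_loads_kw(n_racks, n_buses, rack_kw):
    """Spread racks round-robin over buses and return the peak load on each bus."""
    loads = [0] * n_buses
    for rack in range(n_racks):
        loads[rack % n_buses] += rack_kw
    return loads


def overloaded(loads, bus_limit_kw):
    """Indices of buses whose peak load exceeds the limit."""
    return [i for i, kw in enumerate(loads) if kw > bus_limit_kw]


# Hypothetical layout: 8 buses rated 1,000 kW each (illustrative numbers only).
loads = bus_loads_kw(RACKS, n_buses=8, rack_kw=RACK_PEAK_KW)
print(gpus_per_rack, total_peak_mw, loads, overloaded(loads, bus_limit_kw=1000))
```

With these made-up ratings every bus sits just under its limit, which is the situation where adding one more rack to a bus matters.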

6:13

Make sure you don't go over amperage and

6:14

trip a breaker and you could be in the

6:16

middle of a training run and overload a

6:18

breaker and you'd have to go back to

6:20

some checkpoint. What is the hourly

6:22

price

6:24

on a Blackwell rack? There's two ways

6:25

of pricing it, right? There's how much

6:27

does the hardware itself cost and the

6:29

power and all of that and then there's

6:31

the opportunity cost.

6:32

>> That's right. That's right. And we

6:33

actually think about opportunity costs

6:34

when we think about all of this compute

6:36

stuff very intensively and like

6:39

because we're in a world where compute

6:41

is relatively inelastic, you end up in

6:43

places where there's a real crunch even

6:45

internally where it's like, "Oh, it's

6:46

you know, it takes time to get new

6:49

compute online and available." And you

6:51

can get to a case where it's like people

6:53

are all kind of bidding for the same

6:55

compute and you're like, "Wow, it's

6:56

become incredibly expensive." Because

6:59

the stuff that we get out of this is

7:00

super valuable to the business. And so

7:02

the opportunity cost tends to dominate

7:04

the hardware cost.

7:05

Even though the hardware costs are not

7:06

small.

7:07

>> [laughter]

7:08

>> Yeah, so that's the other question. If

7:09

this facility is connected to the grid

7:13

and you presumably had asked the grid

7:14

beforehand for a certain amount of power

7:18

and now you move to much denser compute,

7:21

how are you and this is still like using

7:24

the grid it's not behind the meter. How

7:25

are you able to get the power in here?

7:26

So, because as the compute is moving

7:28

denser, what is what you know, if we

7:29

have some power allocated from the

7:31

utility,

7:32

we end up just using

7:34

less space within the data hall, right?

7:36

So, you know, like everything just gets

7:37

smaller and you can see in this data

7:39

hall it's got a lot of space that we

7:41

don't need.

7:42

You know, so you're trying to respect

7:43

that power capacity you have from

7:45

the utility whether it's utility or

7:47

behind the meter, you still have to

7:48

respect that overall value, but you want

7:50

to ride as close to it as possible.

7:52

That's why you can afford to put a

7:53

podcast studio in this place.

7:56

Although there are reasons to want

7:58

higher density like the networking

8:00

setups themselves are incredibly

8:02

complicated and like I don't know one

8:04

thing I always feel when I go into one

8:05

of our data centers is like what a

8:07

beautiful job people have done getting

8:08

the wiring right. That's actually quite

8:10

hard and the more you have stuff splayed

8:13

out the harder it is to get all the

8:14

wiring done and it's also worth noting

8:16

like most of the wires you see here out

8:18

of the cages are fiber, but the stuff

8:21

inside, the fastest stuff, is all

8:25

copper. Light moves more slowly

8:27

in fiber than electrons move in copper.

8:30

So you really are at many different

8:31

levels of optimizing for the latency of

8:33

all of these networking rigs. There's

8:35

about 8,000 km of fiber in this

8:37

deployment.

8:39

>> [music]

8:41

>> And there you go. This is what happens

8:43

when you go really dense.

8:45

But you do have enough power to

8:47

fill everything there out?

8:48

>> So it depends how we move power around,

8:50

you know, one of the ideas of

8:52

being flexible and fungible with our

8:53

power and cooling is that we overbuild

8:55

our distribution. So while we're limited

8:57

on the upper end,

8:59

you're able to move power around by

9:01

loading up different rows, right? So we

9:03

have these UPS's that supply power to

9:05

our site, but when we distribute that

9:07

power out if we can move it around Yeah.

9:10

it allows us to say hey we're going to

9:12

you know, we're going to grow some CPU

9:13

here or we're going to grow some GPU

9:15

here. So you know, we have some headroom

9:17

in there to do other things and this is

9:19

just opportunity areas for us to do

9:20

that. We want to be ready to go if the

9:22

business needs some more compute where

9:24

we have a place to put it. So you do

9:26

some amount of pre-building for future

9:29

you know, future opportunities that come

9:31

up.

9:32

So what are these things? So these are

9:34

breaker panels. These distribute

9:36

out to those bus bars that you've seen,

9:37

so

9:38

this is where, you know, power comes in

9:40

and you're going to have to break that

9:41

power out to go in different paths.

9:43

So, there's a lot of distribution, you

9:44

can see all this overhead conduit that

9:46

we had to put in. This is all carrying

9:48

power cables out to the data hall. So,

9:50

you're very conscious when you're

9:51

distributing power. It's less

9:53

fungible than cooling. With

9:55

the cooling, you can kind of oversize

9:56

the pipes and move it around. With

9:58

power, you have breakers and current

10:00

limits. So, you're very you have to be

10:02

very careful with what you load up

10:04

and where and making sure that you don't

10:06

end up in a situation where you're

10:07

tripping a breaker. We have protections

10:10

in place, so if we trip one of those

10:12

four buses, we're still good, so there's

10:14

some redundancy,

10:15

but it's still an interruption,

10:17

something we want to avoid.

10:18

>> What would cause you to trip a breaker?

10:20

Just high current. So, adding too much

10:22

load on a single busway or single

10:25

connection, or pushing too far in the

10:28

oversubscription and saying, "Oh, well,

10:30

you know, we think we're going to be

10:31

oversubscribed by 10% and actually

10:32

things kind of shoot up to 50% or 20%."

10:36

There are times when you do have

10:37

time to respond,

10:39

but if you're too far over the

10:41

limit, the breaker's going to trip and

10:42

you're going to And is the

10:44

current pattern determined by

10:46

software or by

10:47

hardware alone? It's by

10:49

hardware. It's controlled somewhat by

10:51

software. I think

10:52

So, what Nvidia is doing, they have this

10:54

kind of LPS system, it's a load

10:57

management system that they're rolling

10:58

out in some of their new cabinets.

10:59

Really, what they want to do is keep

11:01

that load profile flat. They're building

11:02

in more bulk capacitance in those power

11:05

cells, those capacitors in there, and

11:06

they're also trying to get the software

11:08

to allow it the

11:09

peak load and the average load to be

11:11

much tighter, so you have this flat

11:13

profile. A place where software comes in

11:14

is monitoring. We actually put a huge

11:16

amount of effort into building our own

11:17

monitoring tools so that we can in one

11:20

pane of glass, as they like to say, like

11:21

we can monitor every aspect of the

11:23

system and look for problems and

11:25

sometimes even drive reactions to the

11:27

system where like sometimes we might

11:28

want to shut off a workload if it is

11:31

drawing too much power. And having like

11:33

a unified system that can see all the

11:35

things has been a huge kind of step up

11:38

in being able to run these things in a

11:39

highly reliable way.

11:40

>> So, that software system Ron

11:41

mentioned is actually pulling

11:43

information from these breakers and

11:45

performing some logic behind the scenes.

11:47

It's topology-aware. And what it

11:49

will do, like Ron said, it can shut down

11:51

nodes, so we don't trip that. Because

11:52

you want to run as close to the edge as

11:53

you can, cuz like the hardware is

11:55

incredibly valuable. So, you do want to

11:56

oversubscribe. You do want to kind of

11:58

run near that edge, but you also need to

12:00

be safe, and so you build in these

12:01

safeguards to let you pull back in a

12:03

controlled way. If somebody

12:05

switched one of these switches, would

12:06

some training workload stop right

12:08

now? Yeah, I mean, this site's been

12:11

>> [laughter]

12:12

>> The site is currently live, so yes.
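
Note on the topology-aware shutdown described above: the idea is that monitoring software knows which breaker feeds which nodes and can shed load before a breaker trips. The sketch below is one simple way to express that, not Jane Street's actual tooling. The `Node` type, the priority policy, the 95% margin, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    breaker: str      # which breaker feeds this node (the "topology")
    draw_kw: float    # current power draw
    priority: int     # hypothetical policy: lower number is shed first


def nodes_to_shed(nodes, breaker_limits_kw, margin=0.95):
    """Pick nodes to shut down so each breaker's load is at most margin * its limit."""
    shed = []
    for breaker, limit_kw in breaker_limits_kw.items():
        fed = sorted((n for n in nodes if n.breaker == breaker),
                     key=lambda n: n.priority)
        load = sum(n.draw_kw for n in fed)
        target = margin * limit_kw
        for node in fed:
            if load <= target:
                break                 # this breaker is back inside its margin
            shed.append(node.name)
            load -= node.draw_kw
    return shed


# Illustrative readings only.
nodes = [
    Node("a1", "bus-A", 120.0, priority=2),
    Node("a2", "bus-A", 130.0, priority=1),
    Node("b1", "bus-B", 100.0, priority=1),
]
limits = {"bus-A": 200.0, "bus-B": 200.0}
print(nodes_to_shed(nodes, limits))
```

Shedding by breaker rather than site-wide is what lets the rest of the hall keep running near its limit while one overloaded path is pulled back.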

12:16

>> [music]

12:19

>> So, here you can see a little bit of the

12:20

scale of our liquid cooling. So,

12:22

you know, these are what we call buffer

12:23

tanks. So, they're here to help with

12:25

a situation where maybe there's an

12:27

interruption in power and the chillers

12:29

on the roof restart and lose cooling.

12:31

This is almost like a thermal battery

12:34

storing some energy for us to keep those

12:36

GPUs cool while the chillers come back

12:38

online. Also, as these workloads come up

12:41

and down, you're going to have a kind of

12:43

a movement of temperature. So, this

12:44

helps dampen that effect out. These

12:46

are the traditional air cooling units

12:48

that you'll see in almost every air

12:49

cooling data center. Really just pulling

12:52

that hot air back and supplying it back

12:53

into the data hall. Those wheels up

12:55

there are valves. So, you mentioned

12:57

leaks, right? What happens if there's a

12:58

leak? Well, you have places where you

13:00

can isolate the system to kind of work

13:02

on it, to fix a leak or something. These

13:05

orange things just have more chain in

13:07

them. So, they're not hitting you in the

13:09

head. Now, you can reach them with a

13:10

ladder, take more chain out, and turn

13:13

the valve closed or open. That's great.

13:15

Such a simple solution. You know, it's

13:17

interesting as these data centers

13:19

and the compute get so dense, where

13:21

the place where you have the compute is

13:23

getting smaller and smaller, and the

13:25

places where you're supporting that

13:27

compute is getting larger and larger.

13:29

The infrastructure, the transformers,

13:31

the chillers are getting bigger and

13:33

bigger. So, the sites are very much

13:35

this much compute and this much

13:36

infrastructure to support that compute.

13:38

What did the analogous system to this 20

13:40

years back look like? So, yeah, it was

13:41

like dramatically more primitive. Like

13:44

early on we literally just had like

13:46

computers like sitting out with us in

13:48

the room with all the people. Like the

13:50

one of our computer clusters we called

13:52

the hive and I remember the first

13:53

version of the hive was literally like

13:55

six Dell boxes stacked on top of each

13:57

other like at the end of, you know, the

13:59

row. That's why they're called hive

14:00

bucks. Yes, that's right. Yeah, like

14:03

hive was the name like a very like

14:05

actually the first work that I

14:07

did at Jane Street was doing like

14:08

quantitative research on trading

14:09

strategies and I was like, oh,

14:12

yeah, I guess we need like a cluster and

14:13

like that pile of six Dell boxes was our

14:15

first cluster. And the trading systems

14:18

themselves we also had there and that

14:21

actually was like

14:22

more important to have out with us and

14:24

it took us time to convince ourselves

14:26

okay to like actually go and put it in

14:28

another room in a rack because we

14:30

actually wanted the ability to make sure

14:32

we could turn the damn thing off. Like

14:34

if something went wrong just the comfort

14:35

of like I could unplug it was there.

14:38

And, you know, it took time to convince

14:40

people that we had enough control over

14:41

the systems and we understood enough

14:43

about where things were and just would

14:44

be able to find them if we had to go in

14:46

there that things were cleanly labeled

14:47

enough

14:48

>> Yeah. that we were comfortable taking

14:49

these things and moving them in the

14:50

back. Yeah. I mean, there were ups and

14:52

downs. Like literally at some point, you

14:54

know, one of the people who was like

14:55

cleaning the office like unplugged one

14:58

of the trading systems in the middle of

14:59

the day like as they were vacuuming. So,

15:02

you know, in the end it is in fact

15:03

better to have it all in a data center,

15:05

but I don't know, early on it was more

15:06

of a shoestring operation and we were

15:07

just kind of figuring things out.

15:09

>> You did not need to be even back then

15:10

co-located with the exchanges?

15:13

So, an important thing to understand

15:14

about our early uh trading is like we

15:17

were super not fast, right? Like there

15:20

there

15:21

trading like latencies matter at lots of

15:23

different orders of magnitude and like

15:25

sometimes it matters whether you're

15:26

responding in you know, seconds or

15:29

milliseconds, sometimes microseconds.

15:32

Like these days the very fastest systems

15:34

we care about, you're talking about

15:36

whether you can turn around a packet in

15:37

under 100 nanoseconds.
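
Note on the 100 nanosecond figure above: a quick way to get a feel for that budget is to ask how far a signal can travel in that time. This is a minimal sketch. The speed of light in vacuum is exact by definition; the refractive index of 1.47 is an assumed typical value for silica fiber and does not come from the tour.

```python
C_VACUUM_M_PER_S = 299_792_458     # speed of light in vacuum (exact by definition)
FIBER_REFRACTIVE_INDEX = 1.47      # assumption: typical silica fiber


def distance_m(time_s, refractive_index=1.0):
    """Distance a signal covers in time_s in a medium with the given refractive index."""
    return C_VACUUM_M_PER_S / refractive_index * time_s


budget_s = 100e-9  # from the tour: turning a packet around in under 100 ns
print(f"vacuum: {distance_m(budget_s):.1f} m")
print(f"fiber:  {distance_m(budget_s, FIBER_REFRACTIVE_INDEX):.1f} m")
```

At this scale a few extra meters of cable is a measurable share of the whole budget, which is the reason physical layout and cable choice show up in latency discussions.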

15:38

>> Yeah. Okay, definitely want to ask you

15:39

more about that when we get to the

15:40

podcast studio. Yeah, interesting.

Interactive Summary

This video features a tour of a Jane Street training data center in Texas, where experts discuss the transition to high-density, liquid-cooled computing infrastructure. The team explains the technical challenges of retrofitting an air-cooled facility for liquid-cooled GPU racks, the importance of power management and monitoring, and the evolution of their hardware operations from early "shoestring" setups to sophisticated, high-performance data centers.
