Dwarkesh Goes Inside Jane Street's Latest AI Data Center
Today I'm getting a tour of one of
Jane Street's training data centers here
in Texas from Ron Minsky, who
co-heads a technology group, and Daniel
Pontecorvo, who heads the physical
engineering team. Thanks for coming out
here to do this. Thank you. Well, let's
get started.
So here's where we have our
training cluster of GB300 NVL72s.
>> Are you allowed to say what is currently
happening on this cluster?
>> What actual training jobs are running
right now?
>> At a high level, we train a bunch of
different kinds of models. Some of them
are LLMs. Some of them are all sorts of
custom architectures that are more
adapted to the trading problems and the
trading data sets. Earlier you were
explaining to me
that this was originally not a facility
built to handle 200 kilowatt
racks. And so you had to retrofit it to
be liquid cooled and everything. You
know, these cabinets, these GB300
cabinets, consume at peak about 140 kW
each. Compare that to traditional
air-cooled cabinets, where you're talking
about 10 to 40 kW. It's a lot more. So on
the perimeter
here you can see some of the air
cooling equipment that would
have fed traditional air cooling. Some
of it remains; we use some of it for the
air-cooled load that we have. About
15% of these cabinets are air-cooled. In
here, inside the GPU sled, you can see how
the cooling flows from these
quick disconnects on the back,
taking fluid in, routing it to these
cold plates that are sitting on top of
the GPU, and then coming back out at a
warmer temperature.
So when you slide this sled in, it
automatically connects to liquid supply
and return and 54-volt power, right? So
these just slide in, with
power and cooling all within the sled.
There are still some components in here
that are air-cooled, but about 85 to 90%
of the heat load is cooled via those
cold plates. To what degree have we had
concerns about leaks, right? I feel like
we spent, you know, decades worrying
about not having water in our data
centers and now we're putting it in
on purpose.
How big of a deal is that? So, inside
they have, underneath,
these ropes set to detect leaks. So,
there's a management side of
this server that will send out an
alert if it senses a leak in
there. Furthermore, there's leak
detection underneath the floor. If it
drips and falls under the floor, we'll
be able to sense it there and isolate
it via a valve. But it is true that if
something here fails, you are at risk of
destroying the server. How often is
there a leak? Not often, but this stuff
is new. You know, it's yet to be seen
over time how this works out.
So, I guess it's surprising to me that
it was not a problem to
get liquid cooling going in a facility
that was originally built for air
cooling and lower power densities.
Why did it work?
>> How do you define problem? I mean, it
was an engineering challenge.
>> I feel like you're hearing the version
of the story after all the problems have
been worked out. These guys have spent a
huge amount of effort figuring it out.
This place was built at a kind of
intermediate point where we knew we had
to scale up a lot, but we didn't know
what the shape of the coming compute
was. And so, I think one of the things
that the guys here did really well was
like take to heart the importance of
optionality of like, "Oh, yeah, there's
a lot of different futures and we need
to build some stuff that will work for
multiple different ones." And I think
that worked really well, but it required
a lot of hard thinking and planning to
make it land well.
You know, one thing I'll say is there are
a couple of ways to do liquid cooling.
So, we have fluid coming from
the chillers on the roof down here, at
maybe about 18 °C.
>> We use the same fluid for the air
cooling, so it's fungible within the
data center. We can move the fluid
around. Oh, that's nice.
So, what we do is we send this in. This
device here makes sure that every single
cabinet has the right amount of flow.
You don't want the ones at the beginning
of the row to receive too much flow and
the ones at the end to be starved. So,
these devices, these valves,
control how much fluid goes to each
cabinet so that they're balanced.
How do they measure whether they
have the right amount of fluid?
>> There's an ultrasonic flow meter here.
So, it's ultrasonically measuring the
flow in liters
per minute and capping that at some rate
that we predetermine based on the heat
load that it's rejecting.
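As a rough illustration of how a per-cabinet flow cap could be derived from heat load, the sketch below applies the standard heat balance Q = m * cp * dT. The 140 kW peak and the 85% liquid-cooled share come from the tour. The 10 K coolant temperature rise and the plain-water density and specific heat are assumptions (a 25% propylene glycol mix has somewhat different properties), and `required_flow_lpm` is a hypothetical helper name, not anything Jane Street described.

```python
def required_flow_lpm(heat_w, delta_t_k, density_kg_per_m3, cp_j_per_kg_k):
    """Volumetric flow in L/min needed to carry heat_w watts away
    when the coolant warms by delta_t_k kelvin across the load."""
    mass_flow_kg_per_s = heat_w / (cp_j_per_kg_k * delta_t_k)
    volume_flow_m3_per_s = mass_flow_kg_per_s / density_kg_per_m3
    return volume_flow_m3_per_s * 1000.0 * 60.0  # m^3/s -> L/min


RACK_PEAK_W = 140_000      # peak per cabinet, from the tour
LIQUID_FRACTION = 0.85     # lower end of the 85 to 90% quoted in the tour
DELTA_T_K = 10.0           # ASSUMPTION: coolant temperature rise
DENSITY_KG_PER_M3 = 1000.0  # ASSUMPTION: plain water, not the glycol mix
CP_J_PER_KG_K = 4186.0      # ASSUMPTION: plain water, not the glycol mix

liquid_heat_w = RACK_PEAK_W * LIQUID_FRACTION
flow_lpm = required_flow_lpm(
    liquid_heat_w, DELTA_T_K, DENSITY_KG_PER_M3, CP_J_PER_KG_K
)
print(f"Liquid-cooled heat per cabinet: {liquid_heat_w / 1000:.0f} kW")
print(f"Required flow per cabinet: {flow_lpm:.0f} L/min")
```

Under these assumptions the result is on the order of 170 L/min per cabinet. Halving the allowed temperature rise would double the required flow, which is one reason the cap is tied to the heat load being rejected.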
That liquid comes in and goes to this
heat exchanger here. So, inside
that CDU is a heat
exchanger that transfers heat between
this building cooling
loop and what we call a technical
water loop inside there, which needs
to be very, very clean and filtered down
to 25 microns so you don't plug the cold
plates on the GPUs. You want very
efficient heat transfer there.
It's filled with a mix of
distilled or deionized water and
propylene glycol, 25% propylene
glycol. That's to inhibit any
bacteria or algae growth. Bacteria
could grow in there if you
don't have the right ratios, and plug the
cold plates and interfere with
the heat exchange between the GPU and
the cold plate. I don't love the world
where we have to worry about bacteria
growing in our servers.
>> So, you have leaks, you have
bacteria, all these different new things
to worry about, like making sure you have
proper flow between all the devices. With
air-cooled data centers, you know,
you put the cabinet in and just flood
the room with air and
transfer that heat. Was there
just an area underneath that you could
just an area underneath that you could
have used? Raised floors traditionally
were used to supply air. So,
there are ways you could supply air. That
air would come out here. A lot of
the new solutions are going to overhead
piping because it takes time to build these
raised floors and it slows down
projects. So, a lot of the piping
is going overhead now for
speed of deployment. But we like
this here because you see
that blue wire there? That'll sense a
leak. If one of these connections is
dripping, it'll touch that rope
there and send a signal to say that
there's a leak. So, we're able to
contain a leak under the floor
and measure it, whereas overhead it's kind
of right into your data center.
So, we have
4,032 GPUs here in 56 racks. What we try
to do on the power side is make
sure we balance our power. You can't
overload certain areas, so
you can see how we have this busway
here distributing power, and we're very
conscious about how many racks are on
each bus.
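The numbers quoted so far can be checked with a little arithmetic. The sketch below uses the GPU count, rack count, and per-cabinet peak from the tour. The busway capacity is a purely hypothetical figure chosen for illustration, since the tour does not state one.

```python
TOTAL_GPUS = 4032          # from the tour
RACKS = 56                 # from the tour
RACK_PEAK_KW = 140         # peak per cabinet, quoted earlier in the tour
BUSWAY_CAPACITY_KW = 1000  # ASSUMPTION: hypothetical usable capacity per busway

# GPUs per rack, and aggregate peak draw if every rack peaked at once.
gpus_per_rack = TOTAL_GPUS // RACKS
total_peak_mw = RACKS * RACK_PEAK_KW / 1000

# How many racks fit on one busway without exceeding the assumed capacity.
max_racks_per_busway = BUSWAY_CAPACITY_KW // RACK_PEAK_KW

print(f"GPUs per rack: {gpus_per_rack}")
print(f"Aggregate peak: {total_peak_mw:.2f} MW")
print(f"Racks per busway at the assumed capacity: {max_racks_per_busway}")
```

The 72 GPUs per rack matches the NVL72 configuration, and the aggregate peak comes out a little under 8 MW for the GPU racks alone.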
You make sure you don't go over amperage
and trip a breaker. You could be in the
middle of a training run, overload a
breaker, and have to go back to
some checkpoint. What is the hourly
price
on a Blackwell rack? There are two ways
of pricing it, right? There's how much
does the hardware itself cost and the
power and all of that and then there's
the opportunity cost.
>> That's right. And we
actually think about opportunity costs
very intensively when we think about all
of this compute stuff,
because we're in a world where compute
is relatively inelastic. You end up in
places where there's a real crunch, even
internally, where it's like, "Oh,
you know, it takes time to get new
compute online and available." And you
can get to a case where it's like people
are all kind of bidding for the same
compute and you're like, "Wow, it's
become incredibly expensive." Because
the stuff that we get out of this is
super valuable to the business. And so
the opportunity cost tends to dominate
the hardware cost.
Even though the hardware costs are not
small.
>> [laughter]
>> Yeah, so that's the other question. If
this facility is connected to the grid
and you presumably had asked the grid
beforehand for a certain amount of power
and now you move to much denser compute,
and this is still using
the grid, it's not behind the meter, how
are you able to get the power in here?
So, as the compute is moving
denser, you know, if we
have some power allocated from the
utility,
we end up just using
less space within the data hall, right?
So, you know, like everything just gets
smaller and you can see in this data
hall it's got a lot of space that we
don't need.
You know, so you're trying to respect
that power capacity you have from
the utility. Whether it's utility or
behind the meter, you still have to
respect that overall value, but you want
to ride as close to it as possible.
That's why you can afford to put a
podcast studio in this place.
Although there are reasons to want
higher density like the networking
setups themselves are incredibly
complicated and like I don't know one
thing I always feel when I go into one
of our data centers is like what a
beautiful job people have done getting
the wiring right. That's actually quite
hard and the more you have stuff splayed
out the harder it is to get all the
wiring done. And it's also worth noting
that most of the wires you see here out
of the cages are fiber, but the stuff
inside, the fastest stuff,
is all copper. Light moves more slowly
in fiber than electrical signals move in
copper. So you really are, at many
different levels, optimizing for the
latency of all of these networking rigs.
There are about 8,000 km of fiber in this
deployment.
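To put the fiber-versus-copper point in numbers, the sketch below computes one-way propagation delay from cable length and signal velocity. The silica fiber refractive index of about 1.468 is a standard figure. The copper velocity fraction of 0.75 is an assumption, since it varies with cable construction, and the cable lengths are illustrative only.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def propagation_delay_ns(length_m, velocity_fraction):
    """One-way propagation delay in nanoseconds for a signal travelling
    length_m metres at velocity_fraction times the speed of light."""
    return length_m / (velocity_fraction * SPEED_OF_LIGHT_M_PER_S) * 1e9


FIBER_VELOCITY_FRACTION = 1 / 1.468  # light in silica fiber, n of about 1.468
COPPER_VELOCITY_FRACTION = 0.75      # ASSUMPTION: depends on the cable's dielectric

for length_m in (1, 2, 30):          # illustrative cable lengths in metres
    fiber_ns = propagation_delay_ns(length_m, FIBER_VELOCITY_FRACTION)
    copper_ns = propagation_delay_ns(length_m, COPPER_VELOCITY_FRACTION)
    print(f"{length_m:>3} m  fiber {fiber_ns:7.2f} ns  copper {copper_ns:7.2f} ns")
```

With these assumed values the gap is well under a nanosecond per metre, so this covers propagation only and leaves out any delay added by optical transceivers at each end of a fiber link.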
>> And there you go. This is what happens
when you go really dense.
But you do have enough power to
fill everything out there?
>> So it depends how we move power around.
You know, one of the ideas of
being flexible and fungible with our
power and cooling is that we overbuild
our distribution. So while we're limited
on the upper end,
we're able to move power around by
loading up different rows, right? So we
have these UPSs that supply power to
our site, but when we distribute that
power out, if we can move it around,
it allows us to say, hey, we're going to
grow some CPU
here or we're going to grow some GPU
here. So, you know, we have some headroom
in there to do other things, and these
are just opportunity areas for us to do
that. We want to be ready to go if the
business needs some more compute, so
we have a place to put it. So you do
some amount of pre-building for
future opportunities that come
up.
So what are these things? So these are
breaker panels. These distribute
out to those bus bars that you've seen,
so
this is where, you know, power comes in
and you have to break that
power out to go in different paths.
So, there's a lot of distribution, you
can see all this overhead conduit that
we had to put in. This is all carrying
power cables out to the data hall. So,
you're very conscious when you're
distributing power. It's less
fungible than cooling. With
the cooling, you can kind of oversize
the pipes and move it around. With
power, you have breakers and current
limits. So, you have to be
very careful with what you load up
and where, making sure that you don't
end up in a situation where you're
tripping a breaker. We have protections
in place, so if we trip one of those
four buses, we're still good, so there's
some redundancy,
but it's still an interruption,
something we want to avoid.
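A minimal sketch of the kind of check this implies: sum the rack loads on one busway, convert to line current for a balanced three-phase feed with I = P / (sqrt(3) * V * PF), and compare against the breaker limit. The 415 V line voltage, the 1,200 A breaker rating, and the 80% continuous-load derating are all assumptions for illustration. None of them are stated in the tour, and `busway_status` is a hypothetical helper.

```python
import math


def line_current_a(power_w, line_voltage_v, power_factor=1.0):
    """Line current in amps for a balanced three-phase load."""
    return power_w / (math.sqrt(3) * line_voltage_v * power_factor)


def busway_status(rack_loads_w, breaker_rating_a, line_voltage_v,
                  continuous_derate=0.8):
    """Return (amps, limit_amps, ok) for the racks fed by one busway."""
    amps = line_current_a(sum(rack_loads_w), line_voltage_v)
    limit_amps = breaker_rating_a * continuous_derate
    return amps, limit_amps, amps <= limit_amps


RACK_PEAK_W = 140_000       # peak per cabinet, from the tour
LINE_VOLTAGE_V = 415        # ASSUMPTION: three-phase line-to-line voltage
BREAKER_RATING_A = 1200     # ASSUMPTION: hypothetical breaker rating

for rack_count in (4, 5):
    amps, limit_amps, ok = busway_status(
        [RACK_PEAK_W] * rack_count, BREAKER_RATING_A, LINE_VOLTAGE_V
    )
    verdict = "ok" if ok else "over limit"
    print(f"{rack_count} racks: {amps:.0f} A of {limit_amps:.0f} A allowed, {verdict}")
```

With these assumed numbers, four racks at peak fit under the limit and a fifth pushes the busway over it, which is the sort of margin that makes the number of racks per bus a planning decision.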
>> What would cause you to trip a breaker?
Just high current. So, adding too much
load on a single busway or single
connection, or pushing too far on the
oversubscription and saying, "Oh, well,
you know, we think we're going to be
oversubscribed by 10%," and actually
things kind of shoot up to 50% or 20%.
There are times when you do have
time to respond,
but if you're too far over the
limit, the breaker's going to trip and
you're going to...
>> And is
the current pattern determined by
software or
by hardware alone?
>> It's by
hardware. It's controlled somewhat by
software. I think
So, what Nvidia is doing, they have this
kind of LPS system, it's a load
management system that they're rolling
out in some of their new cabinets.
Really, what they want to do is keep
that load profile flat. They're building
in more bulk capacitance in those power
shelves, those capacitors in there, and
they're also trying to get the software
to allow
the peak load and the average load to be
much closer, so you have this flat
profile. A place where software comes in
is monitoring. We actually put a huge
amount of effort into building our own
monitoring tools so that in one
pane of glass, as they like to say,
we can monitor every aspect of the
system and look for problems and
sometimes even drive reactions in the
system, where sometimes we might
want to shut off a workload if it is
drawing too much power. And having
a unified system that can see all the
things has been a huge step up
in being able to run these things in a
highly reliable way.
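A toy sketch of power-driven load shedding along these lines: given each node's measured draw, which breaker feeds it, and each breaker's limit, pick nodes to shut down until every breaker is back under its limit. The data layout, the largest-draw-first policy, and the name `nodes_to_shed` are all assumptions for illustration. The tour does not describe how Jane Street's tooling actually decides.

```python
from collections import defaultdict


def nodes_to_shed(node_power_w, node_breaker, breaker_limit_w):
    """Return a sorted list of nodes to shut down so that the summed draw
    on every breaker is at or below its limit.
    Policy (an assumption): shed the largest consumers first."""
    by_breaker = defaultdict(list)
    for node, power in node_power_w.items():
        by_breaker[node_breaker[node]].append((power, node))

    shed = []
    for breaker, entries in by_breaker.items():
        load = sum(power for power, _ in entries)
        for power, node in sorted(entries, reverse=True):
            if load <= breaker_limit_w[breaker]:
                break
            shed.append(node)
            load -= power
    return sorted(shed)


# Hypothetical readings: breaker "X" is over its limit, breaker "Y" is not.
node_power_w = {"a": 50_000, "b": 40_000, "c": 30_000, "d": 20_000}
node_breaker = {"a": "X", "b": "X", "c": "X", "d": "Y"}
breaker_limit_w = {"X": 100_000, "Y": 100_000}

print(nodes_to_shed(node_power_w, node_breaker, breaker_limit_w))
```

A real system would presumably weigh workload priority and checkpoint state rather than draw alone. The point here is only that the decision needs the node-to-breaker mapping as well as the power readings.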
>> So, that software system Ron
mentioned is actually pulling
information from these breakers and
performing some logic behind the scenes.
It's topology aware. And what it
will do, like Ron said, is shut down
nodes so we don't trip a breaker. Because
you want to run as close to the edge as
you can, cuz like the hardware is
incredibly valuable. So, you do want to
oversubscribe. You do want to kind of
run near that edge, but you also need to
be safe, and so you build in these
safeguards to let you pull back in a
controlled way. If somebody
switch one of these switches, would
there some training workload stop right
now? Yeah, I mean, this site's been been
>> [laughter]
>> The site is currently live, so yes.
>> So, here you can see a little bit of the
scale of our liquid cooling. So,
you know, these are what we call buffer
tanks. They're here to help with
a situation where maybe there's an
interruption in power and the chillers
on the roof restart and lose cooling.
This is almost like a thermal battery
storing some energy for us to keep those
GPUs cool while the chillers come back
online. Also, as these workloads come up
and down, you're going to have kind of
a movement of temperature. So, this
helps dampen that effect out. These
are the traditional air cooling units
that you'll see in almost every air
cooling data center. Really just pulling
that hot air back and supplying it back
into the data hall. Those wheels up
there are valves. So, you mentioned
leaks, right? What happens if there's a
leak? Well, you have places where you
can isolate the system to kind of work
on it, to fix a leak or something. These
orange things just have more chain in
them. So, they're not hitting you in the
head. Now, you can reach them with a
ladder, take more chain out, and turn
the valve closed or open. That's great.
Such a simple solution. You know,
it's interesting: as these data centers
and the compute get so dense,
the place where you have the compute is
getting smaller and smaller, and the
places where you're supporting that
compute are getting larger and larger.
The infrastructure, the transformers,
the chillers are getting bigger and
bigger. So, the sites are very much
this much compute and this much
infrastructure to support that compute.
What did the analogous system to this 20
years back look like? So, yeah, it was
dramatically more primitive.
Early on we literally just had
computers sitting out with us in
the room with all the people.
One of our computer clusters we called
the hive, and I remember the first
version of the hive was literally
six Dell boxes stacked on top of each
other at the end of, you know, the
row. That's why they're called hive
bucks. Yes, that's right. Yeah,
hive was the name. Actually,
the first work that I
did at Jane Street was doing
quantitative research on trading
strategies and I was like, oh,
yeah, I guess we need a cluster, and
that pile of six Dell boxes was our
first cluster. And the trading systems
themselves we also had there, and that
actually was
more important to have out with us. And
it took us time to convince ourselves it
was okay to actually go and put it in
another room in a rack, because we
actually wanted the ability to make sure
we could turn the damn thing off. If
something went wrong, just the comfort
of "I could unplug it" was there.
And, you know, it took time to convince
people that we had enough control over
the systems, and we understood enough
about where things were and would
be able to find them if we had to go in
there, that things were cleanly labeled
enough
>> Yeah. that we were comfortable taking
these things and moving them in the
back. I mean, there were ups and
downs. Literally at some point, you
know, one of the people who was
cleaning the office unplugged one
of the trading systems in the middle of
the day as they were vacuuming. So,
you know, in the end it is in fact
better to have it all in a data center,
but I don't know, early on it was more
of a shoestring operation and we were
just kind of figuring things out.
>> Did you not need to be, even back then,
co-located with the exchanges?
So, an important thing to understand
about our early trading is that we
were super not fast, right? In
trading, latencies matter at lots of
different orders of magnitude, and
sometimes it matters whether you're
responding in, you know, seconds or
milliseconds, sometimes microseconds.
These days, for the very fastest systems
we care about, you're talking about
whether you can turn around a packet in
under 100 nanoseconds.
>> Yeah. Okay, definitely want to ask you
more about that when we get to the
podcast studio. Yeah, interesting.
This video features a tour of a Jane Street training data center in Texas, where experts discuss the transition to high-density, liquid-cooled computing infrastructure. The team explains the technical challenges of retrofitting an air-cooled facility for liquid-cooled GPU racks, the importance of power management and monitoring, and the evolution of their hardware operations from early "shoestring" setups to sophisticated, high-performance data centers.