Energy Demand in AI
Frontier models like GPT, Grok, Claude, and Gemini that run in data centers all over the world all need one thing in common: power. To understand the magnitude of this energy demand, we first need to understand what it takes to train one large language model, and then expand our scope to the rest of the industry. So let's start with this one question: what does it cost to train one large language model? Most private companies don't release exact figures on the energy consumption of their AI models, so to understand what goes on when training a large language model, we need to start with a model we do know about, at least by estimation, and that's OpenAI's GPT-4.
GPT-4 is assumed to be a 1.7-trillion-parameter model that was pre-trained on 13 trillion tokens of data, which required around 20 septillion floating-point operations. To train this giant model, OpenAI likely used up to 25,000 A100 GPUs, and the run took about 3 months. Each of these A100 GPUs consumes up to 400 watts of power, and once you stack 25,000 of them, the energy demand starts to add up really fast. Typically, when we have that many GPUs working at the same time, the challenge is how to actually parallelize the training efficiently.
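As a quick sanity check, here's a rough back-of-the-envelope sketch in Python. The per-GPU throughput and the utilization figure are assumptions, not published numbers, but they show why roughly 25,000 A100s running for about three months is a plausible pairing with 20 septillion floating-point operations.

    # Rough sanity check on the training estimate (assumed numbers, not OpenAI's actual figures).
    total_flops = 2e25            # ~20 septillion floating-point operations (estimate)
    num_gpus = 25_000             # assumed A100 count
    peak_flops_per_gpu = 312e12   # A100 dense BF16 peak throughput, ~312 TFLOP/s
    utilization = 0.35            # assumed real-world hardware utilization during training

    effective_flops_per_sec = num_gpus * peak_flops_per_gpu * utilization
    training_days = total_flops / effective_flops_per_sec / 86_400
    print(f"~{training_days:.0f} days of training")  # roughly 85 days, i.e. about 3 months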
So, similar to having 25,000 chefs making one giant meal at once, you need an efficient way to group them so the work ahead gets done as fast as possible. And in the case of large language models, the type of work we need to do is matrix operations, specifically what's called matmul, or matrix multiplication. If you have to multiply two 100,000 by 100,000 matrices, the total number of operations required is around 2 quadrillion floating-point operations, or 2 × 10^15 FLOPs.
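Where that number comes from: multiplying two n by n matrices takes roughly 2n^3 floating-point operations, one multiply and one add for each term of every output entry. A minimal sketch:

    # FLOPs for multiplying two n x n matrices: about 2 * n^3
    # (n multiplies plus n adds for each of the n^2 output entries).
    n = 100_000
    flops = 2 * n**3
    print(f"{flops:.1e} FLOPs")  # 2.0e+15, i.e. ~2 quadrillion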
To do that many mathematical operations, you could use one single GPU, but that would probably take you a very long time. Or maybe you could group multiple GPUs together and make it go faster. The ultimate question is this: if I have 25,000 GPUs available, do I just group all 25,000 of them into one and hit the go button? Typically, in AI training, NVIDIA chips are grouped in eights. So for the A100 GPUs, you can group eight of them into a single hardware topology called an NVIDIA HGX server that looks something like this. One HGX server holding eight A100 GPUs consumes up to 3 to 6 kW. And keep in mind that the HGX A100 is considered older hardware, as the industry has now moved on to the Hopper architecture, like the HGX H100, or even the Blackwell architecture, like the HGX B200, which draws more than 10 kW. But these are newer architectures, so even though they draw more power, you'll need fewer of them, since the chips themselves are faster and more efficient than the A100 GPUs.
In any case, now that we have one NVIDIA HGX server with eight A100 GPUs, instead of running the 100,000 by 100,000 matrix operation on one single GPU, you can parallelize that tensor operation across the eight GPUs installed in one HGX server. The protocol that allows this communication between the eight NVIDIA GPUs is called NVLink. NVLink was introduced back in 2014 to help with this exact scenario: being able to parallelize tensor operations, so that in this case the 100,000 by 100,000 matrix operation can be shared across eight GPUs and run in parallel. And you might be asking, why stop at one HGX server? We're in America, so let's go big, stack up a hundred of these, and go get some bacon.
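To make the idea of sharing one matmul across eight GPUs concrete, here's a minimal NumPy sketch of the column-split flavor of tensor parallelism: each of the eight workers computes its own column block of the output, and the blocks are stitched together at the end. Real frameworks do this on GPUs over NVLink; this is just the arithmetic pattern, shrunk to a small matrix so it runs instantly.

    import numpy as np

    # Toy tensor parallelism: split B into 8 column blocks, so each "GPU"
    # computes its own slice of A @ B independently. (CPU/NumPy stand-in.)
    n, workers = 1_024, 8          # tiny stand-in for the 100,000 x 100,000 case
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    col_blocks = np.array_split(B, workers, axis=1)   # shard the work
    partials = [A @ block for block in col_blocks]    # each worker's share
    C = np.concatenate(partials, axis=1)              # gather the results

    assert np.allclose(C, A @ B)                      # same answer as one big matmul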
While grouping eight GPUs into one hardware topology is made possible by the NVLink protocol, stacking multiple HGX servers together starts to make less sense because of what's called interconnect: as you extend beyond the high-speed mesh inside one server, the interconnect bandwidth and latency between nodes can often lead to diminishing returns. However, one clever way to get around this is to put the focus on the model architecture. Meaning, instead of trying to make the tensor operations more parallel, we can also find parallelism in the model architecture itself.
For example, GPT-4 is thought to have 120 layers of neural network. And just like before, we can parallelize not just at the tensor level but at the architectural level as well. We can divide the neural network into 15 pipeline stages and dedicate one HGX server to each stage, so that its eight GPUs can parallelize the tensor operations within that stage. Simple math tells us that 15 pipeline stages with eight GPUs each comes to 120 GPUs to run one instance of GPT-4. One question you might have here is: why don't we just dedicate one HGX server to each of GPT-4's layers? That's mostly because the layers vary in size, so we don't want to end up underutilizing a dedicated HGX server on a very small layer. Now that we have tensor parallelism and pipeline parallelism done, the question is this: OpenAI had 25,000 GPUs, but we saw that one single instance of GPT-4 can practically run on 120 GPUs. What do we do with the remaining 24,880 GPUs that are sitting idle?
To parallelize across instances of GPT-4, we can add another layer of parallelism, this time on the data: in other words, data parallelism. For training, you can essentially replicate the GPT-4 model hundreds of times, batch-process the 13 trillion tokens of pre-training data across those replicas, and use gradient averaging to keep the weights in sync across all those instances. Since we have around 25,000 GPUs available to train GPT-4, and each training instance of GPT-4 needs around 120 GPUs, simple math shows that we can have up to around 200 replicas of GPT-4. And now we're utilizing all 25,000 GPUs to pre-train the GPT-4 model.
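Putting the three levels of parallelism together, the GPU accounting looks something like this (the 8, 15, and 25,000 figures are the estimates quoted above, not confirmed numbers):

    # GPU accounting across the three levels of parallelism (estimated figures).
    tensor_parallel = 8                                   # GPUs per HGX server, joined by NVLink
    pipeline_stages = 15                                  # pipeline stages covering ~120 layers
    gpus_per_replica = tensor_parallel * pipeline_stages  # 120 GPUs per model instance

    total_gpus = 25_000
    data_parallel_replicas = total_gpus // gpus_per_replica
    print(gpus_per_replica, data_parallel_replicas)       # 120 GPUs per replica, ~208 replicas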
Now that we understand why we need this many GPUs to train one large language model, let's get to the heart of the video, which is energy. Since one HGX server connects eight GPUs, running all 25,000 A100 GPUs takes around 3,125 of these HGX servers, and each HGX server consumes up to around 6.5 kW. We saw earlier that training GPT-4 required 20 septillion floating-point operations and took about 3 months on that infrastructure of 3,125 HGX servers. So we have 6.5 kW × 24 hours × 90 days × 3,125 servers, which comes to around 43,875,000 kilowatt-hours, or about 44 gigawatt-hours, give or take. To put that into perspective, that's roughly what a small city of 50,000 people consumes in a month.
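Written out as code, the same arithmetic looks like this (6.5 kW per server is the upper-end estimate used above):

    # Training energy estimate for the assumed GPT-4 cluster.
    servers = 25_000 // 8        # 3,125 HGX servers, 8 GPUs each
    kw_per_server = 6.5          # upper-end power draw per HGX A100 server (estimate)
    hours = 24 * 90              # ~3 months of training

    training_kwh = kw_per_server * hours * servers
    print(f"{training_kwh:,.0f} kWh = {training_kwh / 1e6:.1f} GWh")  # 43,875,000 kWh, ~44 GWh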
So if you think about it from this chart, each dot here that shows a model could easily represent training energy equivalent to a small city's monthly consumption. And keep in mind that the GPT-4 model is from back in 2023. Since then, we have seen huge improvements in infrastructure, interconnect protocols, and the underlying AI model architectures, and all of these can reduce the energy needed to train a model. But training is only a portion of the total energy an AI model needs. Deploying the trained model to production, where the general public uses it for inference, requires more energy than what we just saw on the training side. So the next natural question is this: how much electricity is needed to actually run these AI models in production? But first, let's talk about jobs, from Woven.
I was looking to hire software developers at my previous company, and one thing I always found was that candidates had different skill sets: some people were really good at code reviews, others at system debugging, and now, with AI, at agentic programming. So coming up with coding evaluations for each role took a lot of time and effort to build scenarios and give feedback, and it just wasn't fun for anyone involved in the process. Woven is a human-powered technical assessment tool that streamlines hiring. So if you're looking to hire engineers, Woven is offering a 14-day free trial with 20% off your first hire. Check the link in the description. Okay, back to the question.
How much energy is needed to run these AI models? OpenAI is projected to have more than 700 million weekly active users on ChatGPT. If we were to use the earlier GPT-4 numbers, the GPT-4 model is estimated to run on a cluster of 128 GPUs. Using the same math we used for training, with 8-GPU tensor parallelism inside one server connected by NVLink and 16-way pipeline parallelism, we find that we can run one instance of GPT-4 in production on 128 GPUs. But that's just one instance of GPT-4. To serve the 700 million active users on ChatGPT, we're going to need a lot more compute, which means a lot more power. OpenAI now receives over 2.5 billion prompts per day, and processing a single query with GPT-4 can take around 0.3 watt-hours, although that depends on the type of work. If we project out how much energy this might need, 2.5 billion prompts at 0.3 watt-hours each brings us to 750 megawatt-hours per day just in energy cost. Compared to the earlier math, where training the GPT-4 model for 90 days cost around 44 gigawatt-hours in total, deploying GPT-4 for 700 million users at 2.5 billion prompts per day and 0.3 watt-hours per prompt comes to a total of around 67 gigawatt-hours for 90 days of serving GPT-4 on ChatGPT.
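The serving-side arithmetic, using the same assumed per-query figure:

    # Inference energy estimate for serving GPT-4 on ChatGPT (assumed figures).
    prompts_per_day = 2.5e9      # reported daily prompt volume
    wh_per_prompt = 0.3          # rough per-query energy estimate, varies by workload

    daily_mwh = prompts_per_day * wh_per_prompt / 1e6
    ninety_day_gwh = daily_mwh * 90 / 1e3
    print(f"{daily_mwh:.0f} MWh/day, {ninety_day_gwh:.1f} GWh per 90 days")  # 750 MWh/day, ~67.5 GWh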
And keep in mind that AI companies typically serve multiple AI models rather than just one. For example, Anthropic serves Claude 4 Sonnet and Claude 4 Opus, as well as backwards-compatible models like Claude 3.7, Claude 3.5, and more. So you can see why the energy needed here compounds really fast, not only in training but, more importantly, in running the models.
Another energy overhead that goes into both training and deploying these models is cooling, and this is measured with a metric called power usage effectiveness, or PUE. Essentially, for every watt used by the computing equipment in a data center, what multiple do you need to account for once cooling and other overhead are included? So for OpenAI, if the GPT-4 cluster had a PUE of, let's say, 1.2, that brings our training energy consumption from 44 gigawatt-hours to 52.8 gigawatt-hours, and the energy consumption for deploying the model goes from 67 gigawatt-hours for 90 days of serving 700 million users to about 80 gigawatt-hours.
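Applying the PUE multiplier to both earlier totals (1.2 is just an illustrative value; real facilities vary):

    # Apply a PUE multiplier to the earlier training and serving estimates.
    pue = 1.2                             # illustrative power usage effectiveness
    training_gwh = 44                     # estimated training energy, 90 days
    serving_gwh = 67                      # estimated serving energy, 90 days

    print(round(training_gwh * pue, 1))   # 52.8 GWh of training energy with cooling overhead
    print(round(serving_gwh * pue, 1))    # 80.4 GWh, i.e. roughly 80 GWh, for 90 days of serving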
Back in 2023, US data centers consumed about 176 terawatt-hours, which represents around 4.4% of the entire US electricity consumption. And what's crazy is that this number is expected to keep growing: some projections for 2030 put it at up to 8 to 10% of all US electricity, though not all of this goes toward AI.
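Just to see what those percentages imply, here's the arithmetic using the figures quoted above. The total US consumption is inferred from the 176 TWh and 4.4% pair and assumed to stay roughly flat, which is a simplification:

    # What the quoted percentages imply for data-center electricity use.
    dc_twh_2023 = 176
    dc_share_2023 = 0.044
    us_total_twh = dc_twh_2023 / dc_share_2023   # ~4,000 TWh implied total US consumption

    for share_2030 in (0.08, 0.10):              # projected 2030 data-center share
        print(f"{share_2030:.0%} -> ~{us_total_twh * share_2030:.0f} TWh")  # ~320 to ~400 TWh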
But whether the projection turns out right or wrong, one thing is for sure: we're going to need a lot more power to support this. xAI, for this reason, bought more than 30 methane gas turbines to power their Colossus facility in Memphis. OpenAI is projected to build out their Stargate facility with a capacity of up to 5 gigawatts, enough to power up to 400,000 GB200 GPUs, which are superior to the H100 GPUs in use today. Meta is building out natural gas power plants, and Google is expanding their hyperscale data centers to continue growing. And it hasn't exactly been easy for these companies: they're fighting for permits against local governments that are pushing back over concerns about straining local infrastructure, which sometimes ends up in lawsuits. For example, xAI is getting complaints for using methane turbines as a source of electricity and circumventing local permits. Meta, instead of relying on existing utilities, is trying to generate electricity themselves. Google is getting pushback from utilities and local governments on their plans to expand hyperscale data centers nationwide.
Meanwhile, China's energy buildout has mostly been centralized at the state level, which allows them to outpace the US's heavily regulated energy supply. Some go as far as saying that China has an oversupply of energy. By 2023, China had installed over 609 gigawatts of solar and 441 gigawatts of wind, and they have 27 reactors under construction to expand their nuclear power. China's energy capacity just seems to grow more and more each day. So we're seeing US energy infrastructure primarily driven by private corporations, while China's is driven by the state, essentially sidestepping the potential bureaucracies that get in the way. Meanwhile, AI competition is getting more and more fierce. Thankfully, energy alone isn't the bottleneck, but it is one critical piece that helps drive innovation in AI. There are other factors that can help reduce the energy cost of training and deploying AI models, like more advanced chips such as the GB200 GPUs, better model architectures and techniques like mixture of experts, quantization, and speculative decoding, and improved training methodologies. But all in all, energy supply needs to grow alongside them and underpin the innovation, which requires more power generation, better grid systems to efficiently transport energy, and better water cooling systems for the chips. All of this plays into determining who will come out on top of the AI race.
This video delves into the significant energy consumption required for training and deploying large language models (LLMs) like GPT-4. It highlights that training a single LLM involves a massive amount of computational power, estimated by the energy needed to train GPT-4, which required approximately 20 septillion floating-point operations and likely used 25,000 A100 GPUs for about 3 months. The video explains the complexities of parallelizing training across thousands of GPUs using technologies like NVLink and discusses different parallelism strategies (tensor, pipeline, and data parallelism). It then shifts to the energy demands of running these models in production, which can exceed training energy costs, especially with billions of daily prompts. For instance, serving GPT-4 to millions of users for 90 days could consume around 67 gigawatt-hours, a figure amplified by the need for cooling systems (PUE). The video contrasts the approaches of US companies (like OpenAI, Meta, and Google) and China in scaling up energy infrastructure to meet AI demands, noting the challenges US companies face with permits and local regulations versus China's centralized state-driven approach. Ultimately, it concludes that while technological advancements can improve efficiency, a substantial increase in energy supply and infrastructure is crucial for continued AI innovation.