Energy Demand in AI

Transcript

0:00

Frontier models like GPT, Grok, Claude, and Gemini that run in data centers all over the world have one thing in common: they need power. To understand the magnitude of that energy demand, we'll look at what it takes to train one large language model and then expand our scope to the rest of the industry. So let's start with this one question: what does it cost to train one large language model? Most private companies don't release exact specs on the energy consumption of their AI models, so to understand what goes into training a large language model, we need to start with a model we do know about, at least by estimation, and that's OpenAI's GPT-4.

0:34

GPT-4 is assumed to be a 1.7-trillion-parameter model that was pre-trained on 13 trillion tokens of data, which required around 20 septillion floating-point operations. To train this giant model, OpenAI likely used up to 25,000 A100 GPUs, and in total the run took about 3 months. Each of these A100 GPUs consumes up to 400 watts of power, and once you stack 25,000 of them, the energy demand starts to add up really fast.
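As a rough sanity check on those figures, here is a minimal Python sketch, not from the video, that estimates the training time implied by the quoted FLOP count. The A100's ~312 TFLOPS BF16 peak is a published spec; the ~35% utilization factor is my own assumption.

```python
# Sanity check: does ~20 septillion FLOPs on 25,000 A100s really take ~3 months?
# Assumed: A100 BF16 dense peak of ~312 TFLOP/s (published spec) and a
# model-FLOPs utilization of ~35% (assumption, typical of large training runs).

TOTAL_FLOPS = 20e24        # ~20 septillion floating-point operations
NUM_GPUS = 25_000          # A100s assumed for the GPT-4 run
PEAK_FLOPS_PER_GPU = 312e12
MFU = 0.35

sustained_flops = NUM_GPUS * PEAK_FLOPS_PER_GPU * MFU  # cluster-wide FLOP/s
days = TOTAL_FLOPS / sustained_flops / 86_400

print(f"Estimated training time: {days:.0f} days")     # ~85 days, i.e. roughly 3 months
```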

1:03

Typically, when we have that many GPUs working at the same time, the challenge is how to parallelize the training efficiently. Similar to having 25,000 chefs making one giant meal at once, you need an efficient way to group them that matches the type of work ahead.

1:17

And in the case of large language models, the type of work we need to do is matrix operations, specifically what's called matmul, or matrix multiplication. If you have to multiply two 100,000-by-100,000 matrices, the total number of operations required is around 2 quadrillion floating-point operations, or 2 × 10^15 FLOPs.
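Where that 2 × 10^15 comes from: multiplying two n-by-n matrices takes roughly 2n³ floating-point operations, since each of the n² output entries needs about n multiplications and n additions. A one-liner to check it:

```python
# FLOPs for multiplying two n x n matrices: ~2 * n**3
# (each of the n*n output entries needs n multiplies and ~n adds).
n = 100_000
print(f"{2 * n**3:.1e} FLOPs")   # 2.0e+15, about 2 quadrillion operations
```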

1:38

To do that many mathematical operations, you could use one single GPU, but that would probably take you a very long time. Maybe you could group multiple GPUs together and make it go faster. But the ultimate question is this: if I have 25,000 GPUs available, do I just group all 25,000 of them into one and hit the go button?

1:57

Typically, in AI training, Nvidia chips are grouped in eights. So for the A100 GPUs, you can group eight of them into a single hardware topology called an Nvidia HGX server, which looks something like this. One HGX server holding eight A100 GPUs consumes up to 3 to 6 kW. And keep in mind that the HGX A100 unit is considered older hardware, as the industry has since moved on to the Hopper architecture, like the HGX H100, or even the Blackwell architecture, like the HGX B200, which draws more than 10 kW. But these are newer architectures, so even though they draw more power, you'll need fewer of them, since the chips themselves are faster and more efficient than the A100 GPUs.

2:37

In any case, now that we have one Nvidia HGX server with eight A100 GPUs, instead of running the 100,000-by-100,000 tensor operation on one single GPU, you can parallelize that tensor operation across the eight GPUs installed in one HGX server. The protocol that allows this communication between the eight Nvidia GPUs is called NVLink. NVLink was introduced back in 2014 as a protocol that helps with exactly this scenario, parallelizing tensor operations, so that in this case the 100,000-by-100,000 matrix operation can be shared across eight GPUs running in parallel.
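To make that idea concrete, here is a minimal NumPy sketch of tensor parallelism, my own illustration rather than anything from the video or NVIDIA's implementation: one matrix multiplication is split column-wise across eight simulated "devices" and the partial results are stitched back together. On real hardware, the shards live on eight GPUs and NVLink/NCCL collectives handle the data movement.

```python
import numpy as np

# Toy tensor parallelism: C = A @ B with B split column-wise across 8 "devices".
# Each device computes its slice of C independently, which is what allows the
# work to run in parallel on the 8 GPUs of one HGX server. Sizes are kept tiny
# so the script runs instantly.
NUM_DEVICES = 8
n = 512                                   # stand-in for the 100,000 in the video

A = np.random.randn(n, n).astype(np.float32)
B = np.random.randn(n, n).astype(np.float32)

B_shards = np.split(B, NUM_DEVICES, axis=1)       # one column block per device
C_shards = [A @ shard for shard in B_shards]      # 8 independent matmuls
C = np.concatenate(C_shards, axis=1)              # gather the column blocks

assert np.allclose(C, A @ B, rtol=1e-3, atol=1e-3)
print("8-way column-parallel matmul matches the single-device result")
```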

3:11

And you might be asking: I mean, why stop at one HGX server? We're in America, so let's go big, stack up a hundred of these, and go get some bacon. But while grouping eight GPUs into one hardware topology is made possible by the NVLink protocol, stacking multiple HGX servers starts to make less sense because of what's called interconnect: as you extend beyond the high-speed mesh inside one node of GPUs, the interconnect bandwidth and latency between nodes can often lead to diminishing returns.

3:38

However, one clever way to get around this is to put the focus on the model architecture. Instead of trying to make the tensor operations more parallel, we can also find parallelism in the model architecture itself. For example, GPT-4 is thought to have 120 layers of neural network, and just like earlier, we can parallelize not just at the tensor level but at the architectural level as well. We can divide the neural network into 15 pipelines and dedicate one HGX server to each pipeline, so that its eight GPUs can parallelize the tensor operations for that slice of the model. Simple math tells us that 15 pipelines with eight GPUs each comes to 120 GPUs to run one instance of GPT-4.
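A small sketch of that bookkeeping, using the video's estimates (120 layers, 15 pipeline stages, 8-GPU servers); the even layer split is my simplification:

```python
# Pipeline-parallel bookkeeping from the video's estimates.
NUM_LAYERS = 120        # GPT-4 is thought to have ~120 layers
NUM_STAGES = 15         # pipeline stages, one HGX server per stage
GPUS_PER_SERVER = 8     # tensor parallelism within each server

layers_per_stage = NUM_LAYERS // NUM_STAGES        # 8 layers per stage (even split)
gpus_per_instance = NUM_STAGES * GPUS_PER_SERVER   # 120 GPUs for one model instance

# Which pipeline stage owns which layer (real systems balance by layer size).
stage_of_layer = {layer: layer // layers_per_stage for layer in range(NUM_LAYERS)}

print(f"{layers_per_stage} layers per stage, {gpus_per_instance} GPUs per GPT-4 instance")
print(f"layer 0 -> stage {stage_of_layer[0]}, layer 119 -> stage {stage_of_layer[119]}")
```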

4:18

One question you might have here is: why don't we just dedicate one HGX server to each layer of GPT-4? That's mostly because the layers vary in size, so we don't want to end up underutilizing a dedicated HGX server on a very small layer. And now that we have tensor parallelism and pipeline parallelism done, the question is this: OpenAI had 25,000 GPUs, but we saw that one single instance of GPT-4 can practically run on 120 GPUs. What do we do with the remaining 24,880 GPUs that are sitting idle?

4:49

To parallelize between the instances of GPT-4, we can add another layer of parallelism, this time on the data: in other words, data parallelism. For training, you can essentially replicate the GPT-4 model hundreds of times and batch-process the 13 trillion pre-training tokens across the replicas, using gradient-averaging algorithms to average out the weights across all those instances as training proceeds. Since we have around 25,000 GPUs available to train GPT-4 and each GPT-4 training instance needs around 120 GPUs, simple math shows we can run up to around 200 replicas of GPT-4. And now we're utilizing all 25,000 GPUs to pre-train the GPT-4 model.
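Here is a sketch of those last two steps, my own illustration rather than OpenAI's actual recipe: the replica count falls out of simple division, and the synchronization is conceptually just a mean over the gradients computed by each data-parallel replica (performed with an all-reduce in a real framework).

```python
import numpy as np

# Data parallelism by the numbers from the video.
TOTAL_GPUS = 25_000
GPUS_PER_INSTANCE = 120                      # 15 pipeline stages x 8-way tensor parallel

replicas = TOTAL_GPUS // GPUS_PER_INSTANCE   # 208, i.e. "around 200" model replicas
print(f"{replicas} data-parallel replicas of GPT-4")

# Each replica computes gradients on its own shard of the token batch; the
# replicas stay in sync by applying the average of those gradients. Toy version:
param_size = 4
per_replica_grads = [np.random.randn(param_size) for _ in range(replicas)]
averaged_grad = np.mean(per_replica_grads, axis=0)   # what every replica applies
print("averaged gradient:", averaged_grad)
```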

5:30

Now that we understand why we need this many GPUs to train one large language model, let's get to the heart of the video, which is energy. Since one HGX server connects eight GPUs, to run all 25,000 A100 GPUs we need around 3,125 of these HGX servers, and each HGX server consumes up to around 6.5 kW, as we saw earlier. Training GPT-4 required 20 septillion floating-point operations, which took about 3 months on that infrastructure of 3,125 HGX servers. So we have 6.5 kW × 24 hours × 90 days × 3,125 servers, which comes to around 43,875,000 kilowatt-hours, or roughly 44 gigawatt-hours, give or take. To put that into perspective, that's about what a small city of 50,000 people consumes in a month.
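The arithmetic behind that figure, spelled out as a short script:

```python
# Training energy from the video's estimates:
# 3,125 HGX servers drawing up to ~6.5 kW each, around the clock for ~90 days.
SERVERS = 3_125
KW_PER_SERVER = 6.5
HOURS = 24 * 90                               # three months of continuous training

energy_kwh = KW_PER_SERVER * HOURS * SERVERS  # kilowatt-hours
energy_gwh = energy_kwh / 1e6                 # 1 GWh = 1,000,000 kWh

print(f"{energy_kwh:,.0f} kWh  ~  {energy_gwh:.1f} GWh")  # 43,875,000 kWh, ~43.9 GWh
```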

6:19

So if you think about it from this chart, each dot here that shows a model could easily represent training energy equivalent to a small city's monthly consumption. And keep in mind that the GPT-4 model is from back in 2023. Since then, we've seen huge improvements in infrastructure, in interconnect protocols, and in the underlying AI model architectures, all of which could improve the energy demand of training. But training is only a portion of the total energy an AI model needs: deploying the trained model to production, where the general public uses it for inference, requires more energy than the training demand we just saw.

6:56

So the next natural question is this: how much electricity is needed to actually run these AI models in production? But first, let's talk about jobs from Woven. I was looking to hire a software developer at my previous company, and one thing I always found was that candidates had different skill sets: some people were really good at code reviews, others at system debugging, and now, with AI, at agentic programming. So coming up with coding evaluations for each role took a lot of time and effort to build scenarios and give feedback, and it just wasn't fun for anyone involved in the process. Woven is a human-powered technical assessment tool that streamlines hiring. So if you're looking to hire engineers, Woven is offering a 14-day free trial with 20% off your first hire. Check the link in the description. Okay, back to the question.

7:38

How much energy is needed to run these AI models? OpenAI is projected to have more than 700 million weekly active users on ChatGPT. If we take the earlier GPT-4 model, GPT-4 is estimated to run on a cluster of 128 GPUs for inference. Using the same math we used earlier for training, with 8-GPU tensor parallelism in one NVLink-connected server and 16-way pipeline parallelism, we find that we can run the GPT-4 model in production using 128 GPUs. But that's just one instance of GPT-4 running. To serve the 700 million active users of ChatGPT, we're going to need a lot more compute, which means a lot more power.

8:20

OpenAI now receives over 2.5 billion prompts per day, and processing one single query with GPT-4 can consume around 0.3 watt-hours, although that depends on the type of work. But if we project out how much energy this might need, 2.5 billion prompts at 0.3 watt-hours brings us to 750 megawatt-hours per day in energy cost alone. In comparison to the earlier math, where training the GPT-4 model for 90 days cost around 44 gigawatt-hours in total, deploying GPT-4 for 700 million users running 2.5 billion prompts per day at 0.3 watt-hours comes to a total of around 67 gigawatt-hours for 90 days of serving GPT-4 on ChatGPT.
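And the serving-side arithmetic, using the same back-of-the-envelope numbers:

```python
# Inference energy from the video's estimates:
# ~2.5 billion prompts per day at ~0.3 Wh per prompt, served for 90 days.
PROMPTS_PER_DAY = 2.5e9
WH_PER_PROMPT = 0.3
DAYS = 90

daily_mwh = PROMPTS_PER_DAY * WH_PER_PROMPT / 1e6   # Wh -> MWh
total_gwh = daily_mwh * DAYS / 1e3                  # MWh -> GWh

print(f"{daily_mwh:,.0f} MWh per day, ~{total_gwh:.1f} GWh over 90 days")  # 750 MWh/day, ~67.5 GWh
```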

9:00

And keep in mind that AI companies typically serve multiple AI models rather than just one. For example, Anthropic serves Claude 4 Sonnet and Claude 4 Opus as well as backward-compatible models like Claude 3.7, 3.5, and more. So you can see why the energy needed here compounds really fast, and not only in training but, more importantly, in running the models.

9:19

Another energy overhead that goes into both training and deploying these models is cooling, and this is measured with a metric called power usage effectiveness, or PUE: essentially, for every watt the computing hardware uses in a data center, what multiple does the whole facility draw once cooling and other overhead are included? So for OpenAI, the GPT-4 training run could have had a PUE of, let's say, 1.2, which brings our training energy consumption from 44 gigawatt-hours to 52.8 gigawatt-hours, and the energy consumption for deploying the model goes from 67 gigawatt-hours over 90 days of serving 700 million users to about 80 gigawatt-hours.
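In formula form, total facility energy is just the IT (compute) energy multiplied by the PUE; the sketch below applies the assumed PUE of 1.2 to both figures from earlier.

```python
# PUE: total facility energy = IT (compute) energy x PUE.
PUE = 1.2                # assumed for illustration, as in the video

training_it_gwh = 44.0   # ~90-day training estimate from earlier
serving_it_gwh = 67.5    # ~90-day serving estimate from earlier

print(f"training: {training_it_gwh * PUE:.1f} GWh")  # ~52.8 GWh
print(f"serving:  {serving_it_gwh * PUE:.1f} GWh")   # ~81 GWh, roughly the 80 GWh quoted
```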

9:55

Back in 2023, US data centers consumed around 176 terawatt-hours, which represents about 4.4% of the entire US electricity consumption. And what's crazy is that this number is expected to grow by the year 2030: some projections put it at up to 8 to 10% of all US electricity, though not all of this goes toward AI.

10:13

But whether those projections turn out right or wrong, one thing is for sure: we're going to need a lot more power to support this. xAI, for this reason, bought up more than 30 methane gas turbines to power its Colossus facility in Memphis. OpenAI plans to build its Stargate facility, with a capacity of up to 5 gigawatts that could power up to 400,000 GB200 GPUs, a chip superior to the H100 GPUs in use today. Meta is building out natural gas power plants, and Google is expanding its hyperscale data centers to continue growing. And it hasn't exactly been easy for these companies to fight through permits and local governments that are pushing back over concerns about straining local infrastructure, which sometimes ends up in lawsuits.

10:53

For example, xAI is getting complaints for using methane turbines as a source of electricity and circumventing local permits. Meta, instead of relying on existing utilities, is trying to generate electricity itself. Google is getting pushback from utilities and local governments on its nationwide expansion plans for hyperscale data centers. Meanwhile, China's energy buildout has mostly been centralized at the state level, which allows it to outpace the heavily regulated US energy supply; some go as far as saying that China has an oversupply of energy. By 2023, China had installed over 609 gigawatts of solar and 441 gigawatts of wind, and it has 27 reactors under construction to expand its nuclear power. China's energy capacity just seems to grow more and more each day.

11:36

So we're seeing US energy infrastructure primarily driven by private corporations, while China's is driven by the state, essentially sidestepping the potential bureaucracies that get in the way. Meanwhile, AI competition is getting more and more fierce. Thankfully, energy alone isn't the bottleneck, but it is one critical piece that helps drive innovation in AI. There are other factors, like advanced chips such as the GB200 GPU, better model architectures such as mixture of experts, quantization, and speculative decoding, and improved training methodologies, that can help reduce the energy cost of training and deploying AI models. But all in all, the energy supply needs to grow with them and underpin that innovation, which requires more power generation, a better grid to transport energy efficiently, and better water cooling systems for the chips, all of which play into determining who will come out on top of the AI race.

Summary

This video delves into the significant energy consumption required for training and deploying large language models (LLMs) like GPT-4. It highlights that training a single LLM involves a massive amount of computational power, illustrated by GPT-4, which is estimated to have required approximately 20 septillion floating-point operations and likely ran on 25,000 A100 GPUs for about 3 months. The video explains the complexities of parallelizing training across thousands of GPUs using technologies like NVLink and discusses different parallelism strategies (tensor, pipeline, and data parallelism). It then shifts to the energy demands of running these models in production, which can exceed training energy costs, especially with billions of daily prompts. For instance, serving GPT-4 to 700 million users for 90 days could consume around 67 gigawatt-hours, a figure amplified by the overhead of cooling (PUE). The video contrasts the approaches of US companies (like OpenAI, Meta, and Google) and China in scaling up energy infrastructure to meet AI demands, noting the challenges US companies face with permits and local regulations versus China's centralized, state-driven approach. Ultimately, it concludes that while technological advancements can improve efficiency, a substantial increase in energy supply and infrastructure is crucial for continued AI innovation.
