Why the Smartest AI Teams Are Panic-Buying Compute: The 36-Month AI Infrastructure Crisis Is Here
We built an economy that runs on AI and
now there isn't enough compute to run
that economy. A structural crisis is
emerging in global technology
infrastructure. Over the past three
years, the world economy has reorganized
itself around AI capabilities. It's now
the biggest capex project in human
history. And those capabilities depended
entirely on inference compute. That
compute is now physically constrained
and no relief is expected before 2028 at the earliest. This discussion documents the nature of what is going on, explains why it differs from previous technology supply crunches, analyzes the strategic implications for enterprises, and provides some actionable guidance for leaders who have to figure out how to navigate the next 24 months. Just
before we get into it, I'm not going to
hide the ball. These are the top things
that are popping up for me as I dig into
this issue. Number one, demand remains
exponential and uncapped. That is the
first big driver here. Enterprise AI
consumption is growing at least 10x annually, driven by per-worker usage
increases and the proliferation of
agentic systems. There is no ceiling.
Number two, supply is physically
constrained all the way through 2028 at
least. DRAM fabrication takes 3 to 4 years and semiconductor capacity is fully
allocated. High bandwidth memory is sold
out. New capacity literally cannot
arrive fast enough. Number three, all of
the hyperscalers are hoarding. Google,
Microsoft, Amazon, Meta, they've all
locked up compute allocation for years
in advance. So has OpenAI. So has Anthropic. They are using it to power
their own products while enterprises
compete for what remains. Number four,
pricing is going to spike. It will not
rise gradually. When demand exceeds
supply this structurally and this
severely, markets do not clear very
smoothly. TrendForce projects memory
costs alone will add 40 to 60% to
inference infrastructure in the first
half of 2026. Effective inference costs
could double or triple within 18 months.
Next, traditional planning frameworks
are broken. Capex models, depreciation
schedules, and multi-year procurement
cycles all assume predictable demand and
available supply. Neither assumption
holds in the age of AI. And finally, the
window to secure capacity is closing.
Enterprises that move now can lock in
allocation before the crisis peaks. And
those that wait are going to find
themselves bidding against each other
for scraps at best or be locked out
entirely. I want to repeat this is not a
tech problem. It's being presented as
one, but that's incorrect. It's actually
an economic transformation with
consequences that will reshape
competitive dynamics across every
industry. To understand this crisis,
let's start with what is being consumed.
A knowledge worker using AI tools
aggressively, code completion, document
analysis, research assistance, meeting
summarization, consumes, I don't know,
call it a billion tokens a year if
you're really leaning in. That's the
current baseline for heavy users at AI
forward enterprises. And the ceiling is
much higher. The ceiling is 25 billion
tokens or more a year. 25x that 1
billion. And 1 billion tokens does sound
like a lot except that it isn't. A
single complex analysis task, like reviewing 50 docs and synthesizing their key findings, can easily consume half a million tokens. A day of active coding can run into the millions of tokens. A research session with multiple iterations can reach 10 million tokens. Consumption continues to grow because we have not found a limit to what we want to do with artificial intelligence. There is no demand limit for intelligence.
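To make the billion-token baseline feel less abstract, here is a minimal bottom-up sketch that adds up the task-level numbers above. The per-task token figures are the ones just mentioned; the mix of how many tasks a heavy user runs per year is my assumption, purely for illustration.

```python
# Rough bottom-up sketch of how a heavy user reaches roughly a billion
# tokens a year. Per-task token counts come from the discussion above;
# the assumed task counts per year are illustrative only.

tasks = {
    # label: (tokens per task, assumed count per year)
    "complex document analysis": (500_000, 200),
    "active coding day":         (2_000_000, 150),
    "deep research session":     (10_000_000, 40),
}

total = sum(tokens * count for tokens, count in tasks.values())
print(f"annual total: {total:,} tokens (~{total / 1e9:.1f}B)")  # ~0.8B
```

Even with a conservative mix, a single heavy user lands within shouting distance of a billion tokens a year.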
Three dynamics are driving continued acceleration of
consumption here. First, capability is
unlocking usage. As models improve, as
orchestration improves, users are finding new applications, and features that didn't work well a year ago, like complex reasoning, now work very well.
Every capability improvement unlocks a
lot more new demand, like nonlinear new
demand. Think about what happens when
you go from managing one agent to 10,
10x the demand. Integration is also
multiplying our touch points here. So AI
is no longer a standalone tool you open
when you need it. It is now embedded
across email clients, document editors,
development environments, CRM, etc.
Every time we integrate, we are creating
a portal for ambient continuous
consumption and agents are compounding
on top of everything. The shift from human-in-the-loop to agentic systems, where AI calls AI in ever more automated loops, is not just a step change. I struggle to explain how big a
change that is in consumption terms. It
is a multiple order of magnitude change.
A single agentic workflow can consume
more tokens in an hour than a human
generates in a month. Current projections are not that unusual: average per-worker consumption could hit 10 billion tokens annually within the next year and a half, and top users could easily exceed 100 billion tokens. That's just extrapolation from observed growth at AI-forward enterprises, and it is not all that aggressive. Some organizations are going
to be approaching those levels in
specific roles right now. Well, now
let's step back and look at what this
math means at enterprise scale. Let's
say you have a 10,000 person
organization. You're at a billion tokens
per worker. And so you're consuming 10
trillion tokens a year at a blended API
rate of two bucks per million tokens.
You're spending $20 million a year on
inference. That's expensive, but it is
manageable for a Fortune 500. At 10 billion tokens per worker, the same organization is now going to consume 100 trillion tokens a year, and at the same rate, that's $200 million a year. At 100 billion tokens per worker per year, which agentic systems could reach within 18 months, your compute bill is $2 billion.
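Here is a quick sketch of that arithmetic, assuming the same 10,000-worker organization and the $2-per-million-token blended rate used above; the per-worker tiers are the ones just described.

```python
# Rough sketch of the inference-cost math above. The $2/M-token blended
# rate and the per-worker tiers are the figures from this discussion;
# everything else is just arithmetic.

WORKERS = 10_000
PRICE_PER_MILLION_TOKENS = 2.00  # blended API rate, USD

# Annual tokens per worker at each stage of adoption
tiers = {
    "heavy human use today": 1e9,      # ~1B tokens/worker/year
    "agent-assisted (18 mo)": 10e9,    # ~10B tokens/worker/year
    "agentic fleets": 100e9,           # ~100B tokens/worker/year
}

for label, tokens_per_worker in tiers.items():
    total_tokens = tokens_per_worker * WORKERS
    annual_cost = total_tokens / 1e6 * PRICE_PER_MILLION_TOKENS
    print(f"{label:>24}: {total_tokens:.0e} tokens/yr -> ${annual_cost:,.0f}/yr")
```

The jump from $20 million to $2 billion comes entirely from the per-worker consumption tier, with the price held flat.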
And these calculations assume stable
pricing and available capacity, but
really neither assumption holds. It's
even more chaotic than that. The agentic
shift deserves special attention here
because it changes our consumption
models fundamentally. Human users have
natural rate limits. They type at a
certain speed. They take breaks. They
attend meetings. They go home. A human
would probably have difficulty
sustaining more than maybe 50 million
tokens a day, even working intensively.
Agents do not have such limits. An agentic system runs continuously: monitoring, analyzing, responding, planning. It is not far-fetched to
imagine a single worker driving that
system to consume billions of tokens per
day. And a fleet of agents working in
parallel could consume trillions. That's
not hypothetical. Enterprises are
already deploying agentic systems for code review, for security monitoring, for customer service, for document processing. And every deployment creates sustained 24/7 inference demand that
dwarfs what human users generate.
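To put rough numbers on that gap, here is a back-of-envelope sketch. The 50-million-token human ceiling is the figure discussed above; the agent burn rate and fleet size are assumptions purely for illustration.

```python
# Back-of-envelope comparison of human vs. agentic consumption. The human
# ceiling is the figure discussed above; the agent burn rate and fleet
# size are illustrative assumptions, not measurements.

HUMAN_CEILING_PER_DAY = 50_000_000       # tokens/day, intensive human use
AGENT_TOKENS_PER_MINUTE = 500_000        # assumed sustained agent burn rate
FLEET_SIZE = 100                         # assumed number of parallel agents

agent_per_day = AGENT_TOKENS_PER_MINUTE * 60 * 24
fleet_per_year = agent_per_day * FLEET_SIZE * 365

print(f"one agent, per day:   {agent_per_day:,} tokens "
      f"({agent_per_day / HUMAN_CEILING_PER_DAY:.0f}x the human ceiling)")
print(f"fleet of {FLEET_SIZE}, per year: {fleet_per_year / 1e12:.0f} trillion tokens")
```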
The enterprises planning for a billion
tokens a worker are planning for the
wrong curve. They need to plan for the workers, plus the agents those workers deploy, plus the agents the enterprise deploys centrally. This total
consumption footprint could be 10 to
100x the per human calculation. One data
point really crystallizes the insane
trajectory that we're all on. Google has
publicly disclosed that it processed 1.3
quadrillion tokens per month across its
services. That is a 130-fold increase in
just over a year. Google is the world's
most sophisticated operator of AI
infrastructure with more visibility into
demand patterns than anybody else. Their
consumption curve is your leading
indicator. If the world's largest and
most capable AI operator is seeing that kind of annual growth, enterprise planning for 10x may well be conservative. And yet
and yet we live in a memory bottleneck.
Right? All of this assumes that we can
get the hardware to run it. But AI
inference is memory bound. The size of
the model you can run, the speed at
which you can run it, and the number of
concurrent users you can serve all
depend on memory. Specifically, high
bandwidth memory for data center
inference and DDR5 for everyone else.
And the memory market is broken. Server DRAM prices have risen at least 50% through 2025, likely more. TrendForce projects they're rising another 55 to 60% quarter over quarter in Q1 2026. We are looking at triple-digit increases in DRAM prices, and I think it goes higher than that. The DDR5 64GB RDIMM modules, the workhorse of enterprise data center deployment, could cost twice as much by the end of 2026 as they did in early 2025. Counterpoint Research projects DRAM prices overall are going to rise roughly 50%, 47% to be precise, in 2026 due to significant undersupply. Look, I don't care whether you pick the 47% number or the 55% number. When you're talking about prices rising this fast, this much, this is not a typical cyclical shortage.
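A quick compounding check shows why "triple-digit" is the right ballpark. Stacking the quoted figures multiplicatively is my assumption about how they combine, not something TrendForce states.

```python
# Quick compounding check on the price projections quoted above: roughly
# +50% through 2025, then another 55-60% quarter-over-quarter into early
# 2026. Compounding those (an assumption, not a vendor statement) lands
# in triple-digit territory.

base = 1.00                 # indexed server DRAM price, start of 2025
through_2025 = base * 1.50  # at least +50% through 2025
q1_2026_low  = through_2025 * 1.55
q1_2026_high = through_2025 * 1.60

print(f"cumulative increase, low case:  {(q1_2026_low - 1) * 100:.0f}%")   # ~132%
print(f"cumulative increase, high case: {(q1_2026_high - 1) * 100:.0f}%")  # ~140%
```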
There are structural factors that make this different. First, the three big players that control 95% of global memory production, Samsung, SK Hynix, and Micron, are all reallocating away from consumer. They're all headed to enterprise segments and specifically AI data center customers.
It's not that they're producing less,
it's that they're producing different
memory and the hyperscalers are buying
all of it. Then there's HBM
concentration. High bandwidth memory is
essential for large model inference. It
is a specialized product. SK Hynix
dominates and their output is allocated
to Nvidia, AMD and hyperscalers. You
can't get HBM at any price. It's simply
not available. And then last but not
least, new capacity is not quick to come
online. It takes years. A new DRAM fab facility is going to cost you on the order of $20 billion and it takes 3 to 4 years to construct and ramp.
Decisions made today to invest will not
yield chips until about 2030. There is
no near-term supply response. Samsung's
president has stated publicly that
memory shortages will affect pricing
industrywide through 2026 and beyond.
That's the world's largest memory
manufacturer telling you on the record
they cannot meet demand. Below memory,
there's another constraint. That's the
semiconductor fab layer, and it's even
more bottlenecked than memory. TSMC
manufactures the world's most advanced
chips, including the Nvidia data center
GPUs. Their 5-, 4-, and 3-nanometer nodes are all fully allocated. Nvidia is their largest customer. Apple
is second, and the hyperscalers fill out
the rest. TSMC's capacity expansion is
proceeding, but very slowly. Their
Arizona fab won't reach full production
until 2028. New facilities in Japan and
in Germany are on similar timelines.
Intel's 18A process, demonstrated at CES
2026, represents the first credible
American alternative to TSMC. But
Intel's foundry services are unproven at
scale. Their capacity is quite limited,
and their first major customers,
including Microsoft, are going to absorb
all of their initial allocation.
Meanwhile, Samsung's foundry business
has struggled with yields on advanced
nodes, making them a less reliable
alternative for cutting-edge chips. The result is kind of predictable.
Essentially, all advanced AI chip
production runs through TSMC in Taiwan.
There is no surge capacity. There is no
alternative. This is not a short-term
fix situation either. But we're not done
with the bottlenecks. We also have a GPU
allocation crisis. Nvidia dominates AI
training and inference chips with
roughly 80% market share. Their H100 and
newer Blackwell GPUs are the standard
for data center AI. Both are sold out.
Lead times for large GPU orders exceed 6
months. The hyperscalers have locked up
allocation through multi-year purchase
agreements worth tens of billions of
dollars. Microsoft, Google, Amazon,
Meta, and Oracle have collectively
committed to hundreds of billions in
Nvidia purchases over the next several
years because of the demand I've been
talking about. And enterprise buyers are
left with whatever allocation remains,
which increasingly is not much. NVIDIA's
H200 and Blackwell GPUs, which offer
significant performance enhancements,
are even more constrained. Initial
production runs are fully allocated to
hyperscalers. Enterprise availability in
any kind of volume is very uncertain,
and you don't have great alternatives.
AMD's Instinct MI300X is competitive on
specs and available in somewhat larger
quantities, but the software ecosystem
maturity lags significantly versus
Nvidia. Intel's Gaudi accelerators have
struggled to gain market share despite
competitive pricing. Software and
ecosystem adoption remain challenges.
Custom silicon options, like Google's TPUs or Amazon's Trainium, are not available
to enterprises and they're typically
built for internal use unless you have
very special deals. None of these
alternatives changes the fundamental
picture. The GPUs that enterprises need
are controlled by companies that have
every incentive to use them internally
rather than sell them. And this is the
piece that most analyses miss. AWS,
Azure, and Google Cloud are not neutral.
They are AI product companies that
happen to sell infrastructure. They
compete directly with their enterprise
customers. Google uses its compute
allocation to power Gemini, which
competes with every enterprise AI
deployment. Microsoft uses its allocation for Copilot, which competes with every enterprise productivity AI, and Amazon uses its allocation for AWS
AI services competing across the board.
When compute is abundant this conflict
of interest is very manageable. The
hyperscalers can serve their own needs
and sell excess capacity and everybody
wins. When compute is scarce, like now, the conflict becomes zero-sum real fast.
Every GPU allocated to an enterprise
customer is a GPU not available for
Gemini, Copilot or Alexa. The
hyperscalers must choose between their
own products and their customers. This
dynamic, by the way, is true for OpenAI
and Anthropic as well. If the GPUs are
not in an OpenAI data center, OpenAI is
missing out on serving ChatGPT
customers. And when in doubt, all of
these hyperscalers will choose their
products. They already have. Look at
what's happening with rate limits. API
pricing has fallen over the past two
years, but rate limits have tightened.
Enterprise customers report increasing
difficulty getting allocation
commitments for high volume deployments.
And to be honest, hyperscalers are not
being the villains here. They're being
rational. Their AI products are
strategic priorities with internal
champions. And selling capacity to
enterprises is just a business. It's not
the business their leadership is
necessarily measured on. In the race to
AGI, enterprise CTOs need to internalize
that the cloud providers are not going
to be reliable partners in this crisis.
They are competitors who control the
scarce resource that you need. Now, of
course, in a smoothly functioning
market, prices will rise gradually as
supply tightens, demand will moderate,
and equilibrium is restored. This is not
the world we live in. Because supply
cannot respond to demand and demand
cannot be deferred, prices are going to
spike. Buyers will bid against each
other. They're willing to pay premiums.
Sellers seeing the desperation will
raise prices for extraction, not for
equilibrium. We've already seen this
before. DRAM prices spiked 300% during
the 2016 shortage. GPU prices doubled
during the crypto mining boom. Memory
prices are notoriously volatile because
the supply side is so inelastic. You
just can't produce chips fast. And the
current situation has all the
ingredients for a severe spike. Supply
is inelastic. No new fabs coming online
for a couple years. Demand is inelastic.
Enterprises are committed to AI and
they're not changing. Information is
asymmetric. The hyperscalers know how
constrained supply is. Enterprises by
and large do not. And coordination is
possible with only three major memory
suppliers and one GPU supplier. Tacit coordination can happen, and no antitrust violation is required. Now, the impact of a spike in
inference cost is going to depend on
your business model. AI native startups
are extremely exposed. Companies like
Notion have publicly disclosed that AI
costs now consume 10 percentage points
of what was previously a 90% gross
margin business. If AI is margin-dilutive at current pricing and inference costs double, many AI-native business models are going to become unviable.
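The margin math is easy to sketch. The revenue base and cost split below are hypothetical, with AI inference pegged at the 10 points of gross margin cited above.

```python
# Illustration of the margin math implied above. The revenue base and the
# split are hypothetical; "10 points of a 90% gross margin" is the figure
# cited for AI-native products like Notion's.

revenue = 100.0                 # arbitrary revenue base
non_ai_cogs = 10.0              # 90% gross margin before AI
ai_inference_cost = 10.0        # AI now consumes ~10 points of margin

margin_now = (revenue - non_ai_cogs - ai_inference_cost) / revenue
margin_if_double = (revenue - non_ai_cogs - 2 * ai_inference_cost) / revenue
margin_if_triple = (revenue - non_ai_cogs - 3 * ai_inference_cost) / revenue

print(f"gross margin today:        {margin_now:.0%}")        # 80%
print(f"if inference costs double: {margin_if_double:.0%}")  # 70%
print(f"if inference costs triple: {margin_if_triple:.0%}")  # 60%
```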
Enterprise software companies building AI features face similar pressure. AI is a competitive requirement, but every AI feature erodes margin. Companies are
going to have to choose between
competitive necessity and financial
sustainability. Enterprises using AI
internally have somewhat more
flexibility. If AI is creating
disproportionate value, the cost
increase can be justified, but I would expect budget scrutiny to intensify over the next two years. AI projects that were approved at
one cost level may well be canceled at
twice the cost. Now, the hyperscalers
themselves, ironically, are somewhat
insulated. They own the infrastructure
that's becoming scarce, but even they
will face constraints. Google,
Microsoft, and Amazon are all warning
investors about rising AI infra costs.
The companies most at risk are those in
the middle. They're too dependent on AI
to abandon it. They're not large enough
to secure dedicated allocation, and they
are competing in markets where pass-through cost increases are very
difficult to sustain. So, let's cut to
the chase. How do you plan for this if
you're in the enterprise? That's the
billion-dollar question. Traditional planning fails. So enterprise IT
planning evolved for a fundamentally
different era and is just not ready. The
traditional model was to assess requirements, procure infrastructure, and depreciate it over 3 to 5 years, and all of that assumed predictable demand, stable technology, and available supply. None of that is true
anymore. Demand is unpredictable and
exponentially scaling. Technology itself
is unstable because model architectures
and hardware capabilities are changing
rapidly and supply is extremely
constrained. As we've discussed, CTOs
who apply traditional planning
frameworks in that environment are going
to set themselves up to systematically
make bad decisions. They're going to
overcommit to long-term purchases that
become stranded assets. I've seen it
already. They're going to underinvest in
flexibility and optionality and they're
going to assume supply availability that
isn't really there. Let's consider a
real example to make this sort of
tangible and concrete. Let's suppose we
have an enterprise that purchases a
thousand AI workstations with NPU
capabilities at 5 grand each. That's a
$5 million capital investment. Finance
sets a 4-year depreciation schedule. It
expects to extract 1.25 million a year
in value. By year two, those same
workstations cannot handle the workload
because per worker consumption has gone
up 10x. The NPUs that were adequate for
code completion and document
summarization cannot sustain agentic
workflows consuming billions of tokens.
The machines aren't broken, they're just
obsolete. What does the enterprise do at
this point? Option A is continue using
inadequate hardware. Workers get constrained. Productivity growth lags. Competitors with better infrastructure pull ahead. The savings from extending depreciation cost you far more in lost productivity than they return. Option B, purchase new
hardware. The enterprise takes a write-down on the assets. The $5 million
investment yields maybe 2 million in
value and finance is unhappy. Option C,
lease instead of buy. The enterprise
pays a premium because lessors aren't
charities, but it transfers depreciation
risk. Option C may be the correct
answer, but it also passes the buck and
you have to be able to find a way to
lease tech at scale that actually works.
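Here is a hedged sketch of the math behind those options, using the $5 million purchase and 4-year schedule from the example. The lease premium and the year-two refresh point are assumptions for illustration, not quotes.

```python
# Hedged sketch of the workstation example above: $5M purchase, 4-year
# straight-line depreciation, hardware effectively obsolete after 2 years.
# The lease premium is an assumed figure, not a quoted market rate.

purchase = 5_000_000
useful_years_planned = 4
useful_years_actual = 2

annual_depreciation = purchase / useful_years_planned        # $1.25M/yr as planned
book_value_at_obsolescence = purchase - annual_depreciation * useful_years_actual

print(f"planned depreciation: ${annual_depreciation:,.0f}/yr")
print(f"write-down if refreshed at year 2 (Option B): ${book_value_at_obsolescence:,.0f}")

# Option C: leasing at an assumed ~20% premium over the effective 2-year
# ownership cost, with the obsolescence risk carried by the lessor instead.
assumed_lease_premium = 0.20
lease_cost_2yr = (purchase / useful_years_actual) * 2 * (1 + assumed_lease_premium)
print(f"assumed 2-year lease cost (Option C): ${lease_cost_2yr:,.0f}")
```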
And so as much as it is probably the
ideal solution from a balance sheet
perspective, I have yet to see an
enterprise successfully execute a
large-scale lease. What I have seen is
large-scale commitments to cloud as a
way of deferring cost incurred at the
enterprise level. And I do wonder if one
of the ways forward here is that the
workstations we're going to be using at
the enterprise will remain pretty dumb,
but we will buy that scarce cloud
capacity, which is exactly what the
hyperscalers would want because cloud
providers will offer substantial
discounts for committed use agreements.
If you're doing a multi-year commit to
cloud, you can reduce effective pricing
by 30 to 50% compared to an on-demand rate. At 10x annual demand growth, the problem is that those multi-year commits end up being traps. In scenario one, you undercommit: you estimate 10 trillion tokens a year for the business, your actual consumption is 30, you pay on-demand rates for your overage, and you're in real trouble from a budget perspective. In scenario two, you get overaggressive and overcommit: you estimate 30 trillion tokens, your actual consumption is 15, and now you've paid for 30 trillion regardless, so half your spend is waste. In scenario three, you commit accurately, and this requires very carefully predicting your AI consumption, your capability improvements, your efficiency gains, and so on. The probability of accurate prediction in the dynamic environment we're in is, in practice, zero. Scenario three is what people wish was there, and it's not really there.
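Here is a minimal sketch of the commit math across those three scenarios. The 40% committed-use discount is an assumed midpoint of the 30 to 50% range mentioned above, and overage is assumed to bill at the on-demand rate.

```python
# Hedged sketch of the commit-vs-actual math described above. Rates are
# assumptions for illustration: $2.00/M tokens on demand, a 40% committed-
# use discount, overage billed at the on-demand rate.

ON_DEMAND = 2.00          # USD per million tokens
COMMIT_DISCOUNT = 0.40    # assumed committed-use discount
COMMIT_RATE = ON_DEMAND * (1 - COMMIT_DISCOUNT)

def annual_cost(committed_t, actual_t):
    """Cost in USD given committed and actual consumption, in trillions of tokens."""
    committed_m = committed_t * 1e6   # trillions -> millions of tokens
    actual_m = actual_t * 1e6
    overage_m = max(actual_m - committed_m, 0)
    return committed_m * COMMIT_RATE + overage_m * ON_DEMAND

print(f"undercommit (10T committed, 30T used): ${annual_cost(10, 30):,.0f}")
print(f"overcommit  (30T committed, 15T used): ${annual_cost(30, 15):,.0f}")
print(f"accurate    (30T committed, 30T used): ${annual_cost(30, 30):,.0f}")
```

Under these assumptions the undercommit case pays roughly 45% more than an accurate commit, and the overcommit case pays full freight for capacity it never uses.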
A lot of enterprises are looking at this and choosing scenario one. They're doing a committed use agreement, treating that as the floor, and going with overages as capacity grows because they can't predict it. And that lines up with the strategic levers that I've seen sharp CTOs actually use in this kind of
environment. Principle number one, sharp
CTOs are securing capacity before they
need it. The single highest impact
action an enterprise can take is
securing inference now before the crisis
peaks. That doesn't mean you're
necessarily signing a gigantic committed
use agreement. It means obtaining some
contractual guarantees of throughput
with some degree of SLAs for
availability. And so the conversation
with vendors should shift from "what is your price per million tokens?" to "can you contractually guarantee us X billion tokens per day, sustained, with 99.9% availability?"
If the vendor can't deliver the volume,
their pricing is often irrelevant.
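As a sanity check on what such a guarantee implies, here is a rough conversion from a daily token commitment into sustained throughput. The 5-billion-token example, the per-accelerator serving rate, and the peak-to-average ratio are all assumptions for illustration; substitute your own measured numbers.

```python
# Turn "X billion tokens per day" into sustained tokens/sec and a rough
# accelerator count. All figures here are illustrative assumptions, not
# vendor specifications.

GUARANTEE_TOKENS_PER_DAY = 5e9            # e.g. asking for 5B tokens/day sustained
SECONDS_PER_DAY = 86_400
ASSUMED_TOKENS_PER_SEC_PER_GPU = 2_000    # placeholder serving throughput
PEAK_TO_AVERAGE = 3.0                     # assumed peak load vs. daily average

avg_tps = GUARANTEE_TOKENS_PER_DAY / SECONDS_PER_DAY
peak_tps = avg_tps * PEAK_TO_AVERAGE
gpus_needed = peak_tps / ASSUMED_TOKENS_PER_SEC_PER_GPU

print(f"average: {avg_tps:,.0f} tokens/sec, peak: {peak_tps:,.0f} tokens/sec")
print(f"~{gpus_needed:.0f} accelerators at the assumed serving rate")
```

Framing the ask this way makes clear to the vendor that a guarantee is a dedicated hardware footprint, not a line item.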
Principle number two, sharp CTOs are
building a routing layer. The most
durable competitive advantage in this
environment is the intelligence layer
that decides where workloads run. A
sophisticated routing system is going to
optimize for cost. It's going to manage
capacity. It's going to preserve your
optionality by abstracting the underlying infrastructure so that switching providers is trivial, and it's going to give you negotiating leverage. But
building this layer, it's not trivial.
You have to have the right architecture
for a routing layer. You have to be able
to evaluate models intelligently on the
fly. You have to have great
observability and you're going to have
to hire a team to sustain this. This capability is so important that, if you are operating at enterprise scale, you cannot really outsource it. You have to own the secret sauce that connects your internal usage with a smart router that optimizes for cost. The routing layer is how you maintain your business independence.
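To make the idea concrete, here is a minimal sketch of such a routing layer. The provider names, prices, and the scoring rule are hypothetical placeholders; a real system would add model-quality evaluation, observability hooks, and live capacity signals.

```python
# A minimal sketch of the routing layer described above, not a production
# design. Providers, prices, and tiers are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    usd_per_million_tokens: float
    available: bool        # fed by health checks / rate-limit signals
    quality_tier: int      # 1 = frontier, 2 = mid, 3 = small/cheap

PROVIDERS = [
    Provider("frontier-api-a",    15.00, True, 1),
    Provider("mid-tier-api-b",     2.00, True, 2),
    Provider("self-hosted-small",  0.30, True, 3),
]

def route(task_tier: int) -> Provider:
    """Pick the cheapest available provider that meets the task's quality tier."""
    candidates = [p for p in PROVIDERS if p.available and p.quality_tier <= task_tier]
    if not candidates:
        raise RuntimeError("no capacity available for this task tier")
    return min(candidates, key=lambda p: p.usd_per_million_tokens)

# Example: a summarization job that a mid-tier model can handle
chosen = route(task_tier=2)
print(f"routing to {chosen.name} at ${chosen.usd_per_million_tokens}/M tokens")
```

The point of the design is that the calling application never names a provider directly, so swapping or adding capacity is a configuration change, not a rewrite.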
Principle number three, sharp CTOs treat
hardware like a consumable. Any hardware
purchased for AI workloads should be
mentally depreciated within 2 years
regardless of how accounting is going to
treat it. So for workstations and edge
devices, yeah, you should lease where
you can. It's often hard to do. And if
you're purchasing, I would use an
accelerated depreciation schedule
because I think that reflects reality.
You need to plan for refresh cycles that
coincide with hardware generations
because every 18 to 24 months there's
going to be a new GPU architecture that
arrives with a really significant
capability improvement you're going to
want. Principle number four, sharp CTOs
invest in efficiency. In a supply-constrained environment, efficiency is a competitive advantage. Every token you
don't consume is capacity you can
allocate to additional workloads. And so
an enterprise that can accomplish the
same task with 50% fewer tokens has
twice the effective capacity. This is
why DeepSeek's work on Engram is so interesting: they were able to dramatically reduce token
usage at inference for factual lookups.
But it's not just about using the smallest model capable of each task. Well-designed
prompts can result in much lower token
usage. Caching can result in much lower
token usage. Retrieval augmentation can
use embedding based retrieval that is
orders of magnitude cheaper than raw
inference. This is a little bit of what DeepSeek was getting into. Quantization can
enable a smaller model to match larger
model performance on very specific
tasks. These kinds of efficiency
investments have traditionally been
lower priority than capability
investments, but in a constrained
environment, they're going to become
critical. The enterprises that can unlock 10x efficiency have effectively given themselves 10x more capacity.
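A toy illustration of the caching point: identical or repeated requests should never hit the model twice. This is a bare exact-match cache keyed on a hash of the prompt; real systems would layer on semantic caching, TTLs, and per-tenant scoping, and `call_model` below is just a stand-in for whatever inference API you actually use.

```python
# Minimal exact-match response cache to illustrate token savings from
# caching. `call_model` is a placeholder, not a real API.

import hashlib

_cache: dict[str, str] = {}
tokens_saved = 0

def call_model(prompt: str) -> str:
    # Placeholder for a real inference call.
    return f"<answer to: {prompt[:40]}...>"

def cached_completion(prompt: str, est_tokens: int) -> str:
    """Return a cached answer when the exact prompt was seen before."""
    global tokens_saved
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        tokens_saved += est_tokens   # inference we did not have to buy
        return _cache[key]
    result = call_model(prompt)
    _cache[key] = result
    return result

for _ in range(3):  # the same FAQ asked three times
    cached_completion("Summarize our Q3 refund policy changes.", est_tokens=5_000)
print(f"tokens saved: {tokens_saved:,}")  # 10,000: two of three calls served from cache
```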
The global inference crisis, if we step back, is not a prediction. I'm not predicting something here. I am simply observing current conditions and what is right around the corner based on the price increases we're already seeing. The demand
curve here is exponential. The supply
curve is flat and the gap is just going
to widen for the next couple of years.
The playbook is pretty clear when you
actually see the situation for what it
is. You have to secure your capacity
now. You have to build a routing layer
that enables you to allocate your choice
of model where you want it. You have to
treat hardware more as a consumable.
It's a big change for IT departments.
You have to invest in efficiency as if
it's your competitive advantage. And
you're gonna have to think about how you
can diversify across your entire stack
as much as you can so that you reduce
your dependence on any single player in
the ecosystem. The enterprises that act
on this playbook will be positioned to
operate through the crisis and compete
effectively when supply starts to even
out on the other side. Those that don't
are going to find themselves capacity
constrained and cost pressured and
they're going to be falling behind in
the biggest technology race in history.
But this is, remember, this is not a
technology problem. It is an economic
transformation problem. And it's going
to separate winners and losers based on
these kinds of decisions that I'm
outlining here that CTOs will make in
the next 6 months. The window is open
for action, but it's not going to stay
open long given where the prices are
going. The moment to move and secure
your capacity is now. Best of luck.