Why the Smartest AI Teams Are Panic-Buying Compute: The 36-Month AI Infrastructure Crisis Is Here
We built an economy that runs on AI and
now there isn't enough compute to run
that economy. A structural crisis is
emerging in global technology
infrastructure. Over the past three
years, the world economy has reorganized
itself around AI capabilities. It's now
the biggest capex project in human
history. And those capabilities depended
entirely on inference compute. That
compute is now physically constrained
and no relief is expected before 2028 at the earliest. This discussion documents the nature of what is going on, explains why it differs from previous technology supply crunches, analyzes the strategic implications for enterprises, and provides some actionable guidance for leaders who have to figure out how to navigate the next 24 months. Just
before we get into it, I'm not going to
hide the ball. These are the top things
that are popping up for me as I dig into
this issue. Number one, demand remains
exponential and uncapped. That is the
first big driver here. Enterprise AI
consumption is growing at least 10x annually, driven by per-worker usage
increases and the proliferation of
agentic systems. There is no ceiling.
Number two, supply is physically
constrained all the way through 2028 at
least. DRAM fabrication takes 3 to 4 years and semiconductor capacity is fully
allocated. High bandwidth memory is sold
out. New capacity literally cannot
arrive fast enough. Number three, all of
the hyperscalers are hoarding. Google,
Microsoft, Amazon, Meta, they've all
locked up compute allocation for years
in advance. So has OpenAI. So has Anthropic. They are using it to power
their own products while enterprises
compete for what remains. Number four,
pricing is going to spike. It will not
rise gradually. When demand exceeds
supply this structurally and this
severely, markets do not clear very
smoothly. TrendForce projects memory
costs alone will add 40 to 60% to
inference infrastructure in the first
half of 2026. Effective inference costs
could double or triple within 18 months.
Next, traditional planning frameworks
are broken. Capex models, depreciation
schedules, and multi-year procurement
cycles all assume predictable demand and
available supply. Neither assumption
holds in the age of AI. And finally, the
window to secure capacity is closing.
Enterprises that move now can lock in
allocation before the crisis peaks. And
those that wait are going to find
themselves bidding against each other
for scraps at best or be locked out
entirely. I want to repeat this is not a
tech problem. It's being presented as
one, but that's incorrect. It's actually
an economic transformation with
consequences that will reshape
competitive dynamics across every
industry. To understand this crisis,
let's start with what is being consumed.
A knowledge worker using AI tools
aggressively, code completion, document
analysis, research assistance, meeting
summarization, consumes, I don't know,
call it a billion tokens a year if
you're really leaning in. That's the
current baseline for heavy users at AI
forward enterprises. And the ceiling is
much higher. The ceiling is 25 billion
tokens or more a year. 25x that 1
billion. And 1 billion tokens does sound
like a lot except that it isn't. A
single complex analysis task, like reviewing 50 docs and synthesizing their key findings, can easily consume half a million tokens. A day of active coding can run into the millions of tokens. A research session with multiple iterations can reach 10 million tokens. Consumption continues to grow because we have not found a limit to what we want to do with artificial intelligence. There is no demand limit for intelligence.
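To make the billion-token baseline feel less abstract, here is a minimal bottom-up sketch that adds up the task-level numbers above. The per-task token figures are the ones just mentioned; the mix of how many tasks a heavy user runs per year is my assumption, purely for illustration.

```python
# Rough bottom-up sketch of how a heavy user reaches roughly a billion
# tokens a year. Per-task token counts come from the discussion above;
# the assumed task counts per year are illustrative only.

tasks = {
    # label: (tokens per task, assumed count per year)
    "complex document analysis": (500_000, 200),
    "active coding day":         (2_000_000, 150),
    "deep research session":     (10_000_000, 40),
}

total = sum(tokens * count for tokens, count in tasks.values())
print(f"annual total: {total:,} tokens (~{total / 1e9:.1f}B)")  # ~0.8B
```

Even with a conservative mix, a single heavy user lands within shouting distance of a billion tokens a year.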
Three dynamics are driving continued acceleration of
consumption here. First, capability is
unlocking usage. As models improve, as
orchestration improves, users are finding new applications, and features that didn't work well a year ago, like complex reasoning, now work very well.
Every capability improvement unlocks a
lot more new demand, like nonlinear new
demand. Think about what happens when
you go from managing one agent to 10,
10x the demand. Integration is also
multiplying our touch points here. So AI
is no longer a standalone tool you open
when you need it. It is now embedded
across email clients, document editors,
development environments, CRM, etc.
Every time we integrate, we are creating
a portal for ambient continuous
consumption and agents are compounding
on top of everything. The shift from human-in-the-loop to agentic systems, where AI calls AI in ever more automated loops, is not just a step change. I struggle to explain how big a
change that is in consumption terms. It
is a multiple order of magnitude change.
A single agentic workflow can consume
more tokens in an hour than a human
generates in a month. Current projections are not that unusual: average per-worker consumption could hit 10 billion tokens annually within the next year and a half, and top users could easily exceed 100 billion tokens. That's just extrapolation from observed growth at AI-forward enterprises, and it is not all that aggressive. Some organizations are going
to be approaching those levels in
specific roles right now. Well, now
let's step back and look at what this
math means at enterprise scale. Let's
say you have a 10,000 person
organization. You're at a billion tokens
per worker. And so you're consuming 10
trillion tokens a year at a blended API
rate of two bucks per million tokens.
You're spending $20 million a year on
inference. That's expensive, but it is
manageable for a Fortune 500. At 10 billion tokens per worker, the same organization is now going to consume 100 trillion tokens a year, and at the same rate, that's $200 million a year. At 100 billion tokens per worker per year, which agentic systems could reach within 18 months, your compute bill is $2 billion.
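Here is a quick sketch of that arithmetic, assuming the same 10,000-worker organization and the $2-per-million-token blended rate used above; the per-worker tiers are the ones just described.

```python
# Rough sketch of the inference-cost math above. The $2/M-token blended
# rate and the per-worker tiers are the figures from this discussion;
# everything else is just arithmetic.

WORKERS = 10_000
PRICE_PER_MILLION_TOKENS = 2.00  # blended API rate, USD

# Annual tokens per worker at each stage of adoption
tiers = {
    "heavy human use today": 1e9,      # ~1B tokens/worker/year
    "agent-assisted (18 mo)": 10e9,    # ~10B tokens/worker/year
    "agentic fleets": 100e9,           # ~100B tokens/worker/year
}

for label, tokens_per_worker in tiers.items():
    total_tokens = tokens_per_worker * WORKERS
    annual_cost = total_tokens / 1e6 * PRICE_PER_MILLION_TOKENS
    print(f"{label:>24}: {total_tokens:.0e} tokens/yr -> ${annual_cost:,.0f}/yr")
```

The jump from $20 million to $2 billion comes entirely from the per-worker consumption tier, with the price held flat.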
And these calculations assume stable
pricing and available capacity, but
really neither assumption holds. It's
even more chaotic than that. The agentic
shift deserves special attention here
because it changes our consumption
models fundamentally. Human users have
natural rate limits. They type at a
certain speed. They take breaks. They
attend meetings. They go home. A human
would probably have difficulty
sustaining more than maybe 50 million
tokens a day, even working intensively.
Agents do not have such limits. An agentic system runs continuously: monitoring, analyzing, responding, planning. It is not far-fetched to
imagine a single worker driving that
system to consume billions of tokens per
day. And a fleet of agents working in
parallel could consume trillions. That's
not hypothetical. Enterprises are
already deploying agentic systems for code review, for security monitoring, for customer service, for document processing. And every deployment creates sustained 24/7 inference demand that
dwarfs what human users generate.
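To put rough numbers on that gap, here is a back-of-envelope sketch. The 50-million-token human ceiling is the figure discussed above; the agent burn rate and fleet size are assumptions purely for illustration.

```python
# Back-of-envelope comparison of human vs. agentic consumption. The human
# ceiling is the figure discussed above; the agent burn rate and fleet
# size are illustrative assumptions, not measurements.

HUMAN_CEILING_PER_DAY = 50_000_000       # tokens/day, intensive human use
AGENT_TOKENS_PER_MINUTE = 500_000        # assumed sustained agent burn rate
FLEET_SIZE = 100                         # assumed number of parallel agents

agent_per_day = AGENT_TOKENS_PER_MINUTE * 60 * 24
fleet_per_year = agent_per_day * FLEET_SIZE * 365

print(f"one agent, per day:   {agent_per_day:,} tokens "
      f"({agent_per_day / HUMAN_CEILING_PER_DAY:.0f}x the human ceiling)")
print(f"fleet of {FLEET_SIZE}, per year: {fleet_per_year / 1e12:.0f} trillion tokens")
```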
The enterprises planning for a billion
tokens a worker are planning for the
wrong curve. They need to plan for the workers, plus the agents those workers deploy, plus the agents the enterprise deploys centrally. This total
consumption footprint could be 10 to
100x the per human calculation. One data
point really crystallizes the insane
trajectory that we're all on. Google has
publicly disclosed that it processed 1.3
quadrillion tokens per month across its
services. That is a 130-fold increase in
just over a year. Google is the world's
most sophisticated operator of AI
infrastructure with more visibility into
demand patterns than anybody else. Their
consumption curve is your leading
indicator. If the world's largest and
most capable AI operator is seeing that kind of annual growth, enterprise planning for 10x may well be conservative. And yet
and yet we live in a memory bottleneck.
Right? All of this assumes that we can
get the hardware to run it. But AI
inference is memory bound. The size of
the model you can run, the speed at
which you can run it, and the number of
concurrent users you can serve all
depend on memory. Specifically, high
bandwidth memory for data center
inference and DDR5 for everyone else.
And the memory market is broken. Server DRAM prices have risen at least 50% through 2025, likely more. TrendForce projects they're rising another 55 to 60% quarter over quarter in Q1 2026. We are looking at triple-digit increases in DRAM prices, and I think it goes higher than that. The DDR5 64GB RDIMM modules, the workhorse of enterprise data center deployment, could cost twice as much by the end of 2026 as they did in early 2025. Counterpoint Research projects DRAM prices overall are going to rise roughly 50%, 47% to be precise, in 2026 due to significant undersupply. Look, I don't care whether you pick the 47% number or the 55% number. When you're talking about prices rising this fast, this much, this is not a typical cyclical shortage.
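A quick compounding check shows why "triple-digit" is the right ballpark. Stacking the quoted figures multiplicatively is my assumption about how they combine, not something TrendForce states.

```python
# Quick compounding check on the price projections quoted above: roughly
# +50% through 2025, then another 55-60% quarter-over-quarter into early
# 2026. Compounding those (an assumption, not a vendor statement) lands
# in triple-digit territory.

base = 1.00                 # indexed server DRAM price, start of 2025
through_2025 = base * 1.50  # at least +50% through 2025
q1_2026_low  = through_2025 * 1.55
q1_2026_high = through_2025 * 1.60

print(f"cumulative increase, low case:  {(q1_2026_low - 1) * 100:.0f}%")   # ~132%
print(f"cumulative increase, high case: {(q1_2026_high - 1) * 100:.0f}%")  # ~140%
```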
There are structural factors that make this different. First, the three big players that control 95% of global memory production, Samsung, SK Hynix, and Micron, are all reallocating away from consumer. They're all headed to enterprise segments and specifically AI data center customers.
It's not that they're producing less,
it's that they're producing different
memory and the hyperscalers are buying
all of it. Then there's HBM
concentration. High bandwidth memory is
essential for large model inference. It
is a specialized product. SK Hynix
dominates and their output is allocated
to Nvidia, AMD and hyperscalers. You
can't get HBM at any price. It's simply
not available. And then last but not
least, new capacity is not quick to come
online. It takes years. A new DRAM fab facility is going to cost you on the order of $20 billion and it takes 3 to 4 years to construct and ramp.
Decisions made today to invest will not
yield chips until about 2030. There is
no near-term supply response. Samsung's
president has stated publicly that
memory shortages will affect pricing
industrywide through 2026 and beyond.
That's the world's largest memory
manufacturer telling you on the record
they cannot meet demand. Below memory,
there's another constraint. That's the
semiconductor fab layer, and it's even
more bottlenecked than memory. TSMC
manufactures the world's most advanced
chips, including the Nvidia data center
GPUs. Their 5-, 4-, and 3-nanometer nodes are all fully allocated. Nvidia is their largest customer. Apple
is second, and the hyperscalers fill out
the rest. TSMC's capacity expansion is
proceeding, but very slowly. Their
Arizona fab won't reach full production
until 2028. New facilities in Japan and
in Germany are on similar timelines.
Intel's 18A process, demonstrated at CES
2026, represents the first credible
American alternative to TSMC. But
Intel's foundry services are unproven at
scale. Their capacity is quite limited,
and their first major customers,
including Microsoft, are going to absorb
all of their initial allocation.
Meanwhile, Samsung's foundry business
has struggled with yields on advanced
nodes, making them a less reliable
alternative for cutting-edge chips. The result is kind of predictable.
Essentially, all advanced AI chip
production runs through TSMC in Taiwan.
There is no surge capacity. There is no
alternative. This is not a short-term
fix situation either. But we're not done
with the bottlenecks. We also have a GPU
allocation crisis. Nvidia dominates AI
training and inference chips with
roughly 80% market share. Their H100 and
newer Blackwell GPUs are the standard
for data center AI. Both are sold out.
Lead times for large GPU orders exceed 6
months. The hyperscalers have locked up
allocation through multi-year purchase
agreements worth tens of billions of
dollars. Microsoft, Google, Amazon,
Meta, and Oracle have collectively
committed to hundreds of billions in
Nvidia purchases over the next several
years because of the demand I've been
talking about. And enterprise buyers are
left with whatever allocation remains,
which increasingly is not much. NVIDIA's
H200 and Blackwell GPUs, which offer
significant performance enhancements,
are even more constrained. Initial
production runs are fully allocated to
hyperscalers. Enterprise availability in
any kind of volume is very uncertain,
and you don't have great alternatives.
AMD's Instinct MI300X is competitive on
specs and available in somewhat larger
quantities, but the software ecosystem
maturity lags significantly versus
Nvidia. Intel's Gaudi accelerators have
struggled to gain market share despite
competitive pricing. Software and
ecosystem adoption remain challenges.
Custom silicon options, like Google's TPUs or Amazon's Trainium, are not available
to enterprises and they're typically
built for internal use unless you have
very special deals. None of these
alternatives changes the fundamental
picture. The GPUs that enterprises need
are controlled by companies that have
every incentive to use them internally
rather than sell them. And this is the
piece that most analyses miss. AWS,
Azure, and Google Cloud are not neutral.
They are AI product companies that
happen to sell infrastructure. They
compete directly with their enterprise
customers. Google uses its compute
allocation to power Gemini, which
competes with every enterprise AI
deployment. Microsoft uses its allocation for Copilot, which competes with every enterprise productivity AI, and Amazon uses its allocation for AWS
AI services competing across the board.
When compute is abundant this conflict
of interest is very manageable. The
hyperscalers can serve their own needs
and sell excess capacity and everybody
wins. When compute is scarce, like now, the conflict becomes zero-sum real fast.
Every GPU allocated to an enterprise
customer is a GPU not available for
Gemini, Copilot or Alexa. The
hyperscalers must choose between their
own products and their customers. This
dynamic, by the way, is true for OpenAI
and Anthropic as well. If the GPUs are
not in an OpenAI data center, OpenAI is
missing out on serving ChatGPT
customers. And when in doubt, all of
these hyperscalers will choose their
products. They already have. Look at
what's happening with rate limits. API
pricing has fallen over the past two
years, but rate limits have tightened.
Enterprise customers report increasing
difficulty getting allocation
commitments for high volume deployments.
And to be honest, hyperscalers are not
being the villains here. They're being
rational. Their AI products are
strategic priorities with internal
champions. And selling capacity to
enterprises is just a business. It's not
the business their leadership is
necessarily measured on. In the race to
AGI, enterprise CTOs need to internalize
that the cloud providers are not going
to be reliable partners in this crisis.
They are competitors who control the
scarce resource that you need. Now, of
course, in a smoothly functioning
market, prices will rise gradually as
supply tightens, demand will moderate,
and equilibrium is restored. This is not
the world we live in. Because supply
cannot respond to demand and demand
cannot be deferred, prices are going to
spike. Buyers will bid against each
other. They're willing to pay premiums.
Sellers seeing the desperation will
raise prices for extraction, not for
equilibrium. We've already seen this
before. DRAM prices spiked 300% during
the 2016 shortage. GPU prices doubled
during the crypto mining boom. Memory
prices are notoriously volatile because
the supply side is so inelastic. You
just can't produce chips fast. And the
current situation has all the
ingredients for a severe spike. Supply
is inelastic. No new fabs coming online
for a couple years. Demand is inelastic.
Enterprises are committed to AI and
they're not changing. Information is
asymmetric. The hyperscalers know how
constrained supply is. Enterprises by
and large do not. And coordination is
possible with only three major memory
suppliers and one GPU supplier. Tacit coordination can happen, and no antitrust violation is required. Now, the impact of a spike in
inference cost is going to depend on
your business model. AI native startups
are extremely exposed. Companies like
Notion have publicly disclosed that AI
costs now consume 10 percentage points
of what was previously a 90% gross
margin business. If AI is margin-dilutive at current pricing and inference costs double, many AI-native business models are going to become unviable.
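The margin math is easy to sketch. The revenue base and cost split below are hypothetical, with AI inference pegged at the 10 points of gross margin cited above.

```python
# Illustration of the margin math implied above. The revenue base and the
# split are hypothetical; "10 points of a 90% gross margin" is the figure
# cited for AI-native products like Notion's.

revenue = 100.0                 # arbitrary revenue base
non_ai_cogs = 10.0              # 90% gross margin before AI
ai_inference_cost = 10.0        # AI now consumes ~10 points of margin

margin_now = (revenue - non_ai_cogs - ai_inference_cost) / revenue
margin_if_double = (revenue - non_ai_cogs - 2 * ai_inference_cost) / revenue
margin_if_triple = (revenue - non_ai_cogs - 3 * ai_inference_cost) / revenue

print(f"gross margin today:        {margin_now:.0%}")        # 80%
print(f"if inference costs double: {margin_if_double:.0%}")  # 70%
print(f"if inference costs triple: {margin_if_triple:.0%}")  # 60%
```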
Enterprise software companies building AI features face similar pressure. AI is a competitive requirement, but every AI feature erodes margin. Companies are
going to have to choose between
competitive necessity and financial
sustainability. Enterprises using AI
internally have somewhat more
flexibility. If AI is creating
disproportionate value, the cost
increase can be justified, but I would expect budget scrutiny to intensify over the next two years. AI projects that were approved at
one cost level may well be canceled at
twice the cost. Now, the hyperscalers
themselves, ironically, are somewhat
insulated. They own the infrastructure
that's becoming scarce, but even they
will face constraints. Google,
Microsoft, and Amazon are all warning
investors about rising AI infra costs.
The companies most at risk are those in
the middle. They're too dependent on AI
to abandon it. They're not large enough
to secure dedicated allocation, and they
are competing in markets where pass-through cost increases are very
difficult to sustain. So, let's cut to
the chase. How do you plan for this if
you're in the enterprise? That's the
billion-dollar question. Traditional planning fails. So enterprise IT
planning evolved for a fundamentally
different era and is just not ready. The
traditional model was to assess requirements, procure infrastructure, and depreciate it over 3 to 5 years, and all of that assumed predictable demand, stable technology, and available supply. None of that is true
anymore. Demand is unpredictable and
exponentially scaling. Technology itself
is unstable because model architectures
and hardware capabilities are changing
rapidly and supply is extremely
constrained. As we've discussed, CTOs
who apply traditional planning
frameworks in that environment are going
to set themselves up to systematically
make bad decisions. They're going to
overcommit to long-term purchases that
become stranded assets. I've seen it
already. They're going to underinvest in
flexibility and optionality and they're
going to assume supply availability that
isn't really there. Let's consider a
real example to make this sort of
tangible and concrete. Let's suppose we
have an enterprise that purchases a
thousand AI workstations with NPU
capabilities at 5 grand each. That's a
$5 million capital investment. Finance
sets a 4-year depreciation schedule. It
expects to extract 1.25 million a year
in value. By year two, those same
workstations cannot handle the workload
because per worker consumption has gone
up 10x. The NPUs that were adequate for
code completion and document
summarization cannot sustain agentic
workflows consuming billions of tokens.
The machines aren't broken, they're just
obsolete. What does the enterprise do at
this point? Option A is continue using
inadequate hardware. Workers get constrained. Productivity growth lags. Competitors with better infrastructure pull ahead. The savings from extending depreciation cost you far more in lost productivity than they return. Option B, purchase new
hardware. The enterprise takes a write-down on the assets. The $5 million
investment yields maybe 2 million in
value and finance is unhappy. Option C,
lease instead of buy. The enterprise
pays a premium because lessors aren't
charities, but it transfers depreciation
risk. Option C may be the correct
answer, but it also passes the buck and
you have to be able to find a way to
lease tech at scale that actually works.
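Here is a hedged sketch of the math behind those options, using the $5 million purchase and 4-year schedule from the example. The lease premium and the year-two refresh point are assumptions for illustration, not quotes.

```python
# Hedged sketch of the workstation example above: $5M purchase, 4-year
# straight-line depreciation, hardware effectively obsolete after 2 years.
# The lease premium is an assumed figure, not a quoted market rate.

purchase = 5_000_000
useful_years_planned = 4
useful_years_actual = 2

annual_depreciation = purchase / useful_years_planned        # $1.25M/yr as planned
book_value_at_obsolescence = purchase - annual_depreciation * useful_years_actual

print(f"planned depreciation: ${annual_depreciation:,.0f}/yr")
print(f"write-down if refreshed at year 2 (Option B): ${book_value_at_obsolescence:,.0f}")

# Option C: leasing at an assumed ~20% premium over the effective 2-year
# ownership cost, with the obsolescence risk carried by the lessor instead.
assumed_lease_premium = 0.20
lease_cost_2yr = (purchase / useful_years_actual) * 2 * (1 + assumed_lease_premium)
print(f"assumed 2-year lease cost (Option C): ${lease_cost_2yr:,.0f}")
```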
And so as much as it is probably the
ideal solution from a balance sheet
perspective, I have yet to see an
enterprise successfully execute a
large-scale lease. What I have seen is
large-scale commitments to cloud as a
way of deferring cost incurred at the
enterprise level. And I do wonder if one
of the ways forward here is that the
workstations we're going to be using at
the enterprise will remain pretty dumb,
but we will buy that scarce cloud
capacity, which is exactly what the
hyperscalers would want because cloud
providers will offer substantial
discounts for committed use agreements.
If you're doing a multi-year commit to
cloud, you can reduce effective pricing
by 30 to 50% compared to an on-demand rate. At 10x annual demand growth, the problem is that those multi-year commits end up being traps. In scenario one, you undercommit: you estimate 10 trillion tokens a year for the business, your actual consumption is 30, you pay on-demand rates for your overage, and you're in real trouble from a budget perspective. In scenario two, you get overaggressive and overcommit: you estimate 30 trillion tokens, your actual consumption is 15, and now you've paid for 30 trillion regardless, so half your spend is waste. In scenario three, you commit accurately, and this requires very carefully predicting your AI consumption, your capability improvements, your efficiency gains, and so on. The probability of accurate prediction in the dynamic environment we're in is, in practice, zero. Scenario three is what people wish was there, and it's not really there.
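Here is a minimal sketch of the commit math across those three scenarios. The 40% committed-use discount is an assumed midpoint of the 30 to 50% range mentioned above, and overage is assumed to bill at the on-demand rate.

```python
# Hedged sketch of the commit-vs-actual math described above. Rates are
# assumptions for illustration: $2.00/M tokens on demand, a 40% committed-
# use discount, overage billed at the on-demand rate.

ON_DEMAND = 2.00          # USD per million tokens
COMMIT_DISCOUNT = 0.40    # assumed committed-use discount
COMMIT_RATE = ON_DEMAND * (1 - COMMIT_DISCOUNT)

def annual_cost(committed_t, actual_t):
    """Cost in USD given committed and actual consumption, in trillions of tokens."""
    committed_m = committed_t * 1e6   # trillions -> millions of tokens
    actual_m = actual_t * 1e6
    overage_m = max(actual_m - committed_m, 0)
    return committed_m * COMMIT_RATE + overage_m * ON_DEMAND

print(f"undercommit (10T committed, 30T used): ${annual_cost(10, 30):,.0f}")
print(f"overcommit  (30T committed, 15T used): ${annual_cost(30, 15):,.0f}")
print(f"accurate    (30T committed, 30T used): ${annual_cost(30, 30):,.0f}")
```

Under these assumptions the undercommit case pays roughly 45% more than an accurate commit, and the overcommit case pays full freight for capacity it never uses.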
A lot of enterprises are looking at this and choosing scenario one. They're doing a committed use agreement, treating that as the floor, and going with overages as capacity grows because they can't predict it. And that lines up with the strategic levers that I've seen sharp CTOs actually use in this kind of
environment. Principle number one, sharp
CTOs are securing capacity before they
need it. The single highest impact
action an enterprise can take is
securing inference now before the crisis
peaks. That doesn't mean you're
necessarily signing a gigantic committed
use agreement. It means obtaining some
contractual guarantees of throughput
with some degree of SLAs for
availability. And so the conversation
with vendors should shift from "what is your price per million tokens?" to "can you contractually guarantee us X billion tokens per day, sustained, with 99.9% availability?"
If the vendor can't deliver the volume,
their pricing is often irrelevant.
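As a sanity check on what such a guarantee implies, here is a rough conversion from a daily token commitment into sustained throughput. The 5-billion-token example, the per-accelerator serving rate, and the peak-to-average ratio are all assumptions for illustration; substitute your own measured numbers.

```python
# Turn "X billion tokens per day" into sustained tokens/sec and a rough
# accelerator count. All figures here are illustrative assumptions, not
# vendor specifications.

GUARANTEE_TOKENS_PER_DAY = 5e9            # e.g. asking for 5B tokens/day sustained
SECONDS_PER_DAY = 86_400
ASSUMED_TOKENS_PER_SEC_PER_GPU = 2_000    # placeholder serving throughput
PEAK_TO_AVERAGE = 3.0                     # assumed peak load vs. daily average

avg_tps = GUARANTEE_TOKENS_PER_DAY / SECONDS_PER_DAY
peak_tps = avg_tps * PEAK_TO_AVERAGE
gpus_needed = peak_tps / ASSUMED_TOKENS_PER_SEC_PER_GPU

print(f"average: {avg_tps:,.0f} tokens/sec, peak: {peak_tps:,.0f} tokens/sec")
print(f"~{gpus_needed:.0f} accelerators at the assumed serving rate")
```

Framing the ask this way makes clear to the vendor that a guarantee is a dedicated hardware footprint, not a line item.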
Principle number two, sharp CTOs are
building a routing layer. The most
durable competitive advantage in this
environment is the intelligence layer
that decides where workloads run. A
sophisticated routing system is going to
optimize for cost. It's going to manage
capacity. It's going to preserve your
optionality by abstracting the underlying infrastructure so that switching providers is trivial, and it's going to give you negotiating leverage. But
building this layer, it's not trivial.
You have to have the right architecture
for a routing layer. You have to be able
to evaluate models intelligently on the
fly. You have to have great
observability and you're going to have
to hire a team to sustain this. This capability is so important that, if you are operating at enterprise scale, you cannot really outsource it. You have to own the secret sauce that connects your internal usage with a smart router that optimizes for cost. The routing layer is how you maintain your business independence.
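To make the idea concrete, here is a minimal sketch of such a routing layer. The provider names, prices, and the scoring rule are hypothetical placeholders; a real system would add model-quality evaluation, observability hooks, and live capacity signals.

```python
# A minimal sketch of the routing layer described above, not a production
# design. Providers, prices, and tiers are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    usd_per_million_tokens: float
    available: bool        # fed by health checks / rate-limit signals
    quality_tier: int      # 1 = frontier, 2 = mid, 3 = small/cheap

PROVIDERS = [
    Provider("frontier-api-a",    15.00, True, 1),
    Provider("mid-tier-api-b",     2.00, True, 2),
    Provider("self-hosted-small",  0.30, True, 3),
]

def route(task_tier: int) -> Provider:
    """Pick the cheapest available provider that meets the task's quality tier."""
    candidates = [p for p in PROVIDERS if p.available and p.quality_tier <= task_tier]
    if not candidates:
        raise RuntimeError("no capacity available for this task tier")
    return min(candidates, key=lambda p: p.usd_per_million_tokens)

# Example: a summarization job that a mid-tier model can handle
chosen = route(task_tier=2)
print(f"routing to {chosen.name} at ${chosen.usd_per_million_tokens}/M tokens")
```

The point of the design is that the calling application never names a provider directly, so swapping or adding capacity is a configuration change, not a rewrite.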
Principle number three, sharp CTOs treat
hardware like a consumable. Any hardware
purchased for AI workloads should be
mentally depreciated within 2 years
regardless of how accounting is going to
treat it. So for workstations and edge
devices, yeah, you should lease where
you can. It's often hard to do. And if
you're purchasing, I would use an
accelerated depreciation schedule
because I think that reflects reality.
You need to plan for refresh cycles that
coincide with hardware generations
because every 18 to 24 months there's
going to be a new GPU architecture that
arrives with a really significant
capability improvement you're going to
want. Principle number four, sharp CTOs
invest in efficiency. In a supply-constrained environment, efficiency is a competitive advantage. Every token you
don't consume is capacity you can
allocate to additional workloads. And so
an enterprise that can accomplish the
same task with 50% fewer tokens has
twice the effective capacity. This is
why DeepSeek's work on Engram is so interesting: they were able to dramatically reduce token
usage at inference for factual lookups.
But it's not just about using the smallest model capable of each task. Well-designed
prompts can result in much lower token
usage. Caching can result in much lower
token usage. Retrieval augmentation can
use embedding based retrieval that is
orders of magnitude cheaper than raw
inference. This is a little bit of what DeepSeek was getting into. Quantization can
enable a smaller model to match larger
model performance on very specific
tasks. These kinds of efficiency
investments have traditionally been
lower priority than capability
investments, but in a constrained
environment, they're going to become
critical. The enterprises that can unlock 10x efficiency have effectively given themselves 10x more capacity.
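A toy illustration of the caching point: identical or repeated requests should never hit the model twice. This is a bare exact-match cache keyed on a hash of the prompt; real systems would layer on semantic caching, TTLs, and per-tenant scoping, and `call_model` below is just a stand-in for whatever inference API you actually use.

```python
# Minimal exact-match response cache to illustrate token savings from
# caching. `call_model` is a placeholder, not a real API.

import hashlib

_cache: dict[str, str] = {}
tokens_saved = 0

def call_model(prompt: str) -> str:
    # Placeholder for a real inference call.
    return f"<answer to: {prompt[:40]}...>"

def cached_completion(prompt: str, est_tokens: int) -> str:
    """Return a cached answer when the exact prompt was seen before."""
    global tokens_saved
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        tokens_saved += est_tokens   # inference we did not have to buy
        return _cache[key]
    result = call_model(prompt)
    _cache[key] = result
    return result

for _ in range(3):  # the same FAQ asked three times
    cached_completion("Summarize our Q3 refund policy changes.", est_tokens=5_000)
print(f"tokens saved: {tokens_saved:,}")  # 10,000: two of three calls served from cache
```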
The global inference crisis, if we step back, is not a prediction. I'm not predicting something here. I am simply observing current conditions and what is right around the corner based on the price increases we're already seeing. The demand
curve here is exponential. The supply
curve is flat and the gap is just going
to widen for the next couple of years.
The playbook is pretty clear when you
actually see the situation for what it
is. You have to secure your capacity
now. You have to build a routing layer
that enables you to allocate your choice
of model where you want it. You have to
treat hardware more as a consumable.
It's a big change for IT departments.
You have to invest in efficiency as if
it's your competitive advantage. And
you're gonna have to think about how you
can diversify across your entire stack
as much as you can so that you reduce
your dependence on any single player in
the ecosystem. The enterprises that act
on this playbook will be positioned to
operate through the crisis and compete
effectively when supply starts to even
out on the other side. Those that don't
are going to find themselves capacity
constrained and cost pressured and
they're going to be falling behind in
the biggest technology race in history.
But this is, remember, this is not a
technology problem. It is an economic
transformation problem. And it's going
to separate winners and losers based on
these kinds of decisions that I'm
outlining here that CTOs will make in
the next 6 months. The window is open
for action, but it's not going to stay
open long given where the prices are
going. The moment to move and secure
your capacity is now. Best of luck.