Google's New AI Is Smarter Than Everyone's But It Costs HALF as Much. Here's Why They Don't Care.
Google just shipped the smartest AI
model on the planet. It's Gemini 3.1
Pro. It costs a seventh of the
competition and they don't even need you
to use it. That's right. They shipped a
model that leads on 13 of 16 benchmarks.
It costs roughly a seventh of what Opus
4.6 charges. And Google really doesn't
care. That's not a weird flex on their
part. It might be the most important
strategic signal in AI right now. And
almost nobody is talking about it. The
coverage of Gemini 3.1 Pro has been all
about those benchmarks. And what's been
missing is the question that is
underneath. Why does the richest company
in tech, a company generating over a
hundred billion in annual free cash
flow, build the most powerful reasoning
engine on the market, price it at the
floor, and be perfectly comfortable if
you keep using Claude or ChatGPT for
your daily work? The answer reshapes how
you should think about every model
release from here on out. It changes how
you evaluate your own skills and it
explains why most of the conversation
about which AI I should use is really
asking the wrong question at this point.
So a couple weeks ago I wrote about Opus
4.6 and the way they use 16 AI agents to
build a C compiler. That piece was about
a new kind of labor. Agents coordinating
in teams, managing engineering orgs,
doing weeks of sustained autonomous
work. This video is about something
different. This video is about why the
company with the deepest pockets and the
widest distribution in the history of computing, in the history of the planet,
is playing a fundamentally different
game from anybody else. And what that
means for how you evaluate AI models,
choose your tools, and understand which
of your problems are about to get
dramatically easier to solve versus
which ones are not. So, we're going to
talk about one benchmark out of those 16
because I don't usually talk about
benchmarks. That number is 77.1%
and it's on the ARC-AGI-2 benchmark. Why do we care? It's not about pattern matching from training data. ARC-AGI-2
tests whether a model can solve logic
problems it has never ever seen before.
So it's not about retrieval from
memorized examples, but about genuinely
novel reasoning. Can the model look at a
problem it's never encountered and
figure it out from first principles? I
want you to look at the acceleration on
that benchmark. Gemini 3 Pro, which
shipped just in November, scored 31.1%.
Just 90 days later, 3.1 Pro ships and it
more than doubles that score. That 46-percentage-point jump is the largest single-generation reasoning gain any
Frontier model family has ever produced.
Opus 4.6 scored 68.8% on the same
benchmark, which is very close. GPT 5.2
scored a little bit lower. The rest of
the score card tells a very similar
story, right? A very high score on GPQA
Diamond, which is essentially a science
benchmark that's saturated at this
point. A very strong benchmark score on
LiveCodeBench Pro, which measures
coding abilities. You get the idea.
These numbers are real, but the
benchmark isn't the point. The point is
that Google chose to optimize for pure
reasoning. Anthropic built Opus 4.6 for agentic work: sustained autonomous coding, tool calling in loops, agent teams coordinating across codebases, sometimes for weeks at a time. OpenAI built Codex 5.3 for specialized coding pipelines with self-bootstrapping sandboxes and thousand-token-per-second throughput at max. Google built Gemini
3.1 Pro for none of those things. They
built it to think harder, not to code
longer, not to manage more agents, to
reason more deeply about problems it has
never seen. That design choice tells you
everything about who Google thinks it is
and where they think they're going with
AI. Demis Hassabis has been saying the
same sentence for 15 years. Step one,
solve intelligence. Step two, use it to
solve everything else. And so Google is
focused on solving intelligence. He said
it when DeepMind was a London startup nobody had heard of. He said it after AlphaGo beat a Go grandmaster. He said
it at Davos last month. He said it on
the Fortune podcast last week where he
predicted artificial general
intelligence would come very, very soon. He's actually updated his prediction, and he is a conservative guy. He now sees it coming within 5 years. And
he said it on 60 Minutes when he talked
about curing most diseases within a
decade. The sentence hasn't changed
because his mission hasn't changed. This
is not how anyone else in the AI
industry talks. Sam Altman talks about
products, partnerships, distribution,
the race to a billion users. And yeah,
he talks about intelligence, but usually in the context of how it's applied. When OpenAI put ads in ChatGPT, they did so because they needed to
monetize a billion person user base. And
Hassabis, when asked about the ChatGPT ads at Davos, said he was surprised OpenAI moved so fast on advertising. Of course, he can say that: Google is able to monetize search, so they don't need to put ads in Gemini. OpenAI, by contrast, does not have the world's largest search engine and its profit streams funding them, and is in a completely different funding position. The subtext from Hassabis was unmistakable. We're not
thinking about monetizing our Google AI
chatbot. We're just thinking about
intelligence. Now, I want to be clear
here. This is not because Google is
somehow above commercial concerns.
Google runs the most profitable
advertising business in human history.
They generated over a hundred billion
dollars in annual free cash flow from
search, YouTube, and cloud. They're
spending $93 billion on capital
expenditure this year, and most of it is
AI. They can afford to let Gemini be a
research vehicle because their economic
engine has nothing to do with whether
you personally prefer Claude or ChatGPT
for your daily workflow. Everyone else
in AI is trying to figure out how to
monetize models. Google is trying to
figure out how to build intelligence.
The money handles itself. Must be nice.
And how does Google get that advantage?
It's not just the profit streams. It's
also that Google has deliberately built
over the last decade a vertical stack in
AI that nobody else has. They design
their own silicon. The Ironwood TPU, 7th
generation announced earlier this year,
delivers 10 times the compute power of
the last generation at roughly half the
energy cost per operation. It can link
up 9,216
chips in a single pod. Anthropic just
signed a deal to use a million TPUs
under a multi-year arrangement valued in
the tens of billions of dollars. Meta is
reportedly negotiating a similar
commitment. When your competitors train
their frontier models on your hardware,
you have built something beyond a moat.
You've built an impregnable fortress.
Google trains their own models on that
silicon. They deploy those models
through their own cloud infrastructure.
Google Cloud, which nine out of 10 AI
research labs use in some capacity. They
distribute them to 650 million monthly active Gemini users, a figure up 44% in a single quarter (although again, that's not the primary point), plus billions more through search, Android, YouTube, and Chrome. They have easily the largest human reach of any company in history. And they fund the fundamental research through DeepMind, which won a Nobel Prize in chemistry 18 months ago for AlphaFold, a
system that predicted the structure of
virtually every known protein, a problem
biologists have been working on for 50
years. This vertical integration from
transistor design to protein folding is
not an accident. It's the architecture
of a company that believes intelligence
is a problem in computer science, that
the problem is solvable, and that
solving it requires controlling the
entire stack from physics up to
software. Google's Jeff Dean said
they're working to shrink TPU design
cycles from 2 years down to 6 to 9 months
by using AI in the chip design process
itself. They're using intelligence to
build the hardware that runs
intelligence. The flywheel is
self-reinforcing and it's accelerating
dramatically. Nobody else has this.
Microsoft has Azure and a partnership
with OpenAI, but they don't make chips
and their consumer distribution in AI is
very fragmented. Copilot has been
rightly criticized for feeling
disjointed across office products.
Amazon has AWS and Trainium chips, but
their models trail the frontier
dramatically. Meta has research talent
and social distribution, but no cloud
business and no chip stack. Anthropic
has arguably the best product for
agentic work today, but they run on
other people's hardware, including
Google's TPUs and Amazon's
infrastructure, and they need every
single customer they can get to justify
their valuation. They cannot afford to
just build pure intelligence. Google is the only company that could lose the model race, quote unquote, entirely, with every developer and every enterprise customer choosing Claude or ChatGPT for each task, and they would still be fine because the models are not their
business. The models are experiments in
intelligence that they choose to
release, funded by the largest cash
generating machine in technology running
on proprietary silicon and feeding
results back into products used by half
the planet. That changes how you should
interpret what Google ships. So what is
Gemini 3.1 Pro and what isn't it? Gemini
3.1 Pro is not a coding agent per se,
though of course it can write code very
well. It's also not an agent manager,
although it can manage agents. It's not
trying to autonomously close issues
across a 50-person engineering org the way Opus 4.6 did at Rakuten. If that's what you need, Opus is probably better at it right now, and Google knows that.
What 3.1 Pro actually is, is the
strongest pure reasoner available at
scale at a price point that makes it
viable for any problem where reasoning
depth matters more than tool
orchestration. At $2 per million input tokens and just $12 per million output tokens, it's roughly seven and a half
times cheaper than Opus 4.6 on input and
more than six times cheaper on output.
For a workload processing a billion
tokens a month, that is the difference
between a $15,000 bill and a $2,000
bill. With context caching, Gemini's
costs can drop another 75% from there.
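If you want to sanity-check that arithmetic yourself, here's a minimal sketch in Python. The Gemini input price is the one quoted above; the Opus input price is inferred from the roughly seven-and-a-half-times ratio rather than taken from a price sheet, so treat it as an assumption.

```python
# Back-of-envelope check of the billing claim above. Prices are dollars
# per million input tokens; the Opus figure is assumed from the ~7.5x ratio.
GEMINI_INPUT_PRICE = 2.00   # $/M input tokens (quoted above)
OPUS_INPUT_PRICE = 15.00    # $/M input tokens (assumption: 7.5 x 2.00)

def monthly_input_bill(price_per_million: float, tokens_millions: float,
                       cache_discount: float = 0.0) -> float:
    """Dollar cost of a month of input tokens, with optional context caching."""
    return price_per_million * tokens_millions * (1.0 - cache_discount)

BILLION = 1_000  # a billion tokens, expressed in millions of tokens

print(monthly_input_bill(GEMINI_INPUT_PRICE, BILLION))        # 2000.0  -> the $2,000 bill
print(monthly_input_bill(OPUS_INPUT_PRICE, BILLION))          # 15000.0 -> the $15,000 bill
print(monthly_input_bill(GEMINI_INPUT_PRICE, BILLION, 0.75))  # 500.0 with 75% caching
```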
JetBrains' director of AI called it
stronger, faster, and more efficient.
Artificial Analysis currently ranks it
as the top model on their intelligence
index at roughly half the cost of its
nearest frontier peers. The model also
ships with configurable thinking levels: low, medium, high, and max. So you can dial reasoning depth and cost up or down per request. Simple classification or
summarization? Low thinking, fast and
cheap. Novel scientific problem
requiring multi-step deduction? Well,
let's turn it up to max. Let it work.
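In API terms, that dial is just a request-level knob. Here's a minimal sketch assuming the google-genai Python SDK; the thinking_level field and the model id are illustrative guesses at how the dial is exposed, so check the current docs before copying this.

```python
# Minimal sketch of per-request reasoning depth, assuming the google-genai
# SDK exposes the dial as a thinking_level field (an assumption; verify
# against current documentation). The model id is illustrative.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

def ask(prompt: str, level: str) -> str:
    response = client.models.generate_content(
        model="gemini-3.1-pro",  # illustrative model id
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level),
        ),
    )
    return response.text

# Routine classification: keep it fast and cheap.
ask("Classify this support ticket: billing, bug, or feature request?", "low")
# Novel multi-step problem: pay for depth.
ask("Find the flaw in this five-step actuarial argument: ...", "max")
```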
This is cost engineering for reasoning
at a granularity nobody else really
offers. And it matters because it means
you don't pay frontier prices for
routine tasks. But here's the real
comparison. And it matters. When you
give these models tools, web search,
code execution, database access, file
systems, and you measure their
performance on complicated real world
tasks that require using those tools
together to get work done, 4.6 catches
up and often pulls ahead. On Humanity's Last Exam with search and code tools, Opus scores 53.1% versus Gemini's 51.4%. On GDPval, which
measures expert level office and
financial tasks, Opus leads by 289 Elo points, which is a massive gap. Remember, that is the real-world work one, guys.
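To put that gap in concrete terms: under the standard Elo formula, a 289-point lead implies an expected head-to-head win rate of about 1/(1+10^(-289/400)), or roughly 84%, assuming GDPval's Elo is computed the way chess Elo is. On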
the arena coding leaderboard and expert
human preference rankings, Claude models
consistently win. The pattern here is
unambiguous. Gemini 3.1 Pro is the
strongest naked reasoner. Opus 4.6 is
the strongest equipped reasoner, the
model that's best at combining
intelligence with the ability to use
tools, call APIs, read files, write
code, and sustain that work over hours
and days. Codex 5.3 is the strongest
specialist coder. If intelligence is the
engine, tools are the drivetrain. Google
built a better engine and Anthropic
built a better car. OpenAI built a better racing transmission. For any individual task, the question isn't which model is smartest. The question is whether the
task got bottlenecked by raw thinking or
whether the ability to act on that
thinking across tools in time is the
real bottleneck. And that question turns
out to be way more interesting than any
benchmark and we are not talking about
it enough. So we're talking about it
now. First, let's understand what problems Gemini 3.1 Pro is meant to solve. I think a good example is
Gemini 3's Deep Think, which was
released February 12th and is a
specialized reasoning mode that sits
above 3.1 Pro on the intelligence curve.
Deep Think collaborated with human
researchers to solve 18 previously
unsolved problems across mathematics,
physics, computer science, and
economics. These are not incremental
improvements. They're not benchmark tricks. These are original research contributions. A conjecture in online
submodular optimization had stood
unproven since 2015. And if like me you
were wondering what is online submodular
optimization, it is a way of talking
about data that moves around the world
on the internet and mathematicians get
involved. In this case, mathematicians
proposed a seemingly obvious rule. In a
data stream, if you copy an arriving
item, it is always less valuable to do
that than to move the original,
presumably because of the risks of
defects. Now, mathematicians had spent
more than a decade trying to prove this
was true. Gemini Deep Think engineered
a precise three item combinatorial
counter example and proved the
conjecture false in a single run. And that wasn't even the most interesting result. On the max-cut problem, a classic network optimization challenge, again I'm getting way out of my depth here, and that's kind of the point.
That is why I am sharing this. If you
feel out of your depth, this is part of
what I'm trying to communicate. Gemini
3.1 Pro is about this kind of pure
reasoning. Anyway, Gemini solved this problem by pulling in (I had to look this up) a theorem from geometric functional analysis and measure theory, tools from continuous mathematics, to solve a discrete algorithmic puzzle.
Wow, that was a mouthful. Human
algorithm researchers would not
typically reach into geometric
functional analysis to solve a graph
theory problem because as much as they
may sound like gobbledygook to you,
they're actually really different
domains of mathematics. The model
crossed disciplinary boundaries that
human specialists very rarely cross
because the model doesn't see
disciplinary boundaries and that is one
of the strengths of an AI model. It's
tackling problems in physics too, including gravitational radiation. Look, the list goes on. It caught a critical error in a
cryptography paper. And just two days
before 3.1 Pro shipped, Isomorphic Labs,
which is DeepMind's drug discovery
spin-off, published results from their
AI drug design engine. And the system
had more than doubled AlphaFold 3's
accuracy on the hardest protein
prediction tasks and outperformed gold
standard physics-based methods at a
fraction of the cost and time. So here
we have what? Protein folding. We have
complicated mathematics, conjecture
breaking. We have gravitational
radiation involved, cryptographic error
detection. It's messing around with
crystal growth optimization. These are
very much pure reasoning problems at the
extreme end of the difficulty spectrum.
My head is spinning just trying to
communicate them. And I am a long way
from anywhere close to even
understanding the research. And they
share specific characteristics. The
inputs are well-defined, like a
protein sequence. The problem can be
stated extremely precisely. And the
solution requires a long and sustained
chain of logical deduction that a human
mind can verify but often cannot
generate without years of specialized
training. This is the domain where
Google's investment in intelligence as a
problem really pays off. This is what Hassabis means when he says let's solve
intelligence and then use that to solve
everything else. The everything else
starts with science. It starts with the
problems that have the highest ratio of
reasoning difficulty to ambiguity where
the question is very clear but the
answer requires genuine intellectual
horsepower to reach. And it's why I
think the most important question for
anyone reading about Gemini 3.1 Pro is
not is it better than Opus, but rather
what percentage of your actual work is
bottlenecked by that kind of thinking?
And here's where my analysis starts to
get much more personal than most of the
3.1 Pro analyses out there. Because hard
is not one thing. We've been treating
hard work like one thing for too long.
The benchmarks tend to treat it as a
single thing. The model marketing
certainly treats it like a single thing.
The LinkedIn discourse treats it like a
single thing. And the model landscape is
now very differentiated. And it's going
to force us to decompose what hard feels
like. Think about the problems you face
at work. Some of them are hard because
they require deep reasoning. Analyzing a
complicated contract for the clause that
creates downstream liability across
three jurisdictions or working through a
multi-step financial model to find the
sensitivity that changes the investment
decision or diagnosing why a distributed
system fails under a specific load
pattern that only appears at scale.
These are problems where you need to
hold multiple variables in your head,
follow a chain of logic through branches
and dependencies, and arrive at a
conclusion that isn't very obvious from
the surface. But most problems in
business are not actually hard on a
reasoning axis. They're hard on other
axes entirely. Let me give you a few
categories that I think we don't talk
about enough, and we need to understand
these problem types to actually figure
out how we're going to thrive in the AI
era. First, effort problems. They're not
intellectually difficult. They're just
large. Auditing 3,000 vendor contracts
for compliance changes. Migrating a
legacy codebase with 2 million lines of COBOL. Reviewing every customer
interaction from last quarter to
identify churn signals. The thinking at
each step is super straightforward. Any
competent person could do any individual
piece. The challenge is sustained
attention and thoroughness across a
massive surface area without dropping
detail. These are the problems Agentic
AI was built for. Opus 4.6 running for
hours on Rakuten's codebase is solving
an effort problem. The 16 agents
building a C compiler over weeks were solving an effort problem. The thinking per step is not extraordinary. The endurance is. And that's not what Gemini 3.1 Pro is
optimized for. Here's another problem
type. Coordination problems. Getting six
teams aligned on a shared architecture
decision when each team has different
priorities in different technical
contexts. Routing work across dependencies so that the back-end team doesn't block the front-end team and the front-end team doesn't block the QA team. Managing
information flow so the right people
know the right things at the right time
and nobody wastes 3 days building
something that was already decided
against in a meeting they weren't even
invited to. Rakuten's deployment of Opus 4.6, where the model autonomously closed issues and routed them across a 50-person org and six different repos, that is solving a coordination problem.
That's a model solving a human
coordination problem. It understood not
just the code but which team owned the
repo, who has context on what, where to
assign the issues and critically when to
escalate. In other words, the model developed a kind of organizational awareness relevant to engineering work. Those are capabilities where Opus 4.6 leads in a way that Gemini 3.1 Pro does not.
Emotional intelligence problems.
Delivering feedback to a direct report
who's been underperforming, but is going
through a divorce. Navigating a
negotiation where the other party's
stated concern, their price, is not
their real concern, which is control.
Reading a boardroom and knowing that the
CFO's silence means opposition, not
agreement. Managing a team through a reorg where half the people are afraid for their jobs and the other half are angling for promotions, which sounds a lot like AI change management.
Calibrating tone, timing, and
transparency in situations where the
right thing to say depends on dynamics
no model can observe. We actually don't
have models that solve this part well.
Models don't even attempt this with
reliability. And this is a massive
percentage of what makes management and
leadership genuinely hard. It frankly
makes being a solid senior individual
contributor hard because there is no
escaping this kind of emotional
intelligence problem the farther you get
into business. Judgment and willpower
problems. Deciding to kill a project
your team spent six months building
because the market signals shifted.
Saying no to a lucrative client whose
values don't align with your company's.
Choosing the strategically correct but
politically dangerous path when the data
supports it but the executive team does
not want to hear it. Making the unpopular call, accepting the career risk. Those aren't really reasoning
problems. Any competent analyst would
lay out the logic. Those are courage
problems. They're identity problems and
they're almost entirely unsolvable by AI
because the bottleneck is not computing
the correct answer. It's having the
nerve to act on it. That is a human
challenge. Domain expertise problems. A
senior engineer doesn't debug faster
because they reason better than a
junior. They debug faster because
they've seen that exact stack trace
before. They know the library's
undocumented quirks and they remember
the production incident from 2019 that
had the exact same root cause. A veteran
M&A attorney doesn't evaluate a deal
better because they're smarter. They
evaluate it better because they've
closed 300 deals and they've
internalized which representations and
warranties actually get litigated and
which ones are boilerplate that nobody
ever enforces. This is experience and
pattern recognition. knowledge
accumulated through years of repetition.
It's not really novel reasoning. Models
are getting better at simulating domain
expertise through training data, but the
gap between "has read about it" and "has lived it in the courtroom" is still very
real, particularly in domains with thin
published literature. And here's a last
one. Ambiguity problems. Deciding what
to build when the market signal is
contradictory. Defining strategy when
three plausible interpretations of
customer data exist and each one leads
to a different product roadmap. Figuring
out what the customer actually wants
when they can't articulate it
themselves. They say they want better
reporting, but they actually want their
boss to stop questioning their numbers.
The hard part is not computing an answer
here. The hard part is figuring out what
the question actually is. This is the
domain of product sense, strategic
intuition, and the ability to hold
multiple incomplete mental models in
tension until one of them resolves.
Models can help explore options here,
but they cannot resolve the ambiguity
because it's not computable ambiguity.
Now, and this is the critical piece,
look at those six different problem
types I just described and ask yourself,
which ones does a dramatic improvement
in pure reasoning actually help? Be
honest. Reasoning helps reasoning
problems. That's obvious. The Gemini
Deep Think results are pure wins for the
reasoning axis. They're enormously
valuable problems because a single
insight in drug discovery can be worth
billions of dollars. A breakthrough in
material science can reshape an entire
industry. A novel proof can unlock an
entire new branch of mathematics. The
problems are some of the highest value
problems we humans work on. So, it's not
that Google isn't tackling things that
are valuable, but we should ask whether
they're tackling things that are used in
daily work. Now, to be fair, pure
reasoning problems do exist in
mainstream business. They're just rarer
and much more specialized than people
tend to assume. I think we sometimes
think a lot of business is reasoning
because we like to flatter ourselves.
It's not. Here are a few examples of reasoning problems that are real in business. Multi-jurisdiction tax
optimization is an example of a genuine
reasoning problem. The tax codes across, say, 12 countries are all known inputs.
The question is very well defined but
the interaction effects between them
create a combinatorial space
mathematically that is genuinely hard to
reason through. Complex derivative pricing, that's another one. So is novel
regulatory compliance. Not read these
3,000 contracts. That's an effort
problem. But does this new financial
instrument trigger reporting obligations
simultaneously under, say, Dodd-Frank, Basel III, and the Hong Kong SFC's updated
guidelines? That's multi-step logical
deduction across interacting rule
systems. And it's the kind of thing
Gemini 3.1 Pro on high would handle
really well. Structural fraud detection,
not machine learning pattern
classification, but tracing a chain of
seven transactions across four entities
and reasoning about whether the
structure implies layered money
movement. That is a reasoning problem.
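To make that distinction concrete, here's a toy sketch of what "interacting rule systems" means in code. Every rule, field, and threshold below is invented for illustration; real regimes are vastly more intricate. The point is that the rules reference each other, so the deduction is joint, not rule-by-rule.

```python
# Toy illustration of multi-step deduction across interacting rule systems.
# All rules, fields, and thresholds are invented; nothing here reflects
# actual Dodd-Frank, Basel III, or SFC requirements.
from dataclasses import dataclass

@dataclass
class Instrument:
    is_swap: bool
    notional_usd: float
    counterparty_is_bank: bool
    booked_in_hk: bool

def us_reportable(i: Instrument) -> bool:
    return i.is_swap and i.notional_usd > 8_000_000  # invented threshold

def capital_charge_applies(i: Instrument) -> bool:
    return i.counterparty_is_bank and i.notional_usd > 1_000_000

def hk_reportable(i: Instrument) -> bool:
    # The interaction effect: this invented rule keys off the US outcome,
    # so no rule can be evaluated in isolation.
    return i.booked_in_hk and us_reportable(i)

inst = Instrument(is_swap=True, notional_usd=9_000_000,
                  counterparty_is_bank=True, booked_in_hk=True)
print(us_reportable(inst), capital_charge_applies(inst), hk_reportable(inst))
```

With three toy rules the joint space is trivial; with twelve real rulebooks that reference each other, it becomes a genuine deduction problem.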
But I want you to notice the pattern in
these ones I described. These business
reasoning problems cluster in
specialized quantitative domains that
look a lot more like applied science
than most of the knowledge work that you
and I do. Did you notice none of them is
coding? And critically, the people who
do this work spend most of their time on
everything except the reasoning. The tax
attorney spends maybe 10% of her week on
the genuine multi-jurisdiction
interaction puzzle and 90% on client
management, document gathering,
coordination with local counsel,
navigating ambiguity about what the
client actually wants to achieve, etc.
The supply chain director's hardest
problem is not the multi-constraint optimization math. It's actually getting
three different vice presidents to agree
on demand forecasts before the math can
even get started. In each of these
cases, the reasoning slice is real and
it's high value, but it's embedded
inside a much larger mass of effort, of
coordination, of ambiguity type work,
which means that a model optimized for
pure reasoning is a tool that helps with
the most intellectually demanding 10% of
these roles. But a model optimized for tools and sustained work ends up helping
with the other 90%. For most knowledge
workers on most days, for most of us,
the problems we face are hard on effort.
They're hard on coordination. They're
hard on emotional intelligence. They're
hard on ambiguity. They're tough on
domain expertise. But the pure reasoning
component, that's a really narrow slice.
I don't have a precise number and I'm
very skeptical of anyone who claims to,
but I do know this. When I look at my
own work, the moments when someone says,
"I need to think harder about this." are
vastly outnumbered by the moments when
someone says, "I need to coordinate with
20 people on this," or, "I need to get
through all of this," or, "I need to
figure out what we're actually trying to
do here," or, "We need to get aligned."
That's why I think Opus 4.6 is going to
end up getting more daily usage in the
office. And I think Google can live with
that. Google would rather you use
Gemini, of course, they're not
indifferent. They have a cloud business
to grow and an ecosystem they want to
feed. But their AI research program
doesn't depend on winning your daily
workflow like it does for Anthropic.
Google is competing for the periodic
moment when a problem shows up that
requires deep novel reasoning. And in
that moment, they want to be the best
and they want to be the cheapest.
They're also positioning for the
scientific frontier, where pure
reasoning problems are dense, where the
payoffs are measured in Nobel prizes and
trillion dollar industries, and where
Google's vertical stack from TPU silicon
to deep mind research gives them a
pipeline nobody else can match. The rest
of the time, you'll probably use Claude
or ChatGPT, and Google will sell the
TPUs that some of those models run on.
So, what does this mean for you and me
tomorrow? Here's where it gets really
applicable to work. Three things I want
to call out. First, stop looking at
benchmarks and start mapping traction in
your domain. I've said stop looking at
benchmarks before. Here's what I mean by
traction. What matters to you should be
which model handles the specific tasks
in your specific workflow most reliably
and that's all that should matter and
you should be the expert on that. Are
you the smartest person in your field
about which AI model handles which task type for you? You should be, because the gap between "I use ChatGPT for everything" and "I route financial modeling to Gemini on high thinking, coding to Claude Code, quick research to Gemini Flash, and deep document analysis to Opus", that gap is the difference between commodity usage and actual leverage. The models have differentiated enough at this point that model routing is its own skill set, and nobody's going to build that routing map for your domain and your business except you. A cardiovascular surgeon is going to route differently (and yes, they will use AI) from a supply chain analyst, who routes differently from a creative director. The task-to-model mapping is very domain specific. And it's the kind of practical knowledge that compounds every single week as the models get better. You should be the expert. And yes, I'm going to put together guides for this that I'll put on the Substack to help you get there. As a rough illustration, here's what a personal routing map can look like in code.
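It's deliberately a toy in Python, and every model name, task label, and default route in it is invented for illustration, not a recommendation; the point is that the map is data you curate, not something a vendor ships you.

```python
# Toy personal routing map: task type -> (model, reasoning depth).
# Model names and routes mirror the examples above and are illustrative.
ROUTING_MAP: dict[str, tuple[str, str | None]] = {
    "financial_modeling": ("gemini-3.1-pro", "high"),
    "coding":             ("claude-code", None),
    "quick_research":     ("gemini-flash", "low"),
    "document_analysis":  ("claude-opus", None),
}

def route(task_type: str) -> tuple[str, str | None]:
    """Return (model, thinking level) for a task; the default is your call."""
    return ROUTING_MAP.get(task_type, ("gemini-flash", "low"))

print(route("financial_modeling"))   # ('gemini-3.1-pro', 'high')
print(route("weekly_status_email"))  # falls through to the cheap default
```

Second, start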
disentangling the dimensions of
difficulty in your work. What in your
world is genuinely bottlenecked by
reasoning? What's bottlenecked by effort
or coordination or emotional
intelligence? By domain expertise, by
ambiguity, by something I haven't even
named yet. Maybe it's political risk or
regulatory uncertainty or talent
scarcity. This problem decomposition
matters because each dimension is
getting automated on a very different
timeline at a different rate by
different tools. Pure reasoning problems
are getting dramatically easier to solve
right now. That's what the ARC-AGI-2 score doubling in 3 months means. But
effort problems are getting automated in
a different way. They're getting
automated by agentic models that sustain
work for hours or days. Think Opus 4.6 or Codex 5.3 on coding problems.
Coordination problems are starting to
yield to agent teams and tool-augmented orchestration. Domain expertise is
slowly being absorbed into the training
data. The gap between "I've actually done it" and "I've just thought about it" is still a real thing, though. And that's why we need some very good engineers and very good staffers who have real lived experience on the ground at a senior level.
Emotional intelligence, judgment,
ambiguity, courage, those are not
problems touched by AI today. Those will
be the last dimensions to yield if at
all. And this is where your map of work
matters. If you know that most of your
value comes from axes that AI isn't
automating, frankly, you can sleep a
little better, but you should also be
smarter about where you allocate AI on
the pieces that are tractable with AI.
If you discover that most of your value
comes from the reasoning axis or the
effort axis, you need to move
deliberately toward dimensions where
human judgment dominates and you need to
get really good at routing your work to
tools that are good at reasoning or good
at sustained work for hours or days. If
you're like, I don't know, how do I do
this? I will put together a promptable
guide for you. But you can't predict which parts of your value are durable and which are dissolving if you don't think about it, if you don't engage with it, if you don't decompose your work into these types of difficulty. I wish the model makers would make this easier by talking honestly about what their models are good at and what they're not, but right now we mostly have them bragging about benchmark scores. And you get the impression they're getting generally smarter and you get confused
and you wonder, well, if Gemini 3.1 Pro
is the best in the world, why is it not
good at managing teams of agents?
Because that's a different kind of
intelligence and frankly, they're not
optimizing for it. Third, build the taste (and yes, it's a buildable skill) to evaluate AI output in your domain.
Every model improvement is making this
question more urgent for you, not less.
When Opus can sustain autonomous coding
for weeks and Gemini can reason through
novel logic problems, the question is
not can AI do it. Increasingly the
question is can you tell whether what AI
produced is actually good. Lisa Carbone
is a mathematician at Rutgers and she
used Gemini deep think to review a
highly technical mathematics paper and
it caught a subtle logical flaw that had
passed human peer review. Look, that's
very impressive for the model. But
notice what it required from Carbone.
The judgment to know which paper to
review, the expertise to evaluate
whether the model's finding was correct,
and the domain authority to act on the
result. The model did find the flaw. The
human had to validate that and give the
model the task. Both steps were
necessary. Neither was sufficient by
itself. That judgment, the ability to
look at a financial model and know the
assumptions are wrong, to read a legal
analysis and spot the missing precedent,
that is a skill that continues to
compound. Every other skill is getting
cheaper. But that one, that one's
getting more valuable because the models
are getting better at generating very
plausible looking output that requires
genuine expertise to dig into and check
and verify. And so, yes, while I'm
putting together some guides that will
help you dig in, I want to emphasize you
don't necessarily need my guides. The
point is your work and your thinking.
You need the work. You need to do the
work of applying whatever materials you
get, my guide, something else, YouTube,
to figure out what in your world is
actually good AI output. So yes, I'm
building guides that go deeper on model
routing by domain, that go deeper on problem-axis mapping, because I want
this to be easier and I haven't seen
them anywhere else. But the work of
applying them to your world, that's your
work. That's always been your work and
there's no substitute for it and it's
incredibly valuable right now. I want to
step back and I want to look at Google's
quiet game here. There's a version of
the AI story that's all about speed,
that's all about market share, who ships
the fastest, who wins the enterprise,
who reaches a billion users. That's the
story that OpenAI and Anthropic are
living. It's an important story. The
products they're building are changing
how we work and how we live all the
time. But there's another version of the
AI story, and that's the version where a
company backed by a hundred billion
dollars in annual cash flow is running
on proprietary silicon that it designs
and manufactures, employing a team that
won a Nobel Prize, and operating under a
CEO who has been saying, "Solve
intelligence since long before other
people took AI seriously." That company
isn't trying to win the product race.
That company is Google and they think
the product race is a little bit of a
sideshow. The main event is intelligence
itself. And if you solve intelligence,
the products take care of themselves.
Gemini 3.1 Pro is ultimately a marker on
that road. It is the purest reasoning
model available at scale at the lowest
price from the only company with the
infrastructure to keep pushing the
reasoning frontier indefinitely. It will
not be the most used model this month.
Claude will handle more daily tasks. I
think ChatGPT may well have more daily
active users for a long time to come.
Google would prefer that to be
otherwise, but they can afford to be
very patient because they're building
the thing underneath the thing. The
engine that disproves conjectures, the
engine that discovers drug-binding sites,
the engine that catches errors in
peer-reviewed papers, and that pushes
the boundary of what thinking means when
a machine does it for you. The practical
takeaway is not which model to use. Lots
of other YouTube videos will tell you
that. I'm not here to tell you that.
It's that the model landscape has differentiated clearly enough that "which AI should I use" is actually becoming the wrong question to ask. The right question is which AI
should I use for which problem? And how
do I even know what kind of problem I'm
solving? Is it a reasoning problem? Is
it an effort problem? Is it a
coordination problem? Is it an ambiguity
problem? Each one has a different best
tool, a different automation timeline,
and a different implication for your
career. Get specific. Build a map of your domain and which problems are AI-tractable with which models, because the tools are now specific enough to reward
that. And the people who route them well
are going to way outperform the people
who use one model for everything. And
that margin is going to widen every
single month. Look, the fog around the
AI race remains thick. It is hard to get
signal, but we can see enough to say this: we know that routing the right model to the work makes a difference. So let's
not make it complicated. Let's not sit
there and stress about whether Gemini
3.1 Pro is the best and I have to switch
everything to that. That is the wrong
question to ask. Just ask what is the
kind of problem I'm facing? What is the
model at the frontier that I need to use
for that kind of problem? And by the way, some problems, especially effort problems, don't even need a model at the frontier. You can use a dumber model for those, and that's totally fine.
One of the big skills going forward is
going to be learning when you need a
smart model or not. So, that is Gemini
3.1 Pro. It is indeed the smartest model
on the planet, and I don't think Google
cares all that much whether you use it
at work tomorrow or not. Cheers.