GPT-5.2 Low Found What Opus 4.5 Missed
The cheap model won! I ran the same prompt
through 6 different AI models, and
GPT-5.2-Low just beat Opus 4.5 for a code
review.
I needed a code audit on YeeBall, which
is my multiplayer soccer game, so I had
Opus write the perfect review prompt.
About 52,000 tokens of instructions. And
then I fed that exact same prompt to every
major model.
So check it out. Gemini went rogue. It started editing files when all I asked it for was a code review.
But the real surprise was the model with
the lowest reasoning effort found things
that Opus 4.5 missed.
Crazy, right?
So GPT-5.2 low caught an interesting
issue with the way that sprites were
rendered in my app.
So for procedural tasks, you may not need
to be reaching for that expensive thinking
model if you just need some clarity.
Let me show you what exactly happened.
Code review handoff.
First, let me review your current work
documentation.
I've created a comprehensive handoff
review doc.
This prompt will then guide the reviewing
agent to the core objective, what's
included.
Opus is so good. So I had Opus
orchestrate this. This only cost me about
52,000 tokens and now I'm going to have
this handoff document and just switch this
over to
Gemini. Okay, so what's included? Full
audit checklist for phases one through
seven.
Cool. So key areas of focus are here. The
document also plainly marks what to
ignore,
and so on.
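Based on what's shown on screen, a handoff document along these lines might look something like the sketch below. The section names beyond the ones mentioned in the video, and the specific items under "what to ignore," are my guesses, not the actual document:

```markdown
# Code Review Handoff — YeeBall

## Core Objective
Conduct a thorough, read-only code review. Do NOT edit, create, or delete files.

## What's Included
- Full audit checklist for phases 1–7
- Key areas of focus (e.g., sprite rendering, kick animations, multiplayer sync)
- What to ignore (e.g., generated assets, vendored dependencies)

## Audit Checklist (phases 1–7)
1. Verify documented features actually exist in the code
2. Flag undocumented ("ghost") features
3. Security assessment
...

## Output
A written review report only — no code changes.
```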
And of course, the power of having a
smart model like that in your codebase is
massive.
So we're going to test out Gemini 3
Flash.
How do I prompt it?
Let's see.
So cool.
Please work on this.
Gemini 3 Flash has a million token
context window.
This is going to be a really nice, extensive review, and you can see Cursor's already put all of those things into to-do lists, and it's working extremely fast, reading all these files.
Very, very quickly. Whoa, did Gemini already finish? No way. This is moving fast.
Okay.
It's getting to work.
Gemini 3 Flash is fast, but if you have
Opus handing off the prompts to
Gemini 3 Flash, it's amazingly fast. Ooh,
I've updated to-do lists and the progress
made. Okay.
So would you like me to proceed by fixing
or would you like me to start working on
this animation thing?
I have completed all to-dos now, and I'm
ready for the next task.
It's interesting how Gemini 3 is kind of
talking to itself out loud here.
Actually, looking at the handoff, it says
you're tasked with conducting a thorough
code review and
work on this. I should probably fix the
kick animation now since it's broken. So
this is
interesting. The Google Gemini Flash
internal dialogue is sort of being exposed
right now in
Cursor because it's like, it could either
be the system instructions from Cursor or
Gemini's own instructions to really get the work done. But this document only outlines the code review task, right? So it already marked my to-dos as complete. Okay, let's do this. I'll add it... oh my gosh. Oh my gosh. So it just went and started fixing the stuff. Wait, oh no, oh no. Let me re-evaluate. Okay, so I think it wanted to fix stuff, but I don't think it actually did anything. It was very ambitious at wanting to do things. This is really weird. I think it got into a funny state.
Gemini 3 Flash High likes to write files.
It already did this for me.
Jesus. I can't trust you, Gemini 3 Flash,
if you're gonna be doing this to me.
I just asked you to review the code, you
were like, yo bro, I got you, don't worry.
Oh, by the way, I did everything else for
you, bro.
Nah, bro, you need to chill right now.
Gemini 3 Flash, I know you want to show
off.
I know this is your moment right now.
A lot of people are getting your shine
right now, but I need to pull back a
little bit.
Just a little bit.
A little too aggressive right now.
Ay-yi-yi.
Okay.
Yeah, so you see, this is what Gemini
did.
It just said, like, I'll update
everything.
And it's like, did you?
Is it?
I'm just going to discard your work.
Discard. You know what I'm going to do? Instead of Gemini Flash, I'm going to do this audit with, I think, Codex Max with a smaller reasoning window. That'd be interesting.
So Codex Max,
and then I'm just going to do low. Actually, this is probably a good time to see if I can hand this off to Droid and see if it behaves a little bit differently.
I'm really,
really curious. Yeah. Okay. So I got
Factory pulled up. We're going to give it
that same task
I don't think I need a lot of reasoning
effort because it kind of overthinks to be
honest.
I'll do low.
Okay, so it's applying the new changes.
We have flash low.
Okay, I'm curious to see how Gemini 3
flash low is going to work.
The reason why I chose a lower reasoning level is, this is something that Eric Provencher covers on the Rate Limited podcast: he says that sometimes models with high thinking can overthink.
I think I'm kind of seeing that in Cursor right now, because you see the model's like, oh, I should write the code now, right? I did the review, let's go write the code. It was just kind of fighting with itself. I'm just trying to remove that noise from its thinking and be more procedural. Yeah, so here's Droid. And did Droid already finish? Dang,
already finish? Dang,
how did it finish so fast? That's kind of
what's weird. I have completed the
comprehensive code
review already. Dang, did you review the
code? Okay, so I'm going to do the same thing, but I'm also going to try a different model. I'm just really curious. This is how I discovered the models are kind of funny. Okay, so we're going to try 5.2. I'm going to give it the same prompt. We're going to say low.
And I'm going to try this. Right? Just
trying to make it not over complicated. So
this is low.
We're going to do Droid again. So 5.2
high. Let's just see if there's any
difference here.
And then we're going to do Droid again.
We're just going to compare a lot of them.
and then we're going to do CodexMax GPT
5.2 low. We have GPT 5.2 high. And then
now we're going to do CodexMax high.
And then slash model, Droid. Just for
cost, because I know people are trying to
try different things for cost.
Between Gemini Pro and Gemini Flash, what would be better? If we just do Flash on high... So, Gemini 3 Flash on high, because this is what Cursor is using. I'm curious to see what they're going to do. So let's go ahead and see here.
And here we go. So it confirmed the project setup, such as Bun. Okay. Yes, it conducted a review.
Specifically, I verified these things.
Okay.
And I had Opus generate this prompt. So
this is code review handoff.
And we're giving this one to the GPT-5.2
low, which is doing its thing.
This is GPT-5.2 high. It's going to read
the files. You can see the different lists
here.
And then this one is GPT-5.1 CodexMax.
And this only generated a smaller to-do
list.
And then I think what we have to do now
is just see if Opus is going to do
anything.
Droid, slash, model. And I'll do Opus.
Like, maximum intelligence. What do we get
for maximum intelligence?
2x the cost, right? Is it worth the 2x
cost, right? Are you going to spend like
the same amount of tokens eventually to
get to the same place? And just kind of
compress time? I'm just really curious.
Okay, so what we have so far: the codebase is in excellent health and deployable. Ghost
features documented but not built. So none
found, all documented. And then security
assessments. Okay, looks cool.
Undocumented features. Goal celebration
sync, lobby timeouts, maintenance mode,
user XP fields, actual state sprite
integrations.
Better than documented, all five OpenAI
characters have full rotations and running
animations. The kick animations exist but
aren't triggered during gameplay.
Uh-huh, that's what we gotta fix. Security
concerns and the codebase is deployable.
Yes, okay. This is cool.
I like Opus's review. Codex Max is still
going and it's just reading files right
now. So this is pretty good.
As far as low, low kind of did a similar
thing here. So if you're using GPT-5.2
low, this is pretty good.
This actually lines up with what Eric Provencher said, that GPT-5.2 low is very similar to Opus 4.5 in some cases.
So I might have to start reaching to
GPT-5.2 low for a lot more types of things
because of cost, but also thoroughness
too, right?
That's amazing. Wow. GPT 5.2 low? Okay, I
see you. I see you. If you give it a good
prompt, it cooks.
Oh, wow. Frames are loaded. I like the
details. These are really good details.
Okay.
This is pretty good. Compare that to
Opus. Opus is kind of, it's basically the
same. Opus is a little bit more succinct.
It covered more features here that I had
in my document.
But for 5.2 low? For the price? Gemini 3
flash low. I was hoping you would do some
more work here, but compared to ranking
all these right now, it makes sense.
Like, Gemini Flash low probably didn't meet the benchmarks anywhere. I mean, I don't see it anywhere in a lot of benchmarks. Gemini 3 Flash high, I see why it's the default in Cursor, because it can get stuff done. I haven't checked how it compares to Haiku, to be honest. But as far as our winner here, I think it's GPT and Opus 4.5. Then it comes down to a matter of cost. GPT-5.2-Low is nearly on par with Opus 4.5 for this type of task, where I have a very procedural document generated by Opus about what specific things to review across these different files. Opus 4.5 provides a little bit more detail as far as the nuances and little details.
This is an interesting callout.
Active public testing, no live batch.
Okay.
And then Autostart is client host only.
This is a good catch.
So just by looking at this, you already
know it's live.
So this is where GPT-5.1 Codex Max
doesn't really get that.
It's like: your document says live, and I'm just taking that literally, and
it says it should be live.
It needs to be live in here.
I don't see live, you know?
And like Opus 4.5 and all the Sonnet
models are like, chill bro.
Like, isn't it obvious? Like, there's a
countdown, there's a score, there's stuff
happening.
So, that's the difference between some of
these two models, and, you know, that's
GPT-5.1, CodexMax.
GPT-5.2 didn't really get that pedantic,
as we can kind of see, it didn't call this
out. But it's just interesting that I tried
this prompt, the same prompt, across all
these models.
And especially, like, the levels of
thinking too, and like, what it chooses to
hang on to and say,
and so on.
So, it's really hard for me to kind of
discern which one I should lean on.
I think running both of them could be
helpful, but if I were to just choose one,
I think
it's going to be Opus 4.5.
It costs more, but I always keep wanting to try these things and test them out. And I think you get to learn from my experiments and costs here.
So for procedural tasks like audits,
reviews, and checklists, the low reasoning
is actually
going to beat overthinking.
And I want to let you in on the quick
verdict here between all these different
models.
Gemini Flash is fast and so on, but it's actually too aggressive.
All I asked was for a code review and it
just started writing files.
And I can't be having that.
Opus 4.5 is actually the smartest.
It caught the little nuances, but it
costs quite a bit of money.
GPT-5.2 low though, that was nearly on
par with Opus 4.5 for this type of task at
a fraction
of the cost.
So next time you need a code review, just
try the cheap model first.
All you have to do is just write a good
prompt and give it low reasoning and see
what happens.
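As a rough sketch of what "give it low reasoning" looks like in practice: OpenAI's reasoning models accept a `reasoning.effort` setting in the Responses API. The model ID and the handoff text below are placeholders, not the exact ones from this experiment:

```python
# Sketch: sending a code-review handoff prompt with low reasoning effort.
# The payload shape follows the OpenAI Responses API, where reasoning
# models accept reasoning={"effort": ...}. Model ID is a placeholder.

def build_review_request(handoff_doc: str, effort: str = "low") -> dict:
    """Build a request payload for a read-only code review."""
    return {
        "model": "gpt-5",  # placeholder reasoning-model ID
        "reasoning": {"effort": effort},  # "low" | "medium" | "high"
        "input": (
            "You are conducting a code review ONLY. "
            "Do not edit, create, or delete files.\n\n" + handoff_doc
        ),
    }

if __name__ == "__main__":
    handoff = "Code review handoff: audit phases 1-7..."
    payload = build_review_request(handoff, effort="low")
    print(payload["reasoning"])  # {'effort': 'low'}
    # To actually send it (requires an API key):
    # from openai import OpenAI
    # resp = OpenAI().responses.create(**payload)
```

The explicit "review only, don't edit files" line is there because of exactly the Gemini behavior above; the effort knob is the only thing that changes between the "low" and "high" runs.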
And look, this field is changing every
single week.
There's new models, new workflows, and
all that crazy stuff.
If you want to figure out this stuff
together, I'm doing live coding sessions
on the weekends
at Start My AI.
It's completely free and all you have to
do is just put your name on the list and
I'll let you know when I'm streaming.
It is your chance to code with me live
and ask questions while we work through
this stuff
in real time.
Drop in the comments which model you're
using for code reviews.
Alright, let's keep cooking.