GPT-5.2 Low Found What Opus 4.5 Missed

Transcript

0:00

The cheap model won! I ran the same prompt

0:02

through 6 different AI models, and

0:05

GPT-5.2-Low just beat Opus 4.5 for a code

0:08

review.

0:09

I needed a code audit on YeeBall, which

0:11

is my multiplayer soccer game, so I had

0:13

Opus write the perfect review prompt.

0:16

About 52,000 tokens of instructions. And

0:19

then I fed that exact same prompt to every

0:21

major model.
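The fan-out step described here — one Opus-written handoff prompt, sent unchanged to every model — can be sketched as a tiny harness. Everything below is hypothetical (the `runReview` stub, the model names, the effort labels); a real version would call each agent's own CLI or API with the ~52,000-token handoff document.

```typescript
// Hypothetical sketch: send the same review prompt to several
// model/effort configurations and collect the results.
type ReviewConfig = { model: string; effort: "low" | "high" };

const configs: ReviewConfig[] = [
  { model: "gemini-3-flash", effort: "high" },
  { model: "gemini-3-flash", effort: "low" },
  { model: "gpt-5.2", effort: "low" },
  { model: "gpt-5.2", effort: "high" },
  { model: "gpt-5.1-codex-max", effort: "high" },
  { model: "opus-4.5", effort: "high" },
];

// Stub standing in for whatever CLI or API each agent actually uses
// (Cursor, Codex, Droid, etc.) — not a real client.
function runReview(cfg: ReviewConfig, prompt: string): string {
  return `[${cfg.model} @ ${cfg.effort}] review of ${prompt.length}-char prompt`;
}

const handoffPrompt =
  "You are tasked with conducting a thorough code review..."; // truncated

const results = new Map<string, string>();
for (const cfg of configs) {
  results.set(`${cfg.model}/${cfg.effort}`, runReview(cfg, handoffPrompt));
}
```

The point of the harness is just that every model sees byte-identical instructions, so any difference in output comes from the model, not the prompt.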

0:22

So check it out. Gemini went rogue. It

0:25

started editing files when all I asked it

0:28

for was a code review.

0:29

But the real surprise was the model with

0:32

the lowest reasoning effort found things

0:34

that Opus 4.5 missed.

0:36

Crazy, right?

0:37

So GPT-5.2 low caught an interesting

0:39

issue with the way that sprites were

0:42

rendered in my app.

0:43

So for procedural tasks, you may not need

0:45

to be reaching for that expensive thinking

0:47

model if you just need some clarity.

0:49

Let me show you what exactly happened.

0:51

Code review handoff.

0:52

First, let me review your current work

0:54

documentation.

0:54

I've created a comprehensive handoff

0:56

review doc.

0:57

This prompt will then guide the reviewing

0:59

agent to the core objective, what's

1:02

included.

1:03

Opus is so good. So I had Opus

1:05

orchestrate this. This only cost me about

1:09

52,000 tokens and now I'm going to have

1:11

this handoff document and just switch this

1:13

over to

1:14

Gemini. Okay, so what's included? Full

1:15

audit checklist for phases one through

1:17

seven.

1:18

Cool. So key areas of focus are here. The

1:21

document also plainly marks what to

1:23

ignore,

1:23

and so on.

1:24

And of course, the power of having a

1:28

smart model like that in your codebase is

1:32

massive.

1:33

So we're going to test out Gemini 3

1:35

Flash.

1:35

How do I prompt it?

1:37

Let's see.

1:38

So cool.

1:39

Please work on this.

1:40

Gemini 3 Flash has a million token

1:42

context window.

1:44

This is going to be a really nice

1:45

extensive review and you can see Cursor's

1:47

already put

1:47

all of those things in here into to-do

1:49

lists and it's working extremely fast with

1:51

reading

1:52

all these files.

1:53

Very, very quickly. Whoa, did Gemini

1:55

already finish? No way. This is moving

1:57

fast.

1:59

Okay.

2:01

It's getting to work.

2:04

Gemini 3 Flash is fast, but if you have

2:07

Opus handing off the prompts to

2:10

Gemini 3 Flash, it's amazingly fast. Ooh,

2:13

I've updated to-do lists and the progress

2:15

made. Okay.

2:16

So would you like me to proceed by fixing

2:18

or would you like me to start working on

2:19

this animation thing?

2:20

I have completed all to-dos now, and I'm

2:22

ready for the next task.

2:23

It's interesting how Gemini 3 is kind of

2:25

talking to itself out loud here.

2:27

Actually, looking at the handoff, it says

2:29

you're tasked with conducting a thorough

2:31

code review and

2:32

work on this. I should probably fix the

2:34

kick animation now since it's broken. So

2:36

this is

2:36

interesting. The Google Gemini Flash

2:39

internal dialogue is sort of being exposed

2:41

right now in

2:42

Cursor because it's like, it could either

2:44

be the system instructions from Cursor or

2:47

Gemini's instructions to want to like

2:49

really get the work done. But this document

2:51

only outlines the

2:52

code review task, right? So it's already

2:54

marked my to-dos as complete. Okay, let's

2:56

do this. I'll

2:57

add a to-do for this. Oh my gosh. Oh my gosh.

3:00

so it just went to start fixing the stuff

3:04

wait oh no oh no

3:08

Let me reevaluate. Okay, so I think it

3:10

basically ended up... it wanted to fix stuff,

3:12

but I don't think, I don't think it did

3:14

anything. It's very ambitious at wanting to

3:16

do things. This is really weird. I think it

3:19

got into a funny state.

3:20

Gemini 3 Flash High likes to write files.

3:22

It already did this for me.

3:27

Jesus. I can't trust you, Gemini 3 Flash,

3:29

if you're gonna be doing this to me.

3:31

I just asked you to review the code, you

3:34

were like, yo bro, I got you, don't worry.

3:36

Oh, by the way, I did everything else for

3:38

you, bro.

3:39

Nah, bro, you need to chill right now.

3:41

Gemini 3 Flash, I know you want to show

3:43

off.

3:43

I know this is your moment right now.

3:45

A lot of people are getting your shine

3:46

right now, but I need to pull back a

3:48

little bit.

3:49

Just a little bit.

3:50

A little too aggressive right now.

3:53

Ay-yi-yi.

3:54

Okay.

3:54

Yeah, so you see, this is what Gemini

3:57

did.

3:57

It just said, like, I'll update

3:58

everything.

3:59

And it's like, did you?

4:00

Is it?

4:02

I'm just going to discard your work.

4:03

Discard. You know what I'm going to do? I'm

4:06

going to do this audit with, instead of

4:09

Gemini Flash,

4:11

low-key, I think, Codex Max with a smaller

4:13

reasoning window. That'd be interesting.

4:17

So Codex Max,

4:18

and then I'm just going to do low. I'm

4:19

also going to actually this is probably a

4:20

good time to see

4:21

if I can hand this off to Droid and see

4:23

if it behaves a little bit differently.

4:26

I'm really,

4:26

really curious. Yeah. Okay. So I got

4:29

Factory pulled up. We're going to give it

4:31

that same task

4:32

I don't think I need a lot of reasoning

4:36

effort because it kind of overthinks to be

4:41

honest.

4:42

I'll do low.

4:43

Okay, so it's applying the new changes.

4:45

We have flash low.

4:46

Okay, I'm curious to see how Gemini 3

4:49

flash low is going to work.

4:51

The reason why I chose a lower reasoning

4:53

level is, so this is something that Eric

4:55

Provencher

4:56

covers in the Rate Limited podcast, and

4:58

he says that sometimes the models with a

5:00

high thinking can overthink.
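That overthinking heuristic can be written down as a one-liner. This is just my framing of the idea, not anything from the podcast or any tool's real API — task names and effort labels are illustrative:

```typescript
// Sketch of the heuristic: when a handoff document has already done
// the thinking, low reasoning effort is enough; open-ended work gets
// high effort. The TaskKind values are made up for illustration.
type TaskKind = "audit" | "review" | "checklist" | "design" | "debug-unknown";

function pickEffort(kind: TaskKind): "low" | "high" {
  // Procedural: the prompt spells out exactly what to check.
  const procedural: TaskKind[] = ["audit", "review", "checklist"];
  return procedural.includes(kind) ? "low" : "high";
}

// pickEffort("review") → "low"; pickEffort("design") → "high"
```

The split mirrors what happens in the video: the review prompt is procedural, so the extra thinking budget mostly produced noise ("I did the review, let's go write the code").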

5:02

I think I'm kind of seeing that in Cursor

5:03

right now because you see the models like,

5:06

oh I should then write the code now,

5:07

right? I did the review, let's go write

5:09

the code. Like,

5:10

and it was just kind of fighting with

5:11

itself. I'm just trying to remove that

5:13

noise from its thinking

5:14

and just kind of be more procedural

5:16

based. Yeah, so here's Droid and did Droid

5:19

already finish? Dang,

5:21

how did it finish so fast? That's kind of

5:22

what's weird. I have completed the

5:23

comprehensive code

5:24

review already. Dang, did you review the

5:27

code? Okay, so I'm going to do the same

5:30

thing but I'm

5:31

I'm also going to just try a different

5:33

model. I'm just really curious. This is

5:35

how I discovered the models are kind of

5:37

funny.

5:38

Okay, so we're going to try 5.2.

5:40

I'm going to give it the same. We're going

5:42

to say low.

5:44

And I'm going to try this. Right? Just

5:46

trying to make it not over complicated. So

5:48

this is low.

5:49

We're going to do Droid again. So 5.2

5:51

high. Let's just see if there's any

5:52

difference here.

5:54

And then we're going to do Droid again.

5:56

We're just going to compare a lot of them.

5:58

And then we're going to do CodexMax GPT

6:02

5.2 low. We have GPT 5.2 high. And then

6:05

now we're going to do CodexMax high.

6:12

And then slash model, Droid. Just for

6:13

cost, because I know people are trying to

6:16

try different things for cost.

6:17

Between Gemini Pro and Gemini Flash, what

6:20

would be better? If we just do Flash on

6:23

high... So, Gemini 3 Flash on high, because

6:26

this is what Cursor is using. I'm curious to

6:29

see what they're going to do. So let's go ahead

6:32

and see here.

6:34

And here we go. So it confirmed the project

6:37

setup, such as Bun. Okay. Yes, I conducted a

6:40

review.

6:41

Specifically, I verified these things.

6:44

Okay.

6:48

And I had Opus generate this prompt. So

6:50

this is code review handoff.

6:52

And we're giving this one to the GPT-5.2

6:54

low, which is doing its thing.

6:56

This is GPT-5.2 high. It's going to read

6:58

the files. You can see the different lists

7:00

here.

7:01

And then this one is GPT-5.1 CodexMax.

7:04

And this only generated a smaller to-do

7:07

list.

7:07

And then I think what we have to do now

7:09

is just see if Opus is going to do

7:10

anything.

7:11

Droid, slash model. And I'll do Opus.

7:14

Like, maximum intelligence. What do we get

7:17

for maximum intelligence?

7:19

2x the cost, right? Is it worth the 2x

7:21

cost, right? Are you going to spend like

7:23

the same amount of tokens eventually to

7:26

get to the same place? And just kind of

7:28

compress time? I'm just really curious.
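The 2x-cost question is simple arithmetic if you assume both runs burn a similar number of tokens. The prices below are made up for illustration — they are not real provider pricing:

```typescript
// Illustrative cost math with hypothetical prices: if the premium
// model charges 2x per token and both runs consume roughly the same
// number of tokens, you pay roughly 2x to reach the same place.
function runCostUSD(tokens: number, pricePerMTokUSD: number): number {
  return (tokens / 1_000_000) * pricePerMTokUSD;
}

const reviewTokens = 52_000;                  // roughly the handoff prompt size
const cheap = runCostUSD(reviewTokens, 2);    // hypothetical $2 / M tokens
const premium = runCostUSD(reviewTokens, 4);  // hypothetical 2x price

// Same tokens, double the per-token price → double the spend.
```

The interesting case is when the premium model finishes in fewer tokens or fewer retries — then the effective ratio is less than the sticker price suggests, which is exactly what the video is probing.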

7:30

Okay, so what we have so far is: the codebase is in

7:33

excellent health and deployable. Ghost

7:37

features documented but not built. So none

7:40

found, all documented. And then security

7:43

assessments. Okay, looks cool.

7:46

Undocumented features. Goal celebration

7:49

sync, lobby timeouts, maintenance mode,

7:51

user XP fields, actual state sprite

7:53

integrations.

7:54

Better than documented, all five OpenAI

7:56

characters have full rotations and running

7:58

animations. The kick animations exist but

8:00

aren't triggered during gameplay.
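That kick-animation finding reads like a classic "asset exists, trigger missing" bug. Here's a hypothetical reconstruction of the shape of the issue — none of these class, state, or file names are from the actual YeeBall code:

```typescript
// Hypothetical sketch: the kick frames are loaded and registered,
// but the gameplay input path never switches the sprite's state to
// "kick", so the frames never play.
type SpriteState = "idle" | "run" | "kick";

class PlayerSprite {
  state: SpriteState = "idle";

  readonly animations = new Map<SpriteState, string[]>([
    ["idle", ["idle_0.png"]],
    ["run", ["run_0.png", "run_1.png"]],
    ["kick", ["kick_0.png", "kick_1.png", "kick_2.png"]], // loaded but never shown
  ]);

  // Buggy version: applies ball physics but leaves the sprite state alone.
  kickBallBuggy(): void {
    // ...apply impulse to the ball; animation state untouched
  }

  // Fixed version: the same handler also triggers the existing animation.
  kickBall(): void {
    // ...apply impulse to the ball
    this.state = "kick"; // actually play the kick frames
  }

  currentFrames(): string[] {
    return this.animations.get(this.state) ?? [];
  }
}
```

A review that only checks "do the kick assets exist?" passes; one that traces the input path to the state change catches it — which is plausibly the kind of detail a procedural checklist surfaces even at low reasoning effort.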

8:02

Uh-huh, that's what we gotta fix. Security

8:04

concerns and the codebase is deployable.

8:06

Yes, okay. This is cool.

8:07

I like Opus's review. Codex Max is still

8:10

going and it's just reading files right

8:13

now. So this is pretty good.

8:15

As far as low, low kind of did a similar

8:18

thing here. So if you're using GPT-5.2

8:22

low, this is pretty good.

8:25

This actually lines up with what Eric

9:28

Provencher said, that GPT-5.2 low is very

8:31

similar to Opus 4.5 in some cases here.

8:34

So I might have to start reaching to

8:37

GPT-5.2 low for a lot more types of things

8:40

because of cost, but also thoroughness

8:43

too, right?

8:44

That's amazing. Wow. GPT 5.2 low? Okay, I

8:47

see you. I see you. If you give it a good

8:49

prompt, it cooks.

8:52

Oh, wow. Frames are loaded. I like the

8:55

details. These are really good details.

8:58

Okay.

9:01

This is pretty good. Compare that to

9:03

Opus. Opus is kind of, it's basically the

9:06

same. Opus is a little bit more succinct.

9:09

It covered more features here that I had

9:11

in my document.

9:12

But for 5.2 low? For the price? Gemini 3

9:16

flash low. I was hoping you would do some

9:19

more work here, but compared to ranking

9:23

all these right now, it makes sense.

9:26

Like, Gemini 3 Flash low probably didn't meet the

9:29

benchmarks anywhere. I mean, I don't see it

9:32

anywhere in a lot of benchmarks. Gemini 3

9:34

Flash high, I see why it's the default in

9:37

Cursor, because it can get stuff done. I

9:39

haven't checked how it compares to Haiku, to

9:42

be honest. But as far as our winner here, I

9:45

think it's GPT and Opus 4.5. Then it goes in

9:47

terms

9:48

of cost. GPT-5.2-Low is nearly

9:51

on par with 4.5 for this type of task,

9:54

where I have a

9:55

very procedural document from Opus that's

9:57

generated about what specific things to do

9:59

to go review these

10:00

different files. Opus 4.5 provides a

10:03

little bit more detail as far as the

10:05

nuances and little details.

10:36

This is an interesting callout.

10:37

Active public testing, no live batch.

10:39

Okay.

10:40

And then Autostart is client host only.

10:43

This is a good catch.

10:44

So just by looking at this, you already

10:46

know it's live.

10:47

So this is where GPT-5.1 Codex Max

10:48

doesn't really get that.

10:50

It's like, your document says live and

10:52

I'm literally just taking it literally and

10:54

it says it should be live.

10:56

It needs to be live in here.

10:57

I don't see live, you know?

10:59

And like Opus 4.5 and all the Sonnet

11:01

models are like, chill bro.

11:03

Like, isn't it obvious? Like, there's a

11:06

countdown, there's a score, there's stuff

11:09

happening.

11:10

So, that's the difference between some of

11:13

these two models, and, you know, that's

11:15

GPT-5.1, CodexMax.

11:16

GPT-5.2 didn't really get that pedantic,

11:19

as we can kind of see, it didn't call this

11:21

out. But it's just interesting that I tried

11:23

this prompt, the same prompt, across all

11:24

these models.

11:25

And especially, like, the levels of

11:27

thinking too, and like, what it chooses to

11:30

hang on to and say,

11:31

and so on.

11:32

So, it's really hard for me to kind of

11:34

discern which one I should lean on.

11:36

I think running both of them could be

11:38

helpful, but if I were to just choose one,

11:40

I think

11:41

it's going to be Opus 4.5.

11:42

It costs more, but I do always keep

11:44

wanting to try these things to test them

11:45

out. And I think you get to learn from my

11:48

experiments and cost here.

11:50

So for procedural tasks like audits,

11:52

reviews, and checklists, low reasoning

11:54

is actually

11:55

going to beat overthinking.

11:56

And I want to let you in on the quick

11:57

verdict here between all these different

11:59

models.

11:59

Gemini Flash is fast,

12:01

and so on,

12:02

but it's actually too aggressive.

12:03

All I asked was for a code review and it

12:05

just started writing files.

12:07

And I can't be having that.

12:08

Opus 4.5 is actually the smartest.

12:11

It caught the little nuances, but it

12:13

costs quite a bit of money.

12:15

GPT-5.2 low though, that was nearly on

12:18

par with Opus 4.5 for this type of task at

12:20

a fraction

12:21

of the cost.

12:22

So next time you need a code review, just

12:24

try the cheap model first.

12:25

All you have to do is just write a good

12:27

prompt and give it low reasoning and see

12:29

what happens.

12:30

And look, this field is changing every

12:32

single week.

12:33

There's new models, new workflows, and

12:35

all that crazy stuff.

12:36

If you want to figure out this stuff

12:38

together, I'm doing live coding sessions

12:40

on the weekends

12:41

at Start My AI.

12:42

It's completely free and all you have to

12:44

do is just put your name on the list and

12:45

I'll let you know when I'm streaming.

12:46

It is your chance to code with me live

12:48

and ask questions while we work through

12:49

this stuff

12:50

in real time.

12:51

Drop in the comments which model you're

12:52

using for code reviews.

12:53

Alright, let's keep cooking.

Interactive Summary

The speaker conducted an experiment comparing six different AI models for a code review of their multiplayer soccer game, YeeBall, using a detailed prompt generated by Opus 4.5. Surprisingly, the less expensive GPT-5.2-Low model outperformed Opus 4.5 by identifying issues that the latter missed. Gemini 3 Flash, on the other hand, exhibited an aggressive tendency to edit files rather than merely reviewing them. The experiment highlighted that models with "low reasoning" are often more effective for procedural tasks like code audits, as high reasoning can lead to "overthinking" and unwanted actions. Ultimately, GPT-5.2-Low was identified as a cost-effective solution that provides thorough results comparable to Opus 4.5, especially when given a well-crafted prompt.
