GPT-5.3-Codex and Opus 4.6 in 6 min

Transcript

0:00
OpenAI didn't even give Anthropic more than 10 minutes of glory, releasing GPT-5.3-Codex just minutes after the release of the Opus 4.6 model. And honestly, when you factor in all the major releases we've had from just these two companies in the past year, it can be quite exhausting to keep track of them all. So today, we're going to get straight to the heart of why this matters and discuss the technical breakthroughs. Welcome to Caleb Wright's Code, where every second counts.

0:29
One of the biggest gripes that developers have with large language models is the context window. For example, Google's Gemini offers 1 million tokens and OpenAI's GPT offers 400,000 tokens, while Anthropic's previous model only offered 200,000 tokens. And while agentic applications like Claude Code do help manage their own context pretty efficiently, 200,000 tokens was certainly lagging behind the rest of the pack.
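To make those limits concrete, here's a minimal sketch of checking whether a prompt fits each advertised window, using `tiktoken`'s `cl100k_base` encoding as a rough proxy (each vendor's real tokenizer counts somewhat differently, so treat the numbers as estimates):

```python
# Rough sketch: check a prompt against each vendor's advertised context
# window. Uses tiktoken's cl100k_base encoding as a generic proxy; the
# actual token count varies by vendor tokenizer.
import tiktoken

CONTEXT_WINDOWS = {  # advertised limits, in tokens
    "gemini": 1_000_000,
    "gpt": 400_000,
    "claude-previous": 200_000,
}

def fits(prompt: str) -> dict[str, bool]:
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(prompt))
    return {model: n_tokens <= limit for model, limit in CONTEXT_WINDOWS.items()}
```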

0:58
And you might think scaling the context window should be easy, but in reality there are drawbacks along the way. For example, while Google's Gemini offers 1 million tokens, as you can see, the model's ability to accurately retrieve from its context decreases as more of the context gets used up. Here, you're seeing how Gemini 3 scored around 25% accuracy by the time it hit the 1 million token mark. You might think this is quite pathetic, but the highest score at the time was only 32.6% accuracy at 1 million context, which is not that high when you think about it.

1:37
Well, until Anthropic released Opus 4.6, which scored a whopping 76% on the 1 million context window. So not only was Anthropic able to jump their context window from 200,000 to 1 million tokens in this model upgrade, they also more than doubled the performance in preventing what's called context rot.

1:59
Just to briefly touch on this metric, which is called MRCR: it's often referred to as a needle-in-the-haystack metric, where you embed repeated facts throughout the context window and ask the model to correctly identify the position of a fact buried somewhere in that context. The "8-needle" here refers to the difficulty of the problem, where eight identical facts are embedded directly in the context, making it that much harder for the model to keep track. So scoring 76% accuracy at 1 million tokens of context is extremely impressive for Anthropic.
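The real MRCR benchmark is more involved, but a minimal 8-needle probe in that spirit might look like this sketch, where `call_model` is a hypothetical stand-in for any chat-completion API:

```python
# Minimal sketch of an 8-needle, needle-in-a-haystack style probe
# (simplified; the real MRCR benchmark is more involved).
import random

NUM_NEEDLES = 8
FILLER = "The quick brown fox jumps over the lazy dog."

def build_haystack(n_sentences: int, needle: str):
    """Return a long document with NUM_NEEDLES identical copies of
    `needle` placed at random sentence slots, plus those slot indices."""
    sentences = [FILLER] * n_sentences
    slots = sorted(random.sample(range(n_sentences), NUM_NEEDLES))
    for i in slots:
        sentences[i] = needle
    return " ".join(sentences), slots

def probe(call_model, needle: str, n_sentences: int = 50_000) -> bool:
    """Ask the model to locate one specific occurrence of the needle.
    `call_model(prompt) -> str` is a hypothetical chat-API wrapper."""
    doc, slots = build_haystack(n_sentences, needle)
    target = random.randrange(NUM_NEEDLES)  # e.g. "occurrence number 5"
    prompt = (
        f"{doc}\n\nThe sentence '{needle}' appears {NUM_NEEDLES} times "
        f"above. Counting sentences from 1, at which sentence index does "
        f"occurrence number {target + 1} appear? Answer with the number only."
    )
    return call_model(prompt).strip() == str(slots[target] + 1)
```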

2:35
Let's quickly switch to OpenAI's GPT-5.3-Codex release. While OpenAI maintained the 400,000 token context window for this model, there were certainly a few noteworthy achievements in this release. First is a 25% increase in inference speed. One of the biggest gripes that people had about the Codex model in comparison to Opus was just how slow the model actually was in real time.

2:59
Another impressive achievement from OpenAI was the model's ability to navigate the terminal well. We know that ever since Claude Code was released back in February 2025, and even Codex CLI, which was released in April 2025, terminal-based agents have sort of been the de facto place where these models live and breathe. So being able to navigate that environment is critical.

3:25
One of the ways to test this very thing is to set up a series of isolated environments, drop the model into each one, and give it an objective. Then we can test the model to see if it can solve its way out by running the necessary terminal commands. This kind of benchmark is called Terminal-Bench, whose second edition has 89 of these isolated environments; models are dropped into Docker containers to see if they can solve varying tasks like building repositories, setting up a server, training LLMs, or just about anything that people would actually use them for in real life.
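Terminal-Bench's actual harness is more elaborate, but a minimal sketch of the loop might look like this (assuming Docker is installed; `ask_model` is a hypothetical chat-API wrapper, and the verifier command is task-specific):

```python
# Minimal sketch of a Terminal-Bench-style harness: drop an agent into an
# isolated Docker container, let it run shell commands toward an objective,
# then check the result with a task-specific verifier.
import subprocess

def sh(container: str, cmd: str) -> str:
    """Run one shell command inside the container and capture its output."""
    out = subprocess.run(
        ["docker", "exec", container, "bash", "-lc", cmd],
        capture_output=True, text=True, timeout=120,
    )
    return out.stdout + out.stderr

def run_task(ask_model, objective: str, check_cmd: str, max_steps: int = 20) -> bool:
    # Start a fresh, isolated container for this task.
    container = subprocess.run(
        ["docker", "run", "-d", "ubuntu:24.04", "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    history = [f"Objective: {objective}. Reply with one shell command per turn, "
               f"or DONE when finished."]
    try:
        for _ in range(max_steps):
            cmd = ask_model(history)          # hypothetical chat-API wrapper
            if cmd.strip() == "DONE":
                break
            history.append(f"$ {cmd}\n{sh(container, cmd)}")
        return "PASS" in sh(container, check_cmd)  # task-specific verifier
    finally:
        subprocess.run(["docker", "rm", "-f", container], capture_output=True)
```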

4:06
OpenAI jumped from 64% with their previous GPT-5.2-Codex to 77% with their GPT-5.3-Codex model. And while the official records show that they actually scored 75%, the difference is still quite large in comparison to Opus 4.6. OpenAI also said that they used GPT-5.3-Codex to assist in building itself, which certainly points one step toward the singularity, where someday AI could train itself faster than we can, and the pace of innovation at that point will spiral so fast that AI will outpace our ability to contribute to its own self-improvement.

4:45
Let's now compare the iteration speed of OpenAI and Anthropic. While Anthropic's Opus line has been iterating every 2 to 3 months, OpenAI has been narrowing its cycle from 4 months, to two months, and now to one to two months between iterations. I think model releases at a frequency of every month, or even every other week, are something we could see in the near future, and the competition at that point will get even fiercer between all the frontier labs.

5:17
Pricing is also an area where OpenAI certainly has a huge advantage over Anthropic: GPT-5.3-Codex is offered at $1.75 per million input tokens and $14 per million output tokens, while Opus 4.6 is offered at $5 per million input tokens and $25 per million output tokens.
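A quick back-of-the-envelope comparison using those quoted prices (the workload figures here are made up purely for illustration):

```python
# Back-of-the-envelope cost comparison using the prices quoted above.
# The example workload (tokens in/out) is invented for illustration.
PRICES = {  # (USD per 1M input tokens, USD per 1M output tokens)
    "gpt-5.3-codex": (1.75, 14.00),
    "opus-4.6": (5.00, 25.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# e.g. a heavy agentic session: 2M tokens in, 200k tokens out
for model in PRICES:
    print(model, f"${cost(model, 2_000_000, 200_000):.2f}")
# gpt-5.3-codex $6.30   vs   opus-4.6 $15.00
```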

5:36
And the word on the street is that Opus 4.6 is actually very token hungry. So while most people use ChatGPT's $20 plan to access the Codex models with enough room for credits, Opus 4.6 will likely burn through usage much faster, especially given its large 1 million token context window. The same $20 plan from Anthropic might not actually suffice, given the 5-hour limit window that refreshes each time. What do you think about these releases and the juxtaposition of OpenAI's GPT-5.3-Codex against Anthropic's Opus 4.6? Which do you prefer?

Interactive Summary

The video compares recent large language model releases: Anthropic's Opus 4.6 and OpenAI's GPT-5.3-Codex. Opus 4.6 made a significant leap by increasing its context window from 200,000 to 1 million tokens and achieved an impressive 76% accuracy against 'context rot' on the MRCR benchmark. GPT-5.3-Codex, while maintaining a 400,000 token context, improved its inference speed by 25% and showed substantial gains in terminal navigation, scoring 77% on the Terminal-Bench benchmark. The video also highlights OpenAI's use of Codex to help build itself, faster iteration speeds from both companies, and a notable pricing advantage for OpenAI's model.
