How Reasoning LLMs Like GPT-5 Are Built
In this video, we'll look at reasoning LLMs, which power many of today's most advanced chatbots, like GPT-5.
We will cover three things: one, what reasoning LLMs are; two, different inference-time and training-time techniques to build reasoning LLMs; and three, what the GPT-5 unified system might look like.
A short background first. After the success of standard LLMs like GPT-3, GPT-4, the Llama family, and Gemini between 2022 and 2024, OpenAI released the o1 model in September 2024 and framed it as a model that can think.
In the announcement article, they introduce o1 as a new series of AI models designed to spend more time thinking before they respond.
They also published another article on the same day, mostly focused on evaluation, showing how much improvement they get by training this reasoning LLM. If we scroll to the eval section, we see that on different benchmarks, for example AIME, which contains USA math olympiad qualifier questions, the o1 models do significantly better than the non-reasoning equivalent, which was GPT-4o back then. This line shows the accuracy of these models. Similarly, on other benchmarks, like the Codeforces competition-code benchmark or PhD-level science questions, o1 does a lot better than GPT-4o.
After that, we observed a wave of reasoning models from different teams and companies. Popular examples are Claude, Gemini 2.5 Pro, GPT-5, and DeepSeek R1, and there are many more. Most of these reasoning models were released in 2025.
I think this meme nicely visualizes what we just saw. In 2024 we had all these non-reasoning LLMs; then o1 came, and now, in 2025, we have many more powerful reasoning models being released, and many of them do very well on different benchmarks.
So next, let's see what reasoning LLMs are.
What is reasoning? If we go to Wikipedia and search for reason, we find in great detail what reasoning is: different definitions in cognitive psychology and other domains, as well as different forms of reasoning. If you are interested, you can read this page; it's quite long.
But the summary is that in cognitive psychology, reasoning is defined as the process of drawing conclusions based on available information. Reasoning can also appear in different forms. For example, we have common-sense reasoning, where questions like "do people wear sunglasses?" require it. We have mathematical reasoning, where the question is just a math problem that requires multi-step reasoning to answer. We also have other forms, like multi-hop reasoning, logical reasoning, abductive reasoning, and so on. So that is the definition of reasoning.

Now the next question is: can LLMs reason? There is a lot of argument about whether LLMs can truly reason, but what is important is how we define reasoning in LLMs and in AI. I think a very good definition comes from Denny Zhou, who founded and led the reasoning team at Google Brain, now part of Google DeepMind. In one of his talks, he defines reasoning in LLMs as the presence of intermediate tokens between the question and the final answer. For example, in a non-reasoning setup, the problem goes into the LLM and the final answer comes out, whereas in a reasoning LLM the problem still goes into the LLM, but the LLM first generates some intermediate tokens and only after that generates the final answer. So we are going from a quick single pass in non-reasoning LLMs to a sequence of intermediate steps followed by the final answer in reasoning LLMs. Now let's see a concrete example of these intermediate tokens.
Let's go to Hyperbolic. Hyperbolic is a platform that lets you use different open-source models and run inference, meaning you send questions and get answers. There are other providers too; I just happen to use Hyperbolic. Here there are a bunch of different models we can choose from. For example, GPT-OSS is OpenAI's open-weight model, and there are others like Qwen and Kimi K2. I'll first select a non-reasoning model and ask a difficult question: "I just learned AI and it sounds fun. How can I win a Turing Award in one year?" This Llama model is non-reasoning, meaning it may still break the task down into smaller pieces, but it's not really thinking or generating intermediate tokens. So let's see how it responds.
"Winning a Turing Award is an impressive achievement that requires a deep understanding...", and then it starts to answer. It's actually giving me a plan, which I don't think is possible, saying month one to three, month four to six, and so on. It stopped here only because I set a limit on the max tokens. But basically, as soon as I asked my question, it started to answer. So this is the final answer of this model.
Now let me use the exact same prompt and switch to a reasoning model like DeepSeek R1. I'll keep the same max tokens, paste the exact same prompt, and see what happens. Now this "reasoning" section appears. If I click on it, it shows all the intermediate tokens the model is thinking.
It goes something like: okay, the user just asked how to win a Turing Award. The user sounds excited about discovering AI, and looking deeper, they probably don't actually care about the award itself. These are the internal thoughts of the model, and only when it feels confident does it start to produce the final answer, which keeps going until it reaches the max tokens.
So this is what we refer to as reasoning models: these intermediate tokens. Sometimes they are visible to the user; DeepSeek R1 is an open model, so the exact thinking process is visible. Other models, like OpenAI's reasoning models, do not show the exact reasoning traces; they show a rewrite or a summary of them. But that's a concrete example,
and these reasoning models are better than non-reasoning LLMs. For example, this is a screenshot from a leaderboard, and we see that the highest-ranked LLMs are all reasoning LLMs; the first non-reasoning LLM is ranked fourth. The screenshot is from a website called LM Arena, a public platform that evaluates different LLMs by pairwise comparisons. Right now the top LLMs are Gemini 2.5 Pro, GPT-5 high, and Claude Opus 4.1 (there is a tie), and the top four or five models are all reasoning LLMs. This shows the power and importance of reasoning LLMs, and in the rest of this video we will learn different techniques to build them.
There are lots of different techniques to build a reasoning model, and they can be categorized into two big buckets: inference-time techniques and training-time techniques.
Inference-time techniques are those that keep the model unchanged. The parameters of the original LLM are not touched, so the LLM itself is still non-reasoning. But we add different modules and algorithms around it to make the entire process look like a reasoning process: generating intermediate tokens and then the final answer. We will go over different techniques, but basically the LLM stays frozen.

Training-time techniques, on the other hand, use a training algorithm and some data, typically reasoning data, to continue fine-tuning the non-reasoning LLM, and the outcome is a reasoning LLM. This reasoning LLM knows how to reason internally, so when we want to use it, we just pass it a prompt and the output looks like a reasoning output: intermediate tokens followed by an answer. In this case, if we build the reasoning LLM with a training-time technique, we no longer need to add additional modules at inference.
With that, let's start by going through different inference-time techniques.
All right, the first technique is prompting, and the idea is to use prompt engineering to make the LLM generate intermediate tokens. We go from the plain non-reasoning setup to one with an additional prompt-engineering module: the prompt first goes into this module, and its output goes into the LLM. Let's see some examples.
For example, we can apply few-shot chain-of-thought (CoT) prompting. Say we have this prompt, a math question: there are three red bags with five apples each, and so on. We pass it through few-shot CoT prompting, where few-shot means we show one or more examples with intermediate tokens, that is, reasoning traces, and hope the model follows the same pattern. What we see in red is the extra material included as part of the prompt: a worked example ending with "2 * 3 = 6. So, final answer is 6", then "now let's solve this", and then the original question. When we pass this to the non-reasoning LLM, it understands the example and tries to follow it. Instead of outputting only the final answer, it shows all the intermediate steps and follows the exact same structure or template; once it reaches the final solution, it uses the exact same format: "So, final answer is 29."
So that's prompt engineering with few-shot CoT, where we show one or more examples. We can also do zero-shot CoT: we just add "let's think step by step", and adding this single line pushes the model to actually think. Here we see the model break the task into smaller intermediate steps and solve them. There are more prompting techniques, but that's the high-level overview of using prompting to make a non-reasoning LLM reason.
The next technique is sequential revision. The idea is to refine the output of the LLM multiple times using the same LLM. Instead of passing the prompt to the LLM and taking the first output, we pass that output back to the same LLM with additional instructions like "evaluate this response and improve it", and we keep doing this for a fixed number of iterations. Once it's over, we pick the best response or simply the final output. This lets the model sequentially think and improve its answers.
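A minimal sketch of this loop, assuming `llm` is any function mapping a prompt string to a completion (a hypothetical stand-in, not a specific API), and using a fixed instruction as the revision prompt:

```python
def sequential_revision(llm, prompt: str, n_rounds: int = 3) -> str:
    """Repeatedly feed the model its own answer together with an
    instruction to critique and improve it; the LLM itself is unchanged."""
    answer = llm(prompt)
    for _ in range(n_rounds):
        revise = (f"Question: {prompt}\n"
                  f"Previous answer: {answer}\n"
                  "Evaluate this response and improve it.")
        answer = llm(revise)
    return answer  # final (or best, with an added selector) revision
```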
The next technique is best-of-n, and the idea is very simple: sample n times from the LLM in parallel and pick the best answer. For the same prompt, we sample n different solutions, and then an additional component, call it a selector, looks at all the responses and picks the best one, which becomes the final output.

The selector can be built in different ways. It can be a separate machine learning model, called a reward model, that scores the responses. It can be a simpler heuristic like majority voting: for math questions, it looks at the different responses and picks whichever final answer appears most frequently among the samples. Or it can be any other heuristic.
Here is a short example. The prompt is passed to the LLM (we also add "let's think step by step", which we saw earlier), and we sample three different solutions. The first solution has 29 as the final answer, the second 28, and the last 29. A majority-voting selector sees that most responses believe the correct answer is 29, so it picks one of those, in this case the first one.
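A majority-voting selector can be sketched like this; `sample_fn` is a stand-in for one stochastic sample from the LLM, and extracting the answer as the last token of a response is a simplifying assumption:

```python
from collections import Counter

def best_of_n(sample_fn, prompt: str, n: int = 3,
              extract=lambda s: s.split()[-1]) -> str:
    """Sample n solutions for the same prompt, extract each final
    answer, and return a solution carrying the most frequent answer."""
    solutions = [sample_fn(prompt) for _ in range(n)]
    answers = [extract(s) for s in solutions]
    majority, _count = Counter(answers).most_common(1)[0]
    # return the first sampled solution that landed on the majority answer
    return next(s for s, a in zip(solutions, answers) if a == majority)
```

Swapping the `Counter` for a learned reward model turns this same skeleton into reward-model selection.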
The last inference-time technique we'll cover is search against a verifier. Here the idea is to add a search algorithm and rely on the LLM plus a verifier to find the best output. I'll show some examples. First, what is a verifier? A verifier is simply a model trained to score solutions or partial solutions: whenever we have a sequence of thoughts, we pass it to the verifier and get a score showing how good this response is for this prompt. It is a separate machine learning model trained to do that, and the search algorithm relies on it to explore different ideas or solutions. The search algorithm can be best-of-n, which we saw previously, or more advanced searches like beam search, lookahead search, Monte Carlo tree search, and so on.
Here is a more concrete example, assuming we have access to a verifier. The best-of-n search algorithm just generates n different responses, all of them go into the verifier, we get scores, and best-of-n picks the best one. Beam search makes this more advanced and more efficient: it passes partial solutions to the verifier to get scores at each step and, depending on the scores, decides to prune or to explore certain branches further. There is also lookahead search, and there are more search algorithms beyond these.
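Beam search against a verifier can be sketched as follows; `expand` (proposes candidate next thoughts) and `verifier` (scores a partial solution) are hypothetical stand-ins for the sampling LLM and the trained verifier:

```python
def beam_search(expand, verifier, prompt: str,
                beam_width: int = 2, depth: int = 3) -> list:
    """Keep only the beam_width highest-scoring partial solutions at
    each step, pruning everything else."""
    beams = [[]]  # each beam is a list of thought strings
    for _ in range(depth):
        # grow every surviving beam by every proposed next thought
        candidates = [b + [t] for b in beams for t in expand(prompt, b)]
        # score partial solutions and keep only the top beam_width
        candidates.sort(key=lambda b: verifier(prompt, b), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # highest-scoring full chain of thoughts
```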
So that is search against a verifier. Before we go to training-time techniques, let's summarize what we discussed. In a non-reasoning setup, the input goes directly to the output. In prompting techniques like chain-of-thought prompting, we go from the input to a sequence of thoughts and then the final output. Combined with best-of-n, we sample multiple solutions in parallel, and depending on our selector, simple majority voting or some verifier, we pick the best response as the final output. In search against a verifier, we build a tree and use the verifier to explore more promising branches and prune less promising ones until we reach a state where we feel confident; that branch is our final solution. Now let's go to training-time techniques.
We already saw that the idea of training-time techniques is to continue training the base LLM to get a reasoning LLM. There are several ways to do this. The first is supervised fine-tuning (SFT): fine-tune the non-reasoning LLM on chain-of-thought data, which is simply pairs of problems with a sequence of thoughts and the final answer. If we train on this data, the resulting LLM is a reasoning model.
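One way such chain-of-thought data might be serialized into training strings for SFT; the `<think>` tags and field labels are illustrative assumptions, not a format any particular lab is confirmed to use:

```python
def to_sft_example(problem: str, thoughts: list, answer: str) -> str:
    """Serialize one (problem, chain of thought, answer) triple into a
    single training string the model learns to reproduce."""
    cot = "\n".join(thoughts)
    return (f"Problem: {problem}\n"
            f"<think>\n{cot}\n</think>\n"
            f"Answer: {answer}")

example = to_sft_example(
    "There are 3 bags with 5 apples each. How many apples in total?",
    ["Each bag has 5 apples.", "There are 3 bags, so 3 * 5 = 15."],
    "15",
)
```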
A popular example of this is STaR (Self-Taught Reasoner). They explain how they use the base model to collect CoT data in the format question, rationale, answer, and then fine-tune the LLM on this data to get a better reasoning model. They do this iteratively: they keep bootstrapping the model, getting better and better CoT data and consequently a better and better reasoning LLM. If you're interested, you can take a look at this paper.
So that is supervised fine-tuning: we simply continue training the LLM.
The other technique is reinforcement learning with a verifier. The idea is to let the reasoning LLM, after the SFT stage, practice: the LLM generates multiple thoughts for the same prompt, a verifier looks at these responses and scores them, and a reinforcement learning algorithm uses the scores to update the parameters of the LLM so that it becomes more encouraged to generate good answers and discouraged from generating answers the verifier believes are bad.
An interesting paper showing that this works really well is "Let's Verify Step by Step", published by OpenAI. If you want to learn more, you can read that paper.
For the verifier, there are typically two types. It can be an outcome-supervised reward model (ORM), which scores the entire solution: it looks at all the intermediate thoughts and the final answer together and produces one score. Or it can be a process-supervised reward model (PRM), which scores each individual thought separately. The paper above goes into detail on this, but these are the types of verifiers we can use in practice.
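The ORM/PRM distinction can be sketched as follows, with `reward_model` a hypothetical stand-in for a trained scoring model that takes a problem and some text and returns a score:

```python
def orm_score(reward_model, problem: str, thoughts: list, answer: str):
    """Outcome-supervised: one score for the whole solution."""
    return reward_model(problem, "\n".join(thoughts) + "\n" + answer)

def prm_scores(reward_model, problem: str, thoughts: list):
    """Process-supervised: one score per thought, each conditioned on
    the prefix of thoughts that came before it."""
    return [reward_model(problem, "\n".join(thoughts[: i + 1]))
            for i in range(len(thoughts))]
```

A PRM gives the RL algorithm much denser feedback, since it can tell exactly where a chain of thought went wrong rather than only that the final answer was bad.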
The next technique is self-correction. The idea is to train the LLM on data so that it learns to correct itself: after training, the LLM generates some output, then self-corrects it, then self-corrects again.
The key part is collecting the data. Again, we can do supervised fine-tuning: we collect revision data, which is a sequence of incorrect answers followed by the correct answer, and then fine-tune the LLM on this revision data.
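A sketch of turning incorrect attempts plus a correct answer into one revision-training string; the correction marker and labels are illustrative assumptions:

```python
def to_revision_example(problem: str, wrong_attempts: list,
                        correct: str) -> str:
    """Chain each incorrect attempt with a correction marker, ending
    in the correct answer, so the model learns the revise-then-fix
    pattern."""
    parts = [f"Problem: {problem}"]
    for attempt in wrong_attempts:
        parts.append(f"Attempt: {attempt}")
        parts.append("Wait, that is wrong. Let me try again.")
    parts.append(f"Attempt: {correct}")
    return "\n".join(parts)
```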
Additionally, we can do reinforcement learning for self-correction. A popular example is SCoRe, published by Google DeepMind, which uses a reinforcement learning algorithm to train the model to correct itself. So that is the third technique, self-correction.
The fourth technique, the most advanced one, is to internalize search: collect data and train the LLM on it so that the LLM itself knows how to explore different directions, reflect, backtrack, and generate more solutions. The key step here is data preparation: we can use inference-time techniques to sample different solutions, and once we have the data, training is identical to normal CoT SFT. After training, we get an LLM that is able to explore, reflect, and backtrack. Popular examples are Meta-CoT and journey learning. This screenshot, from the Meta-CoT paper, shows that after training on this kind of data, the model generates some solutions, then backtracks and reflects, then generates more solutions, and so on. If you want the details, you can go to that paper.
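One way such search traces might be linearized into training text, under the assumption that exploration and backtracking events are recorded as a flat list of events; the markers are illustrative, not from any paper:

```python
def linearize_search_trace(problem: str, trace: list) -> str:
    """Flatten a search-tree traversal into one training string so the
    model learns to explore and backtrack inside its own output.
    `trace` is a list of ("try" | "backtrack", text) events."""
    parts = [f"Problem: {problem}"]
    for kind, text in trace:
        if kind == "try":
            parts.append(f"Try: {text}")
        else:
            parts.append(f"This path fails ({text}). Backtrack.")
    return "\n".join(parts)
```

Fine-tuning on strings like these is what lets the trained model emit its own "wait, backtrack" moves at inference time with no external search loop.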
Finally, having discussed the different inference-time and training-time techniques, let's see what we know about GPT-5. When OpenAI released GPT-5, they also released the GPT-5 system card, a very long article of about 50 pages that shares a lot of information, most of it around evaluation, safety mechanisms, and how good the model is. They do not share much about the number of parameters or the architecture they've been using. What I'm sharing here is what we know from this system card and some other reliable sources.
We know a couple of things about GPT-5. One is that the GPT-5 unified system consists of two models: one named GPT-5 main and the other GPT-5 thinking. GPT-5 main is more efficient, so it's less expensive, but it's not a reasoning model. GPT-5 thinking is a reasoning model; it is trained for reasoning. How it's trained is not publicly known, but it's very likely they use some of the techniques we've already seen: probably an SFT stage, reinforcement learning with some verifier, and internalized search.
That's the first thing we know. The second is that there is a fast router placed right before these models. Whenever there is a prompt, it first goes to this router, and the router decides which model to use. For certain prompts we do not need thinking models; the prompt is simply very easy. For example, if we ask for the capital of some country, GPT-5 main is very likely sufficient, and since it's less expensive, it's preferred. So the router decides where the prompt should go. This is the second thing already explained in the system card, and this router is very fast and not really costly.
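As a toy illustration only (the real router is a learned model whose features and training are not public), routing could look like a cheap classifier sitting in front of the two models:

```python
def needs_reasoning(prompt: str) -> bool:
    """Toy heuristic stand-in for a learned router: flag long prompts
    or ones with math/code cues as needing the thinking model."""
    cues = ("prove", "derive", "step by step", "debug", "optimize")
    return (len(prompt.split()) > 40
            or any(c in prompt.lower() for c in cues))

def route(prompt: str) -> str:
    """Return which (hypothetically named) model should answer."""
    return "gpt5-thinking" if needs_reasoning(prompt) else "gpt5-main"
```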
The third thing we know about GPT-5 is that they are shifting their focus from hard refusals to safe completions. Before GPT-5, the models used a kind of intent classification on the input prompt to decide whether it was safe to answer: a binary classification of user intent, and if at that stage they believed the user was not asking a safe question, they would simply refuse. In GPT-5, they switch the focus to the output: regardless of whether the input prompt is safe or not, GPT-5 is optimized to produce answers that are safe. If you're interested, they also have a detailed paper about safe completions and this shift, so you can refer to that.
Finally, there is also a thinking pro mode in GPT-5. In this mode, they are essentially enabling test-time compute. It's still the same setup with two models; there is no additional pro model. The difference is that they use inference-time techniques like the ones we saw, such as self-consistency, sequential revision, prompting, and Monte Carlo tree search, to explore various solutions and directions for a prompt at the same time, and then show the best one to the user. To get a better sense, if I go to ChatGPT, we have these modes: Instant means we are asking GPT-5 to use the non-reasoning model; Thinking means we are asking it, regardless of whether our prompt is simple or not, to use the thinking model; Auto means we rely on the router to decide; and Pro simply means we ask the model to enable test-time compute and use the inference-time techniques we discussed.
All right, let's conclude the video. We learned what reasoning LLMs are, explored various inference-time and training-time techniques to build reasoning LLMs, and saw what the GPT-5 unified system might look like based on the released system card.
The video discusses reasoning LLMs, defining them as models that generate intermediate tokens before a final answer, in contrast to non-reasoning LLMs. It highlights their superior performance on various benchmarks, such as AIME and PhD-level science questions, and traces their emergence from 2024 to 2025 with models like OpenAI's o1 and GPT-5. The presentation then introduces various techniques for building reasoning LLMs, categorized into two main groups: inference-time techniques (e.g., prompting, sequential revision, best-of-n, search against a verifier) and training-time techniques (e.g., supervised fine-tuning, reinforcement learning with a verifier, self-correction, internalizing search). Finally, it details the architecture of the GPT-5 unified system, which integrates both an efficient non-reasoning model and a powerful reasoning model, utilizing a fast router to direct queries, focusing on safe completions, and featuring a "thinking pro mode" that leverages advanced inference-time reasoning techniques.