World Models explained in 10 min
If I flip this coin, you know that the
odds are going to be 50/50. And you know
that as a fact without using fancy
notations because you grew up observing
the laws of physics over time. But
unlike us, large language models don't
have this luxury, because LLMs don't have
a simulated environment in which to test
their theories. The closest thing LLMs
have to that is reasoning models, which
use chain of thought to think through a
problem. So therein lies the question:
are large language models inherently
flawed in their ability to grasp the
physical world? And how exactly do world
models overcome this? Welcome to
Kilbright's Code, where every second
counts. Quick shout out to ByCloud; more
on him later. On a recent podcast,
Dario Amodei talked about the current
approach of pre-training in large
language models. Unlike humans, LLMs are
trained using trillions and trillions of
tokens over a span of months. In
comparison, humans not only develop a
lot more slowly, but they also experience
the world in modalities other than
trillions of text tokens. For example,
we know that the laws of physics govern
the physical world. Our senses observe
and act in the physical world, and we map
our understanding of the physical world
into our own brains as we embody it. But
pure large language models are trained
only on text, which is the highest
abstraction describing the physical
world. So if an LLM has never really
experienced a coin flip, how does it
really know that coin flips, by the law
of large numbers, will eventually
converge to 50/50, without ever
observing them in the physical world?
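To see that convergence concretely, here's a tiny simulation of the law of large numbers in action (the flip count and random seed are arbitrary choices):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Flip a fair coin many times; the empirical frequency of heads
# drifts toward 1/2 as the number of flips grows.
flips = 100_000
heads = sum(random.random() < 0.5 for _ in range(flips))
frequency = heads / flips
print(round(frequency, 3))  # close to 0.5
```

An LLM can only read statements like this about coins; it never gets to run the experiment.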
Around 2018, there was a resurgence of
what are called world models. And world
models approach things very differently
from LLMs in how they model their
understanding. What if, instead of
feeding an AI model streams and streams
of text tokens, we trained it to
essentially simulate the physical world
in its own brain, in a way that best
represents the physical world? As you
can see, pulling this kind of thing off
requires a thorough understanding of the
laws of physics and of cause and effect,
mimicking the physical world in its own
model. So, how exactly does a world
model do this? The original paper by
David Ha uses three main components. You
start with an environment and you
introduce a vision model that
essentially observe this environment.
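To make that observation step concrete, here's a toy stand-in for the vision model: it compresses a flattened frame into a small latent vector using the VAE's reparameterization trick. The weights here are random and the sizes are illustrative; a real VAE learns them by reconstructing its inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a flattened 64x64 frame squeezed into 32 latent dims.
FRAME_DIM, Z_DIM = 64 * 64, 32
W_mu = rng.standard_normal((Z_DIM, FRAME_DIM)) * 0.01      # untrained weights
W_logvar = rng.standard_normal((Z_DIM, FRAME_DIM)) * 0.01

def encode(frame):
    """VAE-style encoding: predict a mean and (log) variance for the
    latent, then sample z = mu + sigma * eps (the reparameterization trick)."""
    mu = W_mu @ frame
    logvar = W_logvar @ frame
    eps = rng.standard_normal(Z_DIM)
    return mu + np.exp(0.5 * logvar) * eps

frame = rng.random(FRAME_DIM)  # stand-in for one observed frame
z = encode(frame)
print(z.shape)  # (32,)
```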
The vision model's underlying
architecture is a variational
autoencoder (VAE), trained to
essentially compress what it observes
visually into a lower-dimensional
latent space. What this compression does
is train the vision model to extract
only the important features and throw
out the rest. So now that we've set up
our vision model that observes the
physical world, we need a model that can
process this information. They use an
MDN-RNN, a mixture density network on
top of a recurrent neural network, as
the underlying architecture. And since
recurrent neural networks are expressive
in carrying a summary of all their
previous hidden states, this architecture
lets the model predict based on what it
saw before as well as what it's
currently seeing. The MDN-RNN works very
similarly to the app Sketch-RNN: once I
start drawing a picture of my dog Logan,
it takes over and finishes my drawing
based on its prediction of what the rest
of the drawing would look like. The
final piece is a controller model, which
samples from the MDN-RNN's output as
well as the original output from the
vision model. And the sole purpose of
the controller model is just to make an
action, like passing butter, moving left
and right, or picking up objects. As you
can see,
these three components in a world model
learn to interact with the physical
world and map out what the model thinks
the physical world looks like, after
observing the cause and effect of the
real world.
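The controller piece is worth seeing in code, because in the original paper it really is just a single linear layer over the concatenated latent vector and RNN hidden state, small enough to train with an evolution strategy. A minimal sketch with illustrative, untrained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes roughly follow the car-racing setup in the World Models paper:
# a 32-dim VAE latent z, a 256-dim RNN hidden state h, 3 action dims.
Z_DIM, H_DIM, ACTION_DIM = 32, 256, 3

W = rng.standard_normal((ACTION_DIM, Z_DIM + H_DIM)) * 0.01  # untrained
b = np.zeros(ACTION_DIM)

def controller(z, h):
    """Map [latent observation, RNN memory] to an action in [-1, 1],
    e.g. steering / gas / brake."""
    return np.tanh(W @ np.concatenate([z, h]) + b)

z = rng.standard_normal(Z_DIM)  # stand-in for a VAE-encoded frame
h = np.zeros(H_DIM)             # stand-in for the MDN-RNN hidden state
action = controller(z, h)
print(action.shape)  # (3,)
```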
And eventually its own representation of
the world will be good enough that you
can just sever the connection to the
actual environment.
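Severing the environment looks something like this in code: once the dynamics model can predict the next latent state, a rollout never has to call the real environment at all. Here a fixed random linear map stands in for the trained MDN-RNN (which would really sample the next latent from a mixture of Gaussians):

```python
import numpy as np

rng = np.random.default_rng(1)
Z_DIM, ACTION_DIM = 32, 3

# Stand-in dynamics: next latent as a function of current latent + action.
A = rng.standard_normal((Z_DIM, Z_DIM)) * 0.05
B = rng.standard_normal((Z_DIM, ACTION_DIM)) * 0.05

def dream_step(z, action):
    """Predict the next latent state entirely inside the world model."""
    return np.tanh(A @ z + B @ action)

z = rng.standard_normal(Z_DIM)
trajectory = [z]
for _ in range(20):                          # 20 imagined timesteps
    action = rng.uniform(-1, 1, ACTION_DIM)  # placeholder policy
    z = dream_step(z, action)
    trajectory.append(z)
print(len(trajectory))  # 21
```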
What this allows is for the agent to be
trained solely inside the simulation
provided by the world model's inner
mapping of the real world. And the
assertion being made here is that this
kind of modeling is much closer to how
humans think and gets us much closer to
AGI than LLMs do. The initial results
were pretty promising: the world model
learned to drive on this randomly
generated track, staying on the road,
while using fewer than 5 million
parameters in total when we sum up all
its parts.
So the real question here is: can world
models actually scale to human-level
capabilities, or even beyond? And here's a
quick shout out to ByCloud. If you want
to learn more about the theory behind AI
research, check out ByCloud's Intuitive
AI, which is full of learning materials.
You can start from beginner level and
work your way up, from how tokens work
to embeddings, encodings, and the
attention mechanisms that power most
large language models today. He really
mixes in good illustrations while giving
an easy-to-read narrative on how the
technology actually works intuitively.
You don't need a deep math background;
it reads like a novel, where you can
learn sequentially from the beginning or
just use it as a supplemental tool for
areas you're curious about. He goes
through different pre-training and
post-training mechanisms here, as well
as more advanced concepts like LoRA.
ByCloud is giving a 40% discount on the
yearly plan with the coupon code linked
in the description below. One of the biggest
reasons why large language models that
we use today are so popular is because
LLMs scaled quite beautifully. While
world models are still domain-specific
to certain jobs, most LLMs are what are
called foundation models, which
basically means that we can use a
generic LLM like GPT 5.2 or Opus 4.6 to
do many downstream tasks beyond simple
chat: running deep research, doing
software development, managing our
computers, and more. But ever since the
resurgence of
world models in 2018, we also had many
iterations and different flavors of
world models too. One of the biggest
proponents of world models is Yann LeCun,
who worked on world models at Meta,
where he contributed to JEPA-based
models. He later left Meta to start his
own company called AMI, or Advanced
Machine Intelligence, which is seeking
up to a $5 billion valuation. And LeCun
has been disparaging LLMs for quite a
while now, saying that LLMs simply don't
understand the physical world beyond
their autoregressive nature of predicting
token after token. But we know that
language actually contains more than
LeCun probably gives it credit for.
Language not only represents the
physical world in words; it also
contains grammars that dictate meaning,
figures of speech that provide abstract
understanding of the physical world, and
individual units of words like nouns,
adjectives, and adverbs that all reveal
facts about the physical world. But
meanwhile, the gap between pure LLMs and
world models has also been blurring
since around 2023, as models like GPT-4
from OpenAI and Gemini 1 from Google
introduced what's called multimodality,
where vision language models that can
also perceive images using cross
attention help LLMs perceive, so to
speak. And conversely, we also have VLA,
or vision-language-action models, a type
of world model that uses vision
transformers with LLMs to create action
tokens. And this kind of
setup is what powers Neo, the humanoid
that was released back in October 2025
and went viral at the time. But many
people still criticize multimodal LLMs
because, at their core, they are still
LLMs that lack spatial awareness of the
physical world. And Fei-Fei Li set out
to demonstrate spatial intelligence
through her startup World Labs, founded
back in September 2024, which raised
more than $230 million. World Labs
released a product called Marble that
essentially creates Gaussian splats,
producing millions and millions of these
particles to interact with, which are
quite beautiful to work with. So,
just like what we saw in the original
paper, what you're seeing here is a
world model's representation of the
actual environment, mapped into its own
model for us to see.
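For intuition on what a splat is, here's a toy 2D version: each splat is a Gaussian blob contributing density to an image, and a scene is the sum of many of them. (Real 3D Gaussian splatting adds per-splat color, opacity, anisotropic covariance, and depth-ordered blending; everything here is a simplified stand-in with arbitrary sizes.)

```python
import numpy as np

# A 64x64 image grid; each splat deposits a Gaussian bump onto it.
H = W = 64
ys, xs = np.mgrid[0:H, 0:W]

def splat(cx, cy, sigma, weight):
    """Density contribution of one isotropic 2D Gaussian splat."""
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return weight * np.exp(-d2 / (2 * sigma ** 2))

# Compose a "scene" from two splats and clamp to a displayable range.
image = np.clip(splat(20, 20, 5.0, 1.0) + splat(44, 30, 8.0, 0.5), 0.0, 1.0)
print(image.shape)  # (64, 64)
```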
But what's different about Marble,
compared to traditional world models, is
the absence of a controller that
actually grapples with the physics of
the generated world. This kind of
sentiment is expressed by Pim de Witte,
who founded General Intuition in October
2025, a startup trying to create a
closer realization of a world model that
can actually interact with games and
simulations. Google has
also been a huge contributor in this
space when we look at SIMA in March 2024
and SIMA 2 in November 2025, and of
course their most recent Genie 3, which
creates a hyperrealistic world that we
can move around in. As you can see, the
world model's depiction of the physical
world is what allows us to generate AI
videos on Sora, train cars in
simulation, and align robots in
factories. Nvidia also has a hand in
this by providing an open-source
platform called Cosmos, which is a world
foundation model. Similar to how
foundation models in the LLM world let a
generic model be used across many
downstream tasks, NVIDIA's Cosmos
provides tools for developers further
upstream: three pre-trained models
covering various use cases, mostly
around data augmentation and data
generation for downstream training, such
as custom post-training for autonomous
vehicles, robots, and video agents. Now
that we've
covered the landscape of world model
architectures and the different
approaches taken by various companies,
we are still left with this rather
philosophical question: do LLMs really
demonstrate thinking and understanding
like humans do? Or does it even matter
whether they actually think like humans?
Do you think that LLMs and world models
are mutually exclusive, or do they just
solve different problems? What is the
best way to augment intelligence
artificially? What do you think?