
World Models explained in 10min..


Transcript


If I flip this coin, you know that the odds are going to be 50/50. And you know that as a fact, without using fancy notation, because you grew up observing the laws of physics over time. But unlike us, large language models don't have this luxury, because LLMs don't have a simulated environment to test out their theories. The closest thing LLMs have to that is reasoning models that use chain of thought. So therein lies the question: are large language models inherently flawed in their ability to grasp the physical world? And how exactly do world models overcome this? Welcome to Kilbright's Code, where every second counts. Quick shout out to ByCloud; more on him later.

On a recent podcast with Dario Amodei, he talked about the current approach of pre-training in large language models. Unlike humans, LLMs are trained on trillions and trillions of tokens over months of time. In comparison, humans not only develop a lot slower, they also experience the world in modalities other than trillions of text tokens. For example, we know that the laws of physics govern the physical world. Our senses observe and act in the physical world, and we map our understanding of it in our own brains as we embody it. But pure large language models are trained only on text, which is the highest abstraction that describes the physical world. So if an LLM has never really experienced a coin flip, how does it really know that coin flips, by the law of large numbers, will eventually converge to 50/50 without ever observing them in the physical world?
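The law-of-large-numbers intuition the narrator appeals to is easy to check directly. As a quick illustration (not from the video), here is a simulation showing the running fraction of heads tightening toward 0.5 as flips accumulate:

```python
import random

def running_heads_fraction(n_flips: int, seed: int = 0) -> list[float]:
    """Flip a fair coin n_flips times, recording the running fraction of heads."""
    rng = random.Random(seed)
    heads = 0
    fractions = []
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5  # True counts as 1 head
        fractions.append(heads / i)
    return fractions

fractions = running_heads_fraction(100_000)
# By the law of large numbers, the estimate tightens as the sample grows.
print(f"after 100     flips: {fractions[99]:.3f}")
print(f"after 100,000 flips: {fractions[-1]:.3f}")
```

Humans get this convergence "for free" from embodied experience; a text-only model only ever sees descriptions of it.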

Around 2018, there was a resurgence of what's called the world model. World models approach things very differently from LLMs in how they model their understanding. What if, instead of feeding an AI model streams and streams of text tokens, we trained it to essentially simulate the physical world in its own brain, building an internal representation that best matches the physical world? As you can see, pulling off this kind of thing requires a thorough understanding of the laws of physics, and of cause and effect, so that the model best mimics the physical world internally. So how exactly does a world model do this? The original paper by David Ha and Jürgen Schmidhuber uses three main components. You start with an environment, and you introduce a vision model that essentially observes this environment.

The vision model's underlying architecture uses a variational autoencoder that is trained to compress what it observes visually into a lower-dimensional latent space. What this compression does is train the vision model to extract only the important features and throw out the rest. Now that we've set up a vision model that observes the physical world, we need a model that can process this information. They use an MDN-RNN (mixture density network RNN) as the underlying architecture. Since recurrent neural networks are expressive in being able to keep track of their previous hidden states, this architecture can store everything it saw in the past, which allows the model to predict based both on what it saw before and on what it's currently seeing. The MDN-RNN works very much like the app Sketch-RNN: once I start drawing a picture of my dog Logan, it takes over and actually finishes my drawing based on its prediction of what the rest of the drawing would look like. The final piece is a controller model, which samples from the MDN-RNN's output as well as the original output from the vision model.

The sole purpose of the controller model is just to take an action, like passing butter, moving left and right, or picking up objects. As you can see, these three components of a world model learn to interact with the physical world and map out what the model thinks the physical world looks like after seeing the cause and effect of the real world.
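Putting the three pieces together, the loop described above can be sketched in a few lines. This is a toy, dimension-free illustration of the V (vision) → M (memory) → C (controller) dataflow from Ha and Schmidhuber's paper, not their actual code: the tiny networks here are invented stand-ins (fixed weights, scalar latents) so the structure stays visible.

```python
import math
import random

rng = random.Random(42)

def vae_encode(observation: list[float]) -> float:
    """V: compress a raw observation into a 1-D latent z (toy stand-in for a VAE encoder)."""
    mu = sum(observation) / len(observation)   # pretend "mean head"
    log_var = -2.0                             # pretend "log-variance head"
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps  # reparameterization trick

def rnn_step(z: float, h: float) -> float:
    """M: update the recurrent hidden state from the current latent (toy RNN cell)."""
    return math.tanh(0.8 * z + 0.5 * h)

def controller(z: float, h: float) -> float:
    """C: a deliberately tiny (here linear) policy over [z, h], as in the paper."""
    return 0.3 * z - 0.7 * h

h = 0.0
for t in range(5):
    obs = [rng.random() for _ in range(8)]  # a frame from the environment
    z = vae_encode(obs)                     # V compresses the frame
    action = controller(z, h)               # C acts on latent + memory
    h = rnn_step(z, h)                      # M updates its memory
    print(f"t={t} action={action:+.3f}")
```

In the actual paper, z is a vector, M is an LSTM emitting mixture-density parameters, and C is kept deliberately small so it can be trained with evolution strategies.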

And eventually, its own representation of the world becomes good enough that you can just sever the actual environment.

What this allows is for the agent to be trained to do whatever you want solely inside the simulation produced by the world model's inner mapping of the real world. And the assertion being made here is that this kind of modeling is much closer to how humans think, and gets us much closer to AGI than LLMs do. The initial results were pretty promising: the world model was able to learn how to drive on a randomly generated track, staying on the road, while using fewer than 5 million parameters in total when we sum up all its parts.
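Severing the environment means the memory model itself becomes the simulator: instead of stepping a real environment, the agent steps the learned dynamics. Here is a hypothetical sketch of that "training inside the dream" loop; `predict_next_latent`, `dream_rollout`, and the reward are invented placeholders, not the paper's actual functions.

```python
import math
import random

rng = random.Random(7)

def predict_next_latent(z: float, h: float, action: float) -> tuple[float, float]:
    """Toy stand-in for the learned dynamics: sample the next latent
    instead of querying the real environment."""
    h_next = math.tanh(0.8 * z + 0.5 * h + 0.2 * action)
    z_next = h_next + rng.gauss(0.0, 0.1)  # mixture sampling collapsed to one Gaussian
    return z_next, h_next

def dream_rollout(policy, steps: int = 50) -> float:
    """Evaluate a policy entirely inside the learned model; no real environment needed."""
    z, h, total_reward = 0.0, 0.0, 0.0
    for _ in range(steps):
        action = policy(z, h)
        z, h = predict_next_latent(z, h, action)
        total_reward += -abs(z)  # placeholder reward: stay near the latent "road center"
    return total_reward

# Compare two candidate policies without ever touching the real world.
lazy = lambda z, h: 0.0
corrective = lambda z, h: -0.5 * z
print(dream_rollout(lazy), dream_rollout(corrective))
```

The catch, noted in the original paper, is that an agent can learn to exploit flaws in its own dream rather than skills that transfer back to reality.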

So the real question here is: can world models actually scale to human-level capabilities, or even beyond? And here's that quick shout-out to ByCloud. If you want to learn more about the theory behind AI research, check out ByCloud's Intuitive AI, which is full of learning materials. You can start at the beginner level and work your way from how tokens work to embeddings, encodings, and the attention mechanism that powers most large language models today. He really mixes in good illustrations while giving an easy-to-read narrative on how the technology actually works, intuitively.

You don't need to have a deep math background. It reads like a novel: you can learn sequentially from the beginning, or just use it as a supplemental tool for the areas you're curious about. He goes through different pre-training and post-training mechanisms here, as well as more advanced concepts like LoRA. ByCloud is giving away a 40% discount on the yearly plan using the coupon code linked in the description below.

One of the biggest reasons the large language models we use today are so popular is that LLMs scaled quite beautifully. While world models are still domain-specific to certain jobs, most LLMs are what's called foundation models, which basically means we can use a generic LLM like GPT-5.2 or Opus 4.6 for many downstream tasks beyond simple chat: deep research, software development, managing our computers, and more. But ever since the resurgence of world models in 2018, we've also had many iterations and different flavors of world models.

One of the biggest proponents of world models is Yann LeCun, who worked on world models at Meta, where he contributed to JEPA-based models. He later left Meta to start his own company, AMI (Advanced Machine Intelligence), which is seeking up to a $5 billion valuation. Yann has been criticizing LLMs for quite a while now, saying that LLMs simply don't understand the physical world beyond their autoregressive nature of predicting token after token. But we know that language actually contains more than Yann probably gives it credit for. Languages not only represent the physical world in words; they also contain grammars that dictate meaning, figures of speech that provide an abstract understanding of the physical world, and individual units of words (nouns, adjectives, adverbs) that all reveal facts about the physical world.

Meanwhile, the gap between pure LLMs and world models has also been blurring since around 2023, as models like GPT-4 from OpenAI and Gemini 1 from Google introduced what's called multimodality, where vision language models that can also perceive images using cross attention help LLMs "perceive," so to speak. And conversely, we also have VLA, or vision-language-action models, a type of world model that uses vision transformers with LLMs to create action tokens. This kind of setup is what powers Neo, the humanoid robot released back in October 2025 that went viral at the time.

But many people still criticize multimodal LLMs because, well, at the core they are still LLMs that lack spatial awareness of the physical world. Fei-Fei Li set out to demonstrate spatial intelligence through her startup World Labs, founded back in September 2024, which raised more than $230 million. World Labs released a product called Marble that essentially creates Gaussian splats, producing millions and millions of these particles to interact with, and they are quite beautiful to work with. So, just like what we saw in the original paper, what you're seeing here is a world model's representation of the actual environment, mapped in its own model in a way that we're able to see here.
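For intuition on what a Gaussian splat actually is (a toy illustration, not World Labs' Marble pipeline): each "particle" is a Gaussian with a position, a spread, a color, and an opacity, and a pixel's color comes from alpha-compositing the splats that cover it, sorted front to back. A minimal one-channel, one-dimensional version:

```python
import math
from dataclasses import dataclass

@dataclass
class Splat:
    center: float   # 1-D position on the image axis (real splats are 3-D Gaussians)
    sigma: float    # spread of the Gaussian footprint
    color: float    # single grey channel for simplicity
    opacity: float  # peak alpha of the splat
    depth: float    # distance from the camera, used for sorting

def render_pixel(x: float, splats: list[Splat]) -> float:
    """Alpha-composite splats front-to-back at pixel x, as in Gaussian splatting."""
    color, transmittance = 0.0, 1.0
    for s in sorted(splats, key=lambda s: s.depth):  # nearest splat first
        alpha = s.opacity * math.exp(-0.5 * ((x - s.center) / s.sigma) ** 2)
        color += transmittance * alpha * s.color
        transmittance *= 1.0 - alpha                 # light blocked by this splat
    return color

scene = [
    Splat(center=0.0, sigma=0.5, color=1.0, opacity=0.9, depth=1.0),  # bright, near
    Splat(center=0.3, sigma=1.0, color=0.2, opacity=0.8, depth=2.0),  # dim, behind
]
print(render_pixel(0.0, scene))
```

Scale this up to millions of 3-D Gaussians rendered per frame and you get the particle clouds Marble produces.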

But what's different about Marble in comparison to traditional world models is the absence of controllers that actually grapple with the physics of the generated world. This kind of sentiment is expressed by Pim de Witte, who founded General Intuition in October 2025, which is trying to create a closer representation of a world model that can actually interact with games and simulations. Google has also been a huge contributor in this space, when we look at SIMA in March 2024, SIMA 2 in November 2025, and of course their most recent Genie 3, which creates a hyperrealistic world we can move around in. As you can see, a world model's depiction of the physical world is what allows us to generate AI videos on Sora, train cars in simulation, and align robots in factories. Nvidia also has a hand in this, providing an open-source platform called Cosmos, which is a world foundation model. Similar to how foundation models in the LLM world allow a generic model to be used across many downstream tasks, NVIDIA's Cosmos provides tools for developers further upstream, where we can use three pre-trained models to cover various use cases, mostly around data augmentation and data generation for downstream training, such as custom post-training for autonomous vehicles, robots, and video agents.

Now that we've covered the landscape of world model architectures and the different approaches taken by these companies, we're still left with a rather philosophical question: do LLMs really demonstrate thinking and understanding like humans do? Or does it even matter whether they actually think like humans? Do you think LLMs and world models are mutually exclusive, or do they just solve different problems? What is the best way to augment intelligence artificially? What do you think?

Interactive Summary

Large Language Models (LLMs) inherently struggle to grasp the physical world because, unlike humans who learn through observation, they are trained solely on abstract text tokens. This limitation means LLMs never experience fundamental physical laws, such as a coin flip converging to 50/50. To address this, "world models" re-emerged around 2018, aiming to simulate the physical world internally. These models typically consist of three components: a vision model (a variational autoencoder) that compresses visual observations, an MDN-RNN that processes and remembers past states, and a controller model that takes actions. This architecture lets world models learn cause and effect, map a representation of the physical world, and train agents within a simulated environment, potentially leading closer to Artificial General Intelligence (AGI). While LLMs excel in broad applicability as "foundation models," world models have historically been more domain-specific. However, the lines are blurring with the advent of multimodal LLMs (like GPT-4 and Gemini 1) that can perceive images, and vision-language-action (VLA) models that combine vision transformers with LLMs to generate actions. Companies like World Labs, General Intuition, Google (SIMA, Genie), and Nvidia (Cosmos) are actively developing advanced world models and world foundation models to create hyperrealistic simulations and train AI agents for real-world applications, from generating AI videos to aligning robots.
