Code World Model: Building World Models for Computation

Code World Model: Building World Models for Computation – Jacob Kahn, FAIR Meta

Transcript

0:13

[music]

0:21

Great to be here, everyone. I'm Jacob

0:22

Kahn. I'm a researcher at FAIR at

0:24

Meta. I'm going to talk today about the

0:26

Code World Model, which I'll abbreviate

0:28

as CWM and what it means to build world

0:31

models for computation.

0:33

This is work done by an incredible team

0:35

at FAIR that extends all over the world

0:38

and I'm very grateful to be

0:39

collaborating with them.

0:41

So what's our goal with CWM? Our primary

0:44

goal is to build models that reason,

0:46

plan and make decisions. And we start

0:48

with code because it's an interesting

0:50

sandbox in which to think about

0:52

reasoning, right? It's constrained;

0:54

there are certain rules with code. And so

0:56

our goal is to predict future

0:58

observations given past observations and

1:01

actions. That's maybe what it means to

1:02

build a world model in some sense. And

1:04

we want to do this because we can learn

1:06

good representations of things if we

1:08

learn some sort of mapping between

1:09

observations and the future. And

1:12

eventually that leads us to planning and

1:13

reasoning and we can consider different

1:15

actions and see if we like the results

1:17

for decisions we make. I think there's a

1:20

bit of a false dichotomy right now

1:21

between world models and large language

1:23

models. World models are just a

1:25

parameterization of a problem as I'll

1:26

discuss. LLMs are a way to view and

1:30

use that parameterization, and I'll

1:32

dive into more of what that means in a

1:34

bit.

1:36

So, one of the fundamental questions

1:38

we're asking with CWM is what does it

1:40

mean to model code? Is code literally

1:43

the syntax in your editor or is it

1:46

something else?

1:48

And if you think about it, all a model

1:50

sees that is operating on code is just

1:52

syntax, right? We tokenize the input. It

1:54

goes into the model and we predict more

1:57

code as the output. This is the starting

1:59

and ending point for an analysis of a

2:02

program with a token-based autoregressive

2:04

model. It's just the syntax. But what if

2:06

we instead modeled execution more

2:08

explicitly? And what if we created

2:10

maybe a systematic natural language

2:13

description of programs and neural

2:15

models could ingest a more structured

2:16

representation of what it means to

2:18

execute code and then maybe we could

2:20

emit this representation autoregressively

2:22

too.

2:25

So that's one of our goals for CWM. We

2:27

want to predict program execution

2:29

because we believe it might lead us to

2:31

better model things about code:

2:33

writing code, analyzing code, and

2:35

beyond. And so what we're going to

2:37

implicitly do is predict a transition

2:39

function of program states as we go

2:41

about executing.

2:44

So this is what execution tracing might

2:46

look like in action. We have a program.

2:49

We're going to count the number of r's

2:50

in strawberry. And at each step maybe

2:53

we'll have some frame separator which

2:55

will denote distinct lines of execution.

2:58

And we'll actually explicitly have local

3:01

variables. We could introduce things

3:03

about memory in that trace and that will

3:06

delineate line by line what's happening

3:08

as our program executes. And this is

3:10

something we could essentially feed to a

3:12

model because each line of our execution

3:14

trace maps to a corresponding line in a

3:16

program.
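
As a rough illustration of the line-by-line tracing described here, the sketch below records the local variables before each executed line of a small function that counts the r's in "strawberry". The `<frame>` separator string, the JSON layout, and the helper names `count_r` and `trace_lines` are assumptions made for this example; they are not the trace format CWM was actually trained on.

```python
import json
import sys


def count_r(word):
    n = 0
    for ch in word:
        if ch == "r":
            n += 1
    return n


def trace_lines(func, *args):
    """Run func(*args), recording (relative line, locals) before each line executes."""
    frames = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow the function we were asked to trace.
        if frame.f_code is not code:
            return None
        if event == "line":
            rel_line = frame.f_lineno - code.co_firstlineno
            frames.append((rel_line, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, frames


result, frames = trace_lines(count_r, "strawberry")
for rel_line, local_vars in frames:
    # One frame per executed line: a separator, the line, and the locals.
    print(f"<frame> line {rel_line} {json.dumps(local_vars)}")
print("result:", result)
```

Each printed frame maps back to one line of the source function, which is the property that makes a trace like this usable as training text.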

3:18

We don't have to stop at functions. We

3:20

could think about entire repository

3:22

level execution traces. We could think

3:23

about distributed system level execution

3:25

traces. We could think about modeling

3:27

execution for code contest solutions or

3:30

something more complex, programs with

3:31

high complexity. We could also then

3:34

transition that into, as I said, natural

3:35

language tracing. And we'll see what

3:37

that means in a moment.

3:39

But what does it actually look like to

3:41

model that transition function at a high

3:42

level as we start to parameterize the

3:44

problem? Well, we have programs or we

3:46

have data. That's some state. We have an

3:49

action, executing the next line, and that

3:53

results in the next state. And so both

3:54

the program execution and the

3:57

model's decision-making in an agentic sense

4:00

can be modeled as a transition

4:01

function.
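
A minimal sketch of that parameterization, assuming a state is just a dictionary of variables and an action is one line of Python source. The names `step` and `rollout` are hypothetical and chosen only for this example; here the transition is computed by really executing the line, whereas a world model would predict the next state instead.

```python
from typing import Dict, List

State = Dict[str, object]


def step(state: State, action: str) -> State:
    """Transition function: apply one statement (the action) to a copy of the state."""
    next_state = dict(state)
    exec(action, {}, next_state)  # variables are read from and written to next_state
    return next_state


def rollout(state: State, actions: List[str]) -> List[State]:
    """Apply the actions in order and keep every intermediate state."""
    states = [state]
    for action in actions:
        states.append(step(states[-1], action))
    return states


program = ["x = 2", "y = x * 3", "x = x + y"]
states = rollout({}, program)
for action, after in zip(program, states[1:]):
    print(f"{action!r:14} -> {after}")
```

The list of states is the sequence a model would be asked to produce one step at a time.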

4:03

So where are we with this broader approach,

4:06

world modeling? We could say that in an

4:09

agentic reasoning setting, we have a

4:11

problem. We have a model that thinks

4:13

about the problem. It takes an action in

4:14

the world. We get some feedback. Maybe

4:16

we fail. We think again. And we

4:18

iteratively continue this process with

4:20

feedback from the environment. Maybe in

4:22

the sense of code, that environment is

4:23

just an execution in a code

4:27

setting, right? But with a world model,

4:29

maybe we can actually simulate. We can

4:30

imagine that action. We can get feedback

4:33

in our imagined environment. So we could

4:35

actually generate execution traces about

4:36

a program without executing it. And this

4:39

gives us the ability to be far more

4:41

efficient with how we actually structure

4:43

our agentic execution. We don't have to

4:46

interact with the real world unless

4:48

we're ready to.

4:51

So let's couple this with autoregressive

4:53

large language models. Right now we have

4:56

a state of a program. We have an action,

4:58

maybe the next line, and then we get to

5:00

a new state. We take another action, etc.

5:03

And so we can sort of turn this with the

5:05

execution tracing format I mentioned

5:07

into almost a chain of thought that a

5:09

model can just interpret. A model can

5:11

learn to predict the next state of an

5:13

execution trace. And so an LLM can

5:16

autoregressively generate, token by token,

5:19

the state-and-action-to-state function

5:22

with program executions as the starting

5:24

point. Okay,

5:27

let's talk about data for a second.

5:29

For CWM, we gathered a

5:32

huge amount of GitHub data. We take

5:35

GitHub events and as I said, we're

5:38

interested in modeling things at the

5:40

repo level if we can, at the systems

5:41

level if we can. We want to have

5:43

execution traces go outside of the scope

5:44

of simple programs. And so we'll take a

5:47

bunch of PRs, we'll mutate those PRs,

5:50

predict changes,

5:51

and we'll

5:52

eventually have a raw PR data set. And

5:54

we can actually run tests or CI on those

5:58

GitHub repos when we know they're

5:59

passing and then generate execution

6:01

traces from that repo level data if we

6:03

want.

6:04

So here we are at the artifact, the Code

6:08

World Model itself. I'll talk a bit

6:10

about what we did with it, how we

6:12

trained it and then what we can do with

6:13

some of these interesting execution

6:14

trace capabilities. But first it's a 32

6:17

billion parameter dense transformer.

6:18

This is a model for research. This is

6:20

not a huge model you can't play with. You

6:23

can play with it right now. It has a

6:26

nice long context length for some

6:27

reasoning tasks and we train it end to

6:30

end. We do all the pre-training and

6:31

post-training ourselves.

6:34

The process: we pre-train on a few

6:37

trillion tokens. We mid-train on some

6:38

more domain specific data. We do some

6:41

long context mid-training. We fine-tune

6:43

further on some instruction-following

6:46

and reasoning tokens. And then we do

6:48

this joint RL and agentic reasoning

6:50

setup.

6:52

So let's parameterize the problem even

6:55

more broadly with CWM. We have a prompt.

6:58

We have an agent. We do some reasoning.

7:00

We take an action. We can use a tool. We

7:02

can emit text which is code that goes

7:05

into the environment. We take a step.

7:06

And from that environment, we get a few

7:08

things back. We get tokens. We get

7:10

rewards. We get log probabilities. We

7:12

might get compiler output. So with CWM,

7:16

we're also taking a big step back with

7:18

how we interact with the environment.

7:19

CWM is a very bash-oriented model. It has

7:23

fewer tools than do other models and it

7:26

has to learn how to use the terminal

7:28

pretty well to solve a lot of the tasks

7:29

we give it.

7:31

And this starts with SWE-RL, and with SWE-RL

7:35

we take a GitHub issue. We feed it to

7:37

the agent starting with that repository

7:39

level data set from before and we just

7:42

use bash, right? We learn commands in

7:45

bash, and that lets us mutate our

7:47

environment; that lets us mutate the

7:49

state of files. We can maybe use an edit

7:52

tool eventually or create content and

7:54

then submit things. But ultimately,

7:56

we're trying to put the model in an

7:58

environment that's very very similar to

8:00

what an engineer would be in, and

8:02

learn end to end in a bash-based setting.

8:05

Okay.

8:06

So we can bootstrap this setup further.

8:09

We can do some SFT before RL and we can

8:12

find some failure modes for the model.

8:14

We can rejection sample. So we can take

8:17

a bunch of agentic reasoning traces on

8:19

code tasks that failed and we can

8:22

basically feed those back into the

8:24

model. So in this example here, we have

8:27

a thinking trace where we're thinking

8:28

about instantiation logic for some code.

8:31

And I can look for that code. I can call

8:32

an explicit grep function. And this is

8:35

something we did with CWM again with

8:37

fewer tools and a larger emphasis on

8:40

bash as a starting point.

8:43

Let's talk about post-training for a

8:44

moment. We want to scale post-training

8:46

quite a bit. This is the trend we see

8:49

and we're getting a lot of excellent

8:50

returns from a reasoning

8:52

perspective when we post-train. So part

8:56

of solving this for CWM because we have

8:58

a small model is an opportunity to

9:00

really scale up how we do post-training

9:02

and in particular to improve the

9:04

throughput of the system and we're doing

9:07

an asynchronous RL-based setup. We have

9:09

samplers. We have an environment where

9:10

we can execute in the terminal and get

9:12

output. We have a bunch of trajectories,

9:14

reasoning trajectories, that we output. We

9:16

have a trainer where we compute

9:18

gradients and score trajectories. We

9:20

have a source of truth for the model and

9:22

then that loop repeats.

9:25

So what's the challenge here? We have

9:28

this loop, right? We have samplers

9:29

predicting trajectories. We're scoring

9:31

trajectories. We're executing in the

9:32

environment. As we're doing this, we're

9:34

going to update a model. Eventually, we

9:36

have a producer-consumer pipeline

9:37

problem. And so samplers are producing

9:39

lots of trajectories that are consumed

9:41

by those trainers. We need to

9:42

synchronize weights. And so we solve

9:45

this in CWM with a very very

9:48

asynchronous model. So of course we have

9:52

a trainer that's sending a model

9:53

checkpoint to a sampler very very

9:56

eagerly.

9:58

We have trajectories which are being

10:00

sampled and then sent back to trainers

10:02

very eagerly. But in particular we have

10:05

queues. So we actually will have many

10:07

models queued up to be input into a

10:10

sampling system. We'll have many

10:12

trajectories queued up to be scored and

10:15

then added, vis-à-vis gradients, to the trained

10:18

model. And so this setup stays

10:22

relatively on policy even though it's

10:23

highly asynchronous and we're not really

10:26

waiting for much with this setup. We're

10:28

able to achieve very very strong

10:30

throughput because of the

10:31

asynchronicity.
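
The producer-consumer shape of this loop can be sketched with two queues: one carrying checkpoints from the trainer to the sampler, one carrying trajectories back. This is a toy under stated assumptions, not the actual CWM training system: a "checkpoint" is just an integer version, a "gradient step" is an increment, there is a single sampler thread, and the names `sampler`, `trainer`, `model_q`, and `traj_q` are invented for the example.

```python
import queue
import threading


def sampler(model_q, traj_q, n_traj):
    """Produce trajectories eagerly, switching to a newer checkpoint whenever one is queued."""
    version = 0
    for i in range(n_traj):
        try:
            while True:
                version = model_q.get_nowait()  # drain the queue, keep the newest checkpoint
        except queue.Empty:
            pass  # no new checkpoint yet: keep sampling with the current one
        traj_q.put({"id": i, "policy_version": version})
    traj_q.put(None)  # sentinel: the sampler is finished


def trainer(model_q, traj_q, batch_size):
    """Consume trajectories, update once per batch, and publish each checkpoint."""
    version, batch, log = 0, [], []
    while True:
        traj = traj_q.get()
        if traj is None:
            break
        batch.append(traj)
        if len(batch) == batch_size:
            # How many versions behind the trainer was the oldest trajectory in the batch?
            staleness = max(version - t["policy_version"] for t in batch)
            log.append((version, staleness))
            version += 1          # stand-in for a gradient step
            model_q.put(version)  # publish without waiting for the sampler
            batch = []
    return version, log


model_q, traj_q = queue.Queue(), queue.Queue()
worker = threading.Thread(target=sampler, args=(model_q, traj_q, 12))
worker.start()
final_version, log = trainer(model_q, traj_q, batch_size=4)
worker.join()
print("final version:", final_version)
print("(trainer version, max staleness) per update:", log)
```

Neither side blocks on the other except when the trainer has no trajectories to read. The logged staleness is the price of that: it measures how far off-policy each batch was, and it varies from run to run with thread scheduling.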

10:34

So one interesting feature of this which

10:37

is increasingly common is that we're

10:40

actually updating models mid-trajectory.

10:42

So I have a model which we're sampling

10:44

from. It's interacting with the

10:46

environment. It's generating data. It's

10:49

executing bash commands. It's executing

10:51

code. It's getting output. And I might

10:53

actually update that model while it's

10:56

interacting with the environment. So

10:58

mid-trajectory, I could totally swap out

10:59

the model with the new checkpoint and

11:02

the trajectory will change a little bit.

11:05

Theoretically, that trajectory is a

11:07

bit off-policy, but the guarantees we

11:11

have with this system are quite strong

11:12

still in that because of the throughput

11:15

and because of the amount of data we see

11:17

we're able to make a lot of guarantees

11:19

around and take a lot of risk with

11:21

updating the model on the fly. And this

11:24

gives us really a system where there are

11:25

very very few bottlenecks overall

11:28

because we're queuing models, we're

11:29

queuing trajectories. We don't have to

11:30

wait until anything is done.

11:34

Okay, so overall we post-train on still

11:38

a relatively small number of steps at

11:40

pretty large scale and we process about

11:44

200 and some billion tokens. And this

11:48

scale works really well. It produces a

11:50

strong model, a strong open model. It's

11:53

a pretty small model. It punches above

11:54

its weight. It's very nice. It's pretty

11:56

versatile. It uses tools in bash very

11:59

well.

12:01

But what can we actually do

12:02

with this model, right? What can

12:04

we do with a model that understands

12:06

program execution traces that maybe has

12:08

a good understanding of how a

12:10

program will run and can predict the future

12:13

state of a program.

12:16

CWM traces code really well, right? We

12:18

know that. We've shown it execution

12:20

traces and I can actually give it a

12:22

function and then it can go and trace

12:25

that function line by line with very

12:27

very high accuracy. It can show me the

12:30

values of local variables at certain

12:31

points again with a lot of precision

12:35

and this gives us some pretty

12:38

interesting capabilities.

12:39

I can think about a neural debugger on

12:42

top of a model. Traditionally, right, I

12:45

have a piece of code. I don't know what

12:47

I want to write. I put some question

12:48

marks.

12:50

Historically, I might prompt a model

12:52

with natural language. I want to set the

12:54

variables left and right

12:57

to be something in particular. I don't

12:58

know what it is. Now I need to

13:00

specify very fully the ambiguity that

13:02

I'm experiencing with how to complete my

13:05

program. With CWM, I can express those

13:08

things very naturally inline with code.

13:11

And I can actually express the shape of

13:13

the program I want with code and the

13:15

model will fill in the rest. And the

13:17

model fills in the rest by understanding

13:19

that the user wrote a for loop here. The

13:21

user wrote a condition here. The user

13:24

left a variable unassigned. Well, if I

13:27

were to go execute that, I could

13:28

simulate the execution of that loop and

13:32

understand better what it is the user is

13:34

really after. And so a neural debugger

13:37

is something that helps you compose with

13:40

code side by side. It's not just

13:42

generating code and it allows you to

13:44

again express the semantics of code very

13:46

very loosely but also very very

13:48

precisely. So if I have a piece of code

13:50

where I want a certain structure, I can

13:53

ensure that the model understands that

13:55

structure and can implicitly trace

13:57

the execution.

14:01

This will make theoreticians bristle.

14:04

But I can also think about some really

14:05

ambitious things in computer science.

14:08

The halting problem we know is this very

14:10

fundamental problem where we don't know

14:13

if a program is going to halt, to

14:16

stop executing, to terminate. And in

14:19

particular, this is tough because in

14:20

order to know if a program halts, we

14:22

would have to simulate the entire

14:23

execution of the program which if it

14:25

didn't halt would take forever. So the

14:28

halting problem is in some sense a

14:31

difficult problem to simulate or decide.
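
To make the difficulty concrete, here is a small sketch of the classical alternative: simulate a program for a fixed step budget and report either "halted after k steps" or "unknown". Programs are modeled as Python generators that yield once per step; `halts_within`, `collatz`, and `spin` are names invented for this illustration. This is not how CWM works. It only shows why direct simulation can confirm halting but can never confirm non-halting.

```python
def collatz(n):
    """A program that halts when n reaches 1 (it does for every n anyone has tried)."""
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        yield n


def spin():
    """A program that never halts."""
    while True:
        yield


def halts_within(program, budget):
    """Return the number of steps if the program halts within budget, else None (unknown)."""
    steps = 0
    for _ in program:
        steps += 1
        if steps > budget:
            return None  # budget exhausted: we cannot tell whether it would ever halt
    return steps


print(halts_within(collatz(6), budget=100))  # halts: 6, 3, 10, 5, 16, 8, 4, 2, 1
print(halts_within(spin(), budget=100))      # None: unknown, not "never halts"
```

A `None` result carries no information about the program beyond the budget, which is the gap a learned model of execution dynamics would be trying to approximate.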

14:34

And so the question we can ask with CWM

14:37

is can I approximate some of these

14:38

things? Can I concretely reason about

14:42

program execution dynamics in this

14:45

sense? So can I say here's a program

14:47

does it halt? Maybe the model by

14:50

simulating execution can understand

14:53

really really high level patterns.

14:56

In the same way the model can understand

14:59

high level patterns in broader systems,

15:01

right? I could use this to debug a huge

15:03

distributed system where executing code

15:06

is very very expensive or even an

15:08

expensive function on a single machine.

15:10

Right? But the ability to have an

15:13

implicit world model internally where

15:15

I'm simulating what's happening with a

15:17

piece of code or a broader system gives

15:20

me the ability to reason about it

15:22

without executing otherwise expensive

15:24

things.

15:25

So we can make some progress with the

15:27

halting problem by building a model that

15:29

simulates it, that simulates execution,

15:32

and from there we can simulate and

15:35

approximate what it means to solve

15:37

otherwise impossible problems in

15:39

computer science. So this is pretty

15:41

interesting.

15:43

With that I want to encourage everyone

15:45

to go build on CWM.

15:49

This talk does halt. This talk does

15:52

terminate. And the model's available

15:55

on Hugging Face. We have some

15:57

code on GitHub which will help you get

15:59

started with inference in a fashion

16:01

where you can twiddle bits a bit more.

16:03

We also have a technical report again

16:05

where we really try to be as open as

16:06

possible with all of these details

16:08

around training. This post-training setup

16:10

I mentioned is explained in even more

16:12

excruciating detail as well as some of

16:14

the data that we use for execution

16:16

training and some of what we imagine a

16:18

model with these capabilities could be

16:19

used for. Thanks for your time. Have

16:22

fun.

16:23

[applause and cheering]

16:29

[music]

Interactive Summary

Jacob Kahn from FAIR at Meta introduces the Code World Model (CWM), a research project focused on building models that can reason, plan, and make decisions by modeling code execution rather than just syntax. The model, a 32-billion-parameter dense transformer, is trained on execution traces and post-trained with a highly asynchronous RL setup. It can simulate program behavior, which enables capabilities such as neural debugging and approximate reasoning about hard computer science problems like the halting problem.