Building AlphaGo from scratch

Building AlphaGo from scratch – Eric Jang

Watch on YouTube

Now Playing

Building AlphaGo from scratch – Eric Jang

Transcript

1609 segments

0:00

Today, I'm here with Eric Jang, who was most recently vice president of AI at 1X

0:04

Technologies, and before that, senior research scientist at what is now Google DeepMind Robotics.

0:10

You've been on sabbatical for the last few months. One of the things you've been doing is rebuilding,

0:15

improving, and hacking on AlphaGo. Today, you're going to explain building

0:21

AlphaGo from scratch and what it tells us about the future of AI research and development.

0:26

Before we get to that, why is AlphaGo interesting? Why is this the project you decided to do on

0:31

sabbatical rather than just hanging out at the beach?

0:34

I like making things, and AlphaGo and Go AI is one of those things that really got me into the field.

0:40

When I saw the early breakthroughs on AlphaGo in 2014, 2015, 2016 and so forth,

0:46

it was profound to see how smart AI systems could become and the computational complexity

0:53

class they could tackle with deep learning. This is a problem that has long been understood

0:59

to be intractable for search, and yet it was solved through deep learning.

1:07

That was quite mysterious to me, and I've always wanted

1:09

to understand that phenomenon a little better. My training is in deep neural nets for robotics,

1:14

where the decisions made by the neural networks are a bit more intuitive.

1:19

But AlphaGo is a problem where the decisions are the result of a very, very deep search.

1:26

It's always been very mysterious to me how a ten-layer network can amortize the simulation

1:31

of something so deep in the game tree. If you plot out how much compute it took

1:38

to build various iterations of strong Go bots over the years, you can see that in

1:42

2020 there was an open-source project called KataGo by David Wu from Jane Street, which

1:49

achieved a 40x reduction in the compute needed to train a really strong Go bot tabula rasa.

1:54

I'm not certain if it's stronger than AlphaGo Zero, AlphaZero, or MuZero, but it's very strong,

1:59

and this is what most Go practitioners today train against when they're playing an AI.

2:04

Thanks to LLM coding, what took a whole team of research scientists at DeepMind and millions of

2:09

dollars of research and compute can now be done for a few thousand dollars of rented compute.

2:14

We should first discuss how Go works. How does the game work?

2:19

Go is a very simple game that can be implemented quickly and easily on a computer.

2:24

The objective is to put down black and white stones and

2:29

try to occupy as much territory as possible. I might start by putting down a black stone.

2:34

Black always goes first. Go ahead. The way you capture an opponent's stones is that

2:38

for every intersection, if you can surround all four of its neighbors with your stones,

2:45

then it's cut off from oxygen, if you will, and it's a dead stone.

2:53

Now I control these four stones as well as this empty intersection here.

2:57

There are slight variations between Chinese, Japanese, and what are called Tromp-Taylor rules.

3:03

Tromp-Taylor rules are designed to be completely unambiguous, so this

3:06

is what all Go AIs train and resolve against. In typical Go, when humans play, you're actually

3:12

not allowed to put this white stone down here. It would be instant suicide. In Tromp-Taylor, it's

3:17

actually fine. You put it down, and it immediately resolves to death, so the outcome is the same.

3:22

Let's start over and play a few stones, and then I'll explain some more. I'll just start there.

3:32

I'm basically playing randomly here, but I'm trying to get around your stones and

3:36

see if I can surround them. This move exposes one empty

3:48

neighbor for your white stone. It's akin to a check in chess.

3:53

If you don't respond immediately by putting one here, then I can immediately capture this.

3:57

I see. Because it's the diagonals that determine whether you're surrounded or not.

4:01

The cross-section, not the diagonals. This one is surrounded on three sides,

4:06

so you're at threat of losing that stone if you don't play one immediately there.

4:11

Now you can see that I'm starting to pressure you, because by putting a stone here,

4:16

you're forced to put one here. Otherwise, you would have this

4:20

two-block to yourself. Yes. And if you think

4:23

through what happens if you were to respond here, you can probably

4:27

search into the future and deduce what I'll do in response once you do that.

4:31

You have a lot of confidence in my abilities, but I'm guessing you'd put the black here.

4:35

That's right, and then I would capture all three of these stones.

4:37

So I should just assume that this little block is gone.

4:40

Yes. In Go, it's actually okay to let an opponent capture some stones if,

4:45

for example, it lets you position to capture more stones somewhere else on the board.

4:50

This is what makes Go a beautiful game: you can lose the battle but win the war.

4:56

As the board size increases, the complexity of these micro versus macro

5:01

dynamics gets more interesting. Presumably you'd put one here.

5:06

So now I would capture this entire group, and this would be mine.

5:10

There's one more case I want to demonstrate, which I actually had a bug in my code for recently.

5:18

Let's consider a formation like this, with other pieces on the board in play.

5:28

Let's talk about how the game ends. In this territory, who controls these

5:35

areas? Is it white or black? White.

5:38

It's actually black, because I've surrounded this whole area.

5:43

Assuming I have other black stones here, it's very hard for you to break this

5:48

out of the control of these stones. And when the final score is tallied,

5:51

would these ones also count as being in... Great question. This is where different rule

5:57

sets have different ways of scoring. We should talk about how you resolve

6:02

scores between humans and how you resolve scores in computer Go, because there's some

6:06

ambiguity in how humans evaluate this. Most humans would look at this board

6:11

configuration and conclude that black has totally surrounded white, and white has no chance of life.

6:16

We could play out more here, but at the end I would capture everything.

6:21

However, if you have a way of breaking this formation and connecting white to

6:24

something outside of it, then it can flip. This is where it's a little bit hard for a

6:30

computer to decide these kinds of things. How do humans do it? It's worth thinking about how

6:34

humans resolve this, because this will map later to how we think about the deep neural network.

6:40

Humans basically say, "I think the game is done," and then you have to also say,

6:45

"I think the game is done." Then we'll say, "I think these

6:47

are my stones," and you have to agree. If you don't agree, we keep playing.

6:53

Essentially, once two humans—their so-called value function—agree on a consensus, then the

6:59

Chinese rules resolve that. In Tromp-Taylor scoring,

7:04

it's perfectly unambiguous, so it can be decided algorithmically by a computer.

7:09

If you have this at the endgame, the way you score it is that you first count how many

7:15

stones you control, and that's unambiguous. Then you count how many empty intersections

7:20

are not touched by your opponent's stones. These intersections would not count for either

7:25

player, because all of these intersections are connected to both white stones and black stones.

7:31

If this were like this, then white would get three points.

7:36

This is a little odd because a human would know that white is actually losing these points.

7:42

But Tromp-Taylor scoring would consider white to have all of these points as well as these points.

7:48

So that is a very big difference in how computer Go scores things and how humans score things.

7:53

How does the game end? The game ends when either

7:56

a player chooses to resign or both players pass consecutively. Those are the rules.

8:04

Now help me crack this with AI. Let's understand how AlphaGo

8:10

actually works and how somebody in the audience might be able to implement it.

8:13

Let's start with an intuition about the underlying search process used to make moves,

8:21

and we'll layer on ideas from deep learning to make it much more efficient and tractable.

8:27

Go is a game with just two players. We're going to draw a person here,

8:31

and we're going to draw an AI here. Let's say this person is playing black,

8:38

so they go first. They go here. Now the AI is going to make a move based on what it sees here.

8:51

There's a question of how you encode these inputs into the AI.

8:54

Maybe you could use ones and zeros, but you want to represent black, white, and empty.

9:00

You would need at least three different values. Maybe you could use zeros, ones, and twos.

9:05

The AI might see something like zero, zero, zero, zero, one.

9:23

This is the input to the AI on its turn. The AI can choose. Let's just pick three possible random

9:29

moves it can make, and I just drew these at random. Which move is best here? Well,

9:33

we don't know until the game ends. Go doesn't have any kind of local

9:37

reward for which move here is good. This is what makes Go a very difficult game.

9:41

You don't actually know who won until you really get to the end. How deep is this tree? On a 19x19

9:49

Go board, there are roughly on the order of 361 moves on any given move, and of course,

9:56

as it fills up, you have fewer moves. The number of steps in the game

10:02

can be somewhere from 250 to 300 moves. Experts might decide to end the game well before

10:08

that, but under Tromp-Taylor scoring, you actually have to play things all the way to the end.

10:12

So this could be 300 moves, a depth of 300 in the tree.

10:19

If you keep on expanding possible moves—here the AI goes, then here the human would go,

10:27

and so forth—you find that you end up with an enormous

10:41

explosion in the possible game outcomes originating from just this one state.

10:48

This is something on the order of 361³⁰⁰, which is far more than the number of atoms in the universe.

10:58

Of course, there are redundancies and symmetries, so it's not actually that, but if you were to do

11:04

a naive tree with no merging of children, you end up with a tree about this big.

11:09

What do you mean by "merging of children"? Let me use this board here.

11:13

If we start here, and then you play here, and then I play here, and then you play here,

11:20

that's equivalent to me starting here, you playing here, me playing here, and then you playing here.

11:28

Both arrived at the same spot, but through different paths.

11:30

So this child node can be thought of as a shared ancestor.

11:34

Got it. And I guess it starts at 361 but it decreases by one each time.

11:38

Yes, the branching factor decreases by one each time.

11:41

In any case, this is a very large tree. This is also why computer scientists for

11:47

many years thought that Go was not a tractable problem this century.

11:50

The amount of compute you would need to exhaustively

11:53

search every possibility is just too large. If you could, Go is actually a deterministic game.

12:00

In any given state, you can compute the best possible strategy you can

12:06

make in order to win the game. You can search all the possible

12:09

futures where you win, and then just make sure you always stay in that set of futures.

12:16

AlphaGo's core conceptual breakthrough was using neural nets to make this search problem tractable.

12:23

Before we get into how neural networks are involved, let's talk about how we can,

12:29

assuming we had a powerful enough computer, search this tree to find the best move.

12:35

In the beginning, you're not going to build out the whole tree,

12:39

because storing it would be very expensive. Instead, you might interactively figure out

12:44

which leaves of this tree are worthy of exploring and expanding into the

12:48

future to see what else is there. There are some early algorithms in

12:54

the bandit literature like UCB1, which is not exactly appropriate for a sequential game like Go,

13:02

but very much inspired the action selection algorithm used in AlphaGo.

13:08

UCB1 looks like: on every move, we're going to take the best action—the argmax over a that

13:16

maximizes Q of a, and I'll explain what Q of a is in a moment—plus some sort of exploration bonus.

13:34

On every node, we're going to track a few quantities.

13:38

Let's consider each of these a node. This is the root node where you're

13:43

making decisions from, and these are the children of the root node.

13:50

Each node is basically a data structure that stores a visit count of this child node.

14:04

How often the parent visited this node. Yes. One thing that's easy to trip on if you

14:09

come from robotics or other kinds of reinforcement learning is, where are the actions? I'm only

14:15

talking about nodes. Nodes here represent states, and because this is a perfectly deterministic

14:21

game with no randomness, you actually can just infer the action based on the child.

14:26

If I go here, that implies an action, and this is the state we resolve to.

14:32

If you ask LLMs to vibe-code an MCTS implementation, they will most likely

14:37

design the right data structure here. But it's sort of a chef's choice.

14:41

You can rewrite the tree structure however you like.

14:44

This is what Claude 4.6 wrote for me when I asked it, and it was a very reasonable choice.

14:52

Qₐ represents the mean action value of this action.

15:02

I'll use a subscript a to denote that this corresponds to taking a specific

15:05

action to get here from the root node. Taking a gets us to this node here.

15:17

We're also going to store the probability of taking this action

15:22

Again, from the parent? From the parent, yes. What are the odds that we

15:26

sample this one? This will become relevant later. We've talked about a deterministic tree for now,

15:32

so I'll bring probabilities into this later. Finally, we have a dictionary of children,

15:39

which is just more of these nodes in a classic linked list-style reference tree.

15:45

This is the basic data structure to implement a tree.

15:50

In AlphaGo, they use a slightly different action-selection criterion called PUCT,

15:56

short for Predicted Upper Confidence with Trees. When you select which child to take,

16:06

you do argmax a of Q(s,a) plus a constant. The equation forms are pretty similar. These

16:34

are both scoring criteria. You want to argmax this quantity and you want to argmax this quantity to

16:38

determine which action to take. Let's break down the intuition

16:41

of how you select actions here. Q(s,a) is the mean action value,

16:46

so how good a given child is on average. If you actually knew the whole tree,

16:52

this is all you need to select the best action. You don't really need to do more than that.

16:56

But if you're interactively building this tree as you're figuring out what the Q values should be,

17:01

then occasionally you have to try some other actions as an explore-versus-exploit trade-off.

17:08

In both UCB and PUCT, there's this term that basically rewards

17:13

taking actions you haven't taken before. As we mentioned, each node stores the visit

17:19

count of taking that specific action. Everything is initialized to zero. For a given action,

17:25

let's call it action a, initially it's zero. As n is increasing, if we've already made 10

17:36

action selections from that root node but we haven't picked a yet,

17:40

then this term starts to become quite large for a. Conversely, if we've chosen a 10 times out of 10,

17:47

then this term is quite small. It diminishes very quickly. The same thing is true here.

17:53

Just to make sure I'm understanding, let me put it in my own words. Let's focus on UCB. Conceptually,

18:03

you can think of it as two different things: the Q and the exploration term.

18:09

Let's be clear about what Q is. Q is basically saying, once we do

18:12

these rollouts—you're actually running all these simulations—you go down the tree and figure out:

18:17

if I end up at the terminal value of this tree, do I win this game or not?

18:23

You average whether I win or not across all the leaves of this tree starting from this node.

18:31

That average, you put in Q. So Q represents the probability that

18:38

I'll win this game starting at this node. That is your sort of exploit.

18:42

It's saying: I've run these simulations, and I think this is a good move or not.

18:46

The other term is saying: have I explored this branch enough yet

18:51

relative to the other actions I could be exploring, or have already explored?

18:56

If I haven't explored this branch yet, maybe I think it has a low score,

18:59

but I just haven't explored that many leaves down this node in this tree.

19:06

So I should try this even though Q, the exploit term, is telling me it's not that valuable.

19:11

Because ln(n) grows slower than n, over time you will move from the argmax being dominated

19:22

by the exploration term, which is the second term here, to the argmax being dominated by the Q term,

19:27

which is when you've done enough simulations and are confident that this is the branch to go down.

19:33

Yes, that's right. The motivation for UCB was to come up with an algorithm where,

19:39

if you don't know the payoff of the different actions you can select, this strategy,

19:47

given some exploration term here, bounds your regret in terms of how wrong you can possibly

19:52

be. I don't know the proof. I also don't know if this one is proved to have logarithmically or

19:58

square-root-bounded regret, but I think the algorithm was derived to look something like this.

20:02

You can tell these terms grow a little differently, and this is to account for the

20:06

fact that Go has many more actions for any given move compared to your standard bandit problem.

20:13

One small clarification: you talked about simulations and probabilities.

20:18

We should remember that Go fundamentally is a deterministic game.

20:22

Where does the notion of probability come from here?

20:27

If you had a very powerful computer, there are no probabilities.

20:32

You can just compute the true average of the mean action value.

20:36

So where does the probability come in? In computer Go before AlphaGo,

20:44

we've always done some sort of Monte Carlo method where we take the expected Q value averaged over

20:52

a randomly selected tree. That randomly selected

20:56

tree is where probabilities come in. The interpretation of Q is: what is the expected

21:02

action value under the random distribution induced by some random search process?

21:10

Where does the random search process come in? That's where Pₐ, of action, comes in.

21:15

If we assume a naive algorithm where you have a uniform probability of taking any valid action,

21:20

then this would just be one over the number of valid moves.

21:26

You would be taking this average over a very diffuse tree.

21:31

This is a valid integral, but it's very slow because you're going to consider a lot of trees

21:36

that have very low value. It's essentially almost

21:39

like an importance sampling problem. Only a few actions and paths contribute high

21:47

value, and almost everything else is low value. So that's a tricky problem here.

21:55

This is the action selection criterion for how you decide which moves to go down.

22:00

As you move down in tree search, you will eventually run into a node where

22:06

it's quite clear you've won or lost. At the very end of the game, when there

22:10

are no valid moves left to play under Tromp-Taylor scoring, you can decide whether you won or lost.

22:22

This is the final return of the whole game. We can assign a value, U, to a terminal leaf

22:37

node of the tree, but how do we assign values to the nodes prior to that, the parents?

22:44

You take the mean action value, which is essentially your average.

22:52

Suppose these were all leaf nodes. The mean action value of this node

23:01

is just the average of whether you won or lost at the leaf nodes.

23:07

Correspondingly, you can walk up the chain and say the mean action value of this node—let's call it

23:12

Qb—is just a weighted average of these ones here. The weighted average could depend on whether

23:23

you have a different sampling distribution. But the basic intuition is that you want to

23:27

resolve the game where you have a deterministic win or lose, and then you can go backwards—this

23:33

is called the backup step—and assign values to these nodes or actions corresponding

23:40

to the average over the final terminal leaf. If you were to do this without neural networks,

23:47

it would still be intractable. You would have trouble finding

23:51

which actions to sample. A lot of the actions would

23:55

contribute very low value, especially if you're trying to fight your way out of a losing position.

24:00

Only a few actions give you high value, so the search in practice is still very expensive.

24:05

But the idea is that because Go follows a tree structure, you can inform a very good

24:14

estimate of the value of this node based on the downstream values, assuming they're

24:19

all correct and you've searched deep enough. Your earlier explanation—about the sorts of

24:25

states where it's obvious to a human who's going to win, but deterministically you still have to

24:31

play it out—actually drew upon the intuition for why 1) the value function is both trainable and 2)

24:39

necessary in order to learn this game effectively. Maybe it's worth defining value

24:45

in the first place. We talked about U being your

24:51

final resolution of whether you won or lost. This is a terminal leaf node condition.

24:56

Humans don't play all the way to the leaves of the tree.

25:00

They stop dozens of moves before, maybe even 100 moves before in high-level play. How do they know?

25:09

You can think about humans as implicitly having a neural network called a value function that

25:16

takes in a board state and evaluates p(win). The human glances at the board and knows,

25:27

"I'm probably going to lose." They're essentially running a neural

25:30

network that looks at a board and implicitly amortizes a huge number of possible game playouts.

25:38

They take that average and decide whether the board is winnable or not, and whether they should

25:42

concede or keep playing. This is remarkable. If you think about the beauty of something like this,

25:51

a neural network in a human can somehow do all of this simulation at a glance.

25:57

They just know within a few seconds, without actually playing every single game logically,

26:03

based on crystallized knowledge and experience. They can do this. This

26:07

gives us a hint that in games like Go, there are ways to radically speed up the search process.

26:14

This is one of the fundamental intuitions behind why AlphaGo works.

26:18

You can train a value function to look at a board and quickly resolve the game without playing out

26:25

all of these trees to a very deep search depth. Makes sense. I will say for the audience,

26:36

for previous episodes when I was prepping and it seemed relevant to understand how AlphaGo

26:39

works, I would find it very confusing. But it's the kind of thing where once you

26:44

understand the problem in this way and then build the next few pieces, it is actually a lot more

26:49

understandable and it makes a lot of sense. It's okay to be confused right now,

26:53

but it's probably simpler to understand, by the end of this lecture, than you anticipate.

26:59

I’ll just make that note for the audience. The important intuition at a high level—to

27:03

step back about where we're going with all this—is that classically, for games like Go,

27:09

you could build a tree, but we don't have computers powerful enough for that.

27:15

Estimating the value of every action you could possibly take is also hard because

27:19

you don't know until the end of the game. You could take averages by playing them to the

27:24

end, but that's also hard because you don't know which actions to take to sample these averages.

27:30

Conceptually, there are two problems: the breadth of the tree and the depth of the tree.

27:35

AlphaGo gives us a way to shrink both of those to be tractable.

27:41

That's essentially the core idea behind it. We take the idea that humans can glance at a

27:47

board and instantly predict whether we win. That maybe gives us the opportunity to

27:51

truncate how deep we search. We also know that humans can

28:06

intuitively, at a glance, decide what moves might be good on a Go board.

28:10

These are two things we can use deep neural networks for, to accelerate the search process.

28:16

Before we talk about neural nets, let's go back to how this playout works.

28:20

We've only talked about making one move. The AI looks at this encoded Go board. It

28:25

has a tree. It searches deeply into the tree to find out which of its actions might be the best,

28:32

takes that action, and goes back to the human. Now the human sees a Go board that looks like

28:40

this, and they make their move. Maybe they put their stone here.

28:50

Now we go back to the AI, which now looks at a new encoded board.

29:07

I've used 2 to denote the AI playing as white, 1 to denote the

29:11

human playing as black, and 0 as empty. On the AI's turn, it does the MCTS tree

29:17

search all over again from scratch. It throws away the old tree that it

29:21

searched last round. Now there's a new root

29:23

node and it begins to search anew. You can basically think about MCTS

29:31

as a search algorithm that decides which moves to play best, aided by

29:36

neural networks, and it's done on every move. Let's talk about the neural network part of this.

29:46

While you're erasing, another thing that was important for me to understand was about the MCTS

29:52

data structure with nodes and children of nodes. This is done per move and reinstantiated

30:00

once a move is made. A human makes a move,

30:03

then the AI looks at this and runs a bunch of simulations to figure out what move to make next.

30:12

A simulation is exploring one more node in the MCTS tree.

30:19

Once you run 1,000 simulations, that informs the probability of what move to make next,

30:29

as you'll explain. That’s what you store. You choose the best move given those probabilities.

30:34

You discard all of that, the next player makes a move, and you restart this process

30:39

at the beginning of every move. Correct. One small addendum:

30:42

you don't discard all of that. You keep one thing behind that we'll use later.

30:47

Just like I did for Reiner, I wanted to make flashcards for this episode so that

30:50

people could retain these concepts. And ideally, an LLM could generate

30:54

some candidates for me to then refine. But to actually get high quality suggestions,

30:58

I needed to design a whole pipeline where the AI could take and ingest screenshots of

31:03

the blackboard at the right timestamps and then make SVG diagrams in case visuals were helpful,

31:08

and then run the writing and drawing through a critic and then revise the card in response

31:12

to this feedback. It's very hard to

31:14

accomplish this just by stacking LLM calls. This sort of step-by-step recipe works much better

31:19

if you have a durable agent that's been engaging with the task across all the previous stages.

31:23

So I used the Cursor SDK to spin up an agent for each card.

31:26

The Cursor harness saved me a bunch of work in designing some custom context scaffold or

31:31

figuring out how to design tool calls for taking screenshots or making animations.

31:36

These agents all run in the cloud, so I don't have to worry about leaving my laptop open.

31:40

I just get an email when I have candidates to review.

31:42

You can check out my cards at flashcards.dwarkesh.com.

31:47

You can start building with the agents SDK at cursor.com/dwarkesh.

31:54

Now that we have a basic intuition of how moves are made with search, we're going to

31:57

talk about how neural networks can speed this up by providing an analog to human intuition.

32:03

There are two networks. There is the value network, which takes in a

32:09

state and predicts, am I going to win or lose? It's a binary classification problem.

32:17

Then we have a policy network, which induces a distribution over good actions to take.

32:26

I'm going to draw a one-dimensional flattened move distribution, but this is really a square grid.

32:38

These are the probability distributions over good actions.

32:42

Both of these are categorical classification problems.

32:45

You can train this like any classifier with deep learning, cross-entropy loss, that kind of stuff.

32:54

The specific architecture does not matter too much.

32:57

I tried a few different architectures. Transformers work, ResNets work. For

33:01

small data regimes, my experience is that ResNets still outperform transformers and give you more

33:07

bang for the buck at lower budgets. But this may not always be true.

33:10

Wait, why is that? They provide the

33:12

inductive bias of local convolutions. Generally, transformers start to outperform

33:18

residual convolutional networks when you want more global context.

33:22

One interesting finding from the KataGo paper was that they found it quite useful to pool together

33:29

and aggregate global features throughout the network, to give the network a global

33:36

sense of how to connect value from one side of the board to the other.

33:40

What does it mean to aggregate global features? If you have a very large 19x19 Go board,

33:48

and you've got some battles going on here and some battles going on there, when you pass

33:55

this through a convolutional neural network, the receptive fields are going to be good at computing

34:02

local things and making that invariant. But they won't be able to connect these

34:08

two features easily. They need to be pooled

34:11

together and attend to each other somehow. The argument for why transformers are good

34:17

for computer vision tasks—vision transformers and so forth—is that because they have global

34:23

attention across the whole thing, they can more easily draw these connections.

34:27

But you do need more data there so that you can learn the invariant local features through data.

34:36

I've tried very hard to make transformers work for this problem because I was curious if transformers

34:40

would present some sort of breakthrough in Go and just remove a lot of those tricks.

34:44

But try as I might, I haven't figured out a way to make transformers better than ResNets for now.

34:50

One more tangential question. It makes sense why transformers, with their global pooling

34:56

of information, would be better if you need to consider information that is not just spatially…

35:04

CNNs give you a bias that the things next to you are especially relevant.

35:09

And then they're aggregated up slowly. For games where it isn't that relevant

35:15

what is happening locally—you just have to consider the whole thing—you're

35:18

saying transformers would work better. We’re talking about the spatial dimension.

35:22

How about the temporal dimension? Right now we're only considering the previous move because

35:27

it is a deterministic full-information game. What if it were something like poker or Diplomacy,

35:35

where a bluff they made a while back is relevant to understanding now?

35:40

And to decide your next move, you need to consider all those previous states.

35:44

Would that change the consideration of what inductive

35:47

bias and which architecture is most relevant? Great question. Go is a perfect information game.

35:54

In perfect information games, there does exist a Nash equilibrium strategy for which

36:00

you can do no worse than any other strategy. If you know your opponent has a particular bias,

36:07

like they love to play aggressively, you can in principle counter that specific strategy

36:12

better than a Nash equilibrium policy. But to counter any given strategy,

36:18

there exists a single Nash equilibrium that can be decided solely using the current state.

36:25

That is a design choice AlphaGo made, which in hindsight turned out to work very well because

36:31

the Nash equilibrium seems to be superhuman. No human strategy seems to be able to beat it.

36:38

There are variations of this where you would need to consider temporal history.

36:42

This is a very exciting research area, and I'd encourage people to fork my repo and try it out.

36:48

If you were to play 2v2 Go, you actually need to model your partner's behavior.

36:54

You may not have information on how they play, so you need to aggregate information on how they

36:58

play so you can respond accordingly. These are situations where it's no

37:04

longer a perfect information game. In games of imperfect information

37:08

or partial observability, you do need context to build a model.

37:13

That's a place where things can get very exciting in terms of self-play or Diplomacy-style games.

37:21

Returning to the neural network, the architecture again is not super important.

37:25

You can get it to work with transformers. You can get it to work with ResNets.

37:27

I found that for low-budget experiments, ResNets work a little better.

37:32

You can also use Karpathy-style AutoResearch hyperparameter

37:35

tuning to make your architecture pretty good. You don't have to worry too much about that.

37:40

You just need to set up the problem so that you have a target optimization.

37:46

We're going to pick a somewhat arbitrary architecture that worked for what I did.

37:50

But again, this part is not super important. You have your encoded board state and, similar

38:00

to an RGB, we're going to have three channels. One channel to encode black, one to encode white,

38:06

and one to encode empties or a masked region if you want to train on multiple board sizes.

38:17

I'm actually not going to talk about multiple board sizes for now. That's a bit too

38:20

complicated. We have this two- or three-channel RGB-like image, and we feed it into a ResNet.

38:33

Then we have two branching heads. One head predicts the value function,

38:40

which is a single logit, let's just call it R¹. Then we have the policy, which is R³⁶¹.

38:54

This is the architecture. We're going to train this to predict the outcomes of games

39:01

given the board state, and we're also going to train it to predict what are good moves.

39:06

The original AlphaGo paper, called AlphaGo Lee, initialized this network with a

39:13

supervised learning dataset of expert human play. Later, they removed this restriction by having

39:18

the model teach itself how to play well. I find it super nice for implementation,

39:23

for your audience, to always initialize your experiments to something easy and get

39:30

the problem working before trying to bite off the whole thing and learn tabula rasa.

39:37

In deep learning, initialization is everything. You always want to initialize your research

39:43

project to something as close to success as possible, especially if you're doing

39:47

something new that you haven't done before. Always pick something that works and then

39:50

get it to do something better, rather than start from something

39:52

that doesn't work at all and try to make it work. Under that philosophy, it's a great idea to start

39:59

with something that has a good initialization. We're going to take human expert plays

40:04

and train this model to predict good actions. We're going to take all the moves in which

40:14

an expert won and predict those actions. Regardless of board state, whether you won

40:20

or lost, you're going to predict the outcome. You might be wondering, for some of the early

40:24

boards where only one stone has been put down, how could

40:28

you possibly know who the winner of this game is? Well, if you have hundreds of thousands of games,

40:36

on average you'll probably see that boards starting like this have half of the games

40:41

branch off and win, and half branch off and lose. So that'll actually be fine. When you

40:46

train this model to predict those outcomes, the logit will sort of converge to 0.5.

40:52

For these things, it's expected that once you train the model, a starting board state will

40:57

look like 0.5, and then as you progress towards the end of the game, the win probability will

41:06

either go up or down. This is your move number. As you get hundreds of steps into the game,

41:18

it becomes much clearer who's more likely to win or lose under your expert data distribution.

41:24

I didn't understand the significance of why this way of thinking about value is

41:29

especially relevant to the expert data. It is not relevant to the expert data.

41:32

It's true for any data that you trained it on. If you were to learn tabula rasa, you would also

41:37

expect this to fall out. Imagine you're vibe coding AlphaGo. You gather some expert datasets

41:48

from KataGo online, or you have a dataset of human players, and you train this model.

41:54

It turns out this model is already a pretty good Go player.

41:56

It will most likely beat most human players. If you just take this policy recommendation

42:02

and take the argmax over these probabilities—if you take the argmax and just take this action

42:12

as your Go play—it'll be a very fast Go player that doesn't think in terms of reasoning steps.

42:18

It just shoots from the hip, and it'll be a very strong Go player.

42:21

This is already quite miraculous if you think about how ten neural network layers,

42:26

maybe under 3 million parameters, can already do something that impressive. You can start

42:35

this way. It's important when implementing this to verify that this is probably true.

42:39

It's good to verify that your Go rules are implemented correctly and that you

42:43

can run these simulations relatively quickly. Just as a checkpoint, you want to make sure

42:51

you can actually do this basic step before you try to layer on more complex things like search.

42:58

But we can do a lot better than taking the raw neural network and playing the moves.

43:05

So let’s apply the neural network to improve Monte Carlo tree search.

43:10

We start with our root node, and we now have a four-step iterative process to do MCTS.

43:24

This tripped me up when I was first reading the paper and trying to understand it.

43:29

Essentially what we're going to do is choose a number of simulations, and this number varies.

43:40

It can be somewhere between 200 and 2,048. I believe in the AlphaGo vs. Lee match,

43:47

they used tens of thousands of simulations per move because they really wanted to boost

43:50

the strength of the model as much as possible. But in training, you don't actually need too many.

43:55

KataGo, I think, uses something in this order as well.

43:58

Do you know if they used… If you watch the documentary,

44:00

they had a laptop out during the game. They didn't use the laptop itself. It was on some—

44:04

It was on some TPU pod, I think. Honestly, kind of unfair. Lee is

44:11

not using 1E22 FLOPS to do a move. Fair enough. Interestingly, modern Go

44:18

bots don't need that much compute at test time. What we'll find out, as we talk about how the

44:25

MCTS policy improvement works, is that over time the raw network actually takes

44:31

all of the burden of that big TPU pod and just pushes it into the network.

44:35

You can do all of that work with one neural network forward pass.

44:40

The TPU pod will always add the extra oomph on top, and that's what they wanted for the match.

44:46

We're going to pick this number of simulations thing, and for every simulation, we're going

44:51

to do several things simultaneously. We're going to see which moves are

44:56

the best in the current tree. We're going to add extra leaves

45:01

to the tree if we get to a point where we need to add a leaf, and we're going

45:04

to update the action values for the tree. Every simulation involves this four-step process:

45:13

selection, expansion, evaluation, and backup. At the beginning of our Monte Carlo tree search,

45:29

our tree is very basic. It only has the root node,

45:32

or our current board that our AI wants to play at. We're basically going to select

45:39

the best action for this. When this root node is created,

45:43

we also know that we can evaluate it under our neural network and get the quantities Vθ,

45:50

as well as our probability over actions. And I’m going to say root.

46:00

For all of the actions here, we can create a bunch of children.

46:09

In this case, I'm drawing a 3x3 board with one missing, so there are eight possible

46:15

children associated with this root node. Each of these has an associated probability

46:34

of taking that action, so there's P₈, P₁, ... P₂, and so on. So at the beginning of our Monte

46:43

Carlo tree search, we have our root node, and we can initialize it with some children.

46:48

The policy network evaluated on the root node gives us—on a 3x3

46:52

board with one existing stone placed—eight possible children that this AI could take.

46:59

With each of the children, the policy network also gives us the probability of selecting that child.

47:04

The first step is to do the selection of the tree. Again, this is a very shallow tree.

47:08

All we have so far is essentially a tree of depth one.

47:11

Our first move is to select by maximizing, or argmaxing, the PUCT criterion, which is basically

47:18

Q(s,a) + CPUCT x Pₐ x (√N / (1 + Nₐ)). For each of these, Nₐ is zero for all

47:42

the actions initially. N is zero. So we're going to pick according to this.

47:52

Initially, the chosen action is most likely going to be biased towards

48:01

the highest-likelihood action, because these are uniform for every node.

48:07

Let's suppose P₁ was the highest-probability node, so you selected this one here.

48:13

Now you get to this node and you realize it's not a leaf node.

48:17

It's not a terminal game, so you cannot resolve the final resolution. The next

48:21

step is expansion. You will then run this board state through the policy network.

48:31

Note that this is the AI's move. The AI is making this move.

48:36

When we expand this tree, we're now thinking about what the human, or any opponent, might do.

48:51

When we evaluate the node here, we're going to evaluate it from the perspective of this player.

49:00

This node has possible actions that we could take, and we expand the leaf nodes here.

49:07

For each of these nodes that we could arrive at, we're going to now check how good they are.

49:18

From here, the human could play here, here, or here.

49:23

We're going to store Vθ for each of these, so Vθ of Node₁’, Vθ of Node₁’’, and so on.

49:46

We're basically using our neural network to make an intuitive guess of how good this

49:53

board is from the perspective of this player. Fortunately, because it's a zero-sum game,

49:59

it's easy to deduce that the value for this player at this step is just one minus

50:04

the value from the other perspective. It's easy to flip the search process

50:08

depending on which player you're at. This is the expansion step. You've taken a non-leaf node,

50:15

expanded it, and evaluated the value. This is essentially a quick guess

50:19

as to whether I'm going to win or not if I were to play to the end.

50:23

You can almost think about Vθ as a shortcut for searching to the end

50:28

of the tree, for any given simulation. This is essentially the evaluation step.

50:35

We're evaluating the quality of each of these boards.

50:39

In the original AlphaGo Lee, they did something kind of interesting.

50:43

They took this value and averaged it with the value of a real Go playout.

50:49

They actually played a real game from here all the way to the end.

50:55

I'm just going to draw this squiggly line to indicate some path.

50:59

They play this all the way to Tromp-Taylor resolution of a full board.

51:06

So this is a zero or one. They took this value and

51:15

averaged it with this one here. The formula was α x Vθ of some node

51:25

plus (1 - α) of a truly randomly sampled playout. You might be wondering how they play this out.

51:38

It would be very costly to do another search on this playout, almost like a tree within a

51:43

tree. They don't do this. Instead, they just take the policy network and play it against itself.

51:49

They use it as both players, and they just play it all the way to the end.

51:52

This helps ground the estimates here in reality because you can get a single-sample

52:00

estimate of whether you win or not. In the endgame, where the board is

52:04

almost resolved, this becomes quite useful because the play according to the policy will

52:10

most likely decide a pretty reasonable guess of the game, so you're not facing a problem

52:15

where this becomes untethered from reality. It turns out this is totally unnecessary.

52:20

In all subsequent papers after AlphaGo Lee, they got rid of it.

52:24

In my implementation, I did the same, and it speeds things up a lot because you don't have to

52:28

roll these games out on every single simulation. Okay, just to reinforce my own

52:34

understanding and re-explain it. For the audience, by the way, in case it's not

52:39

obvious, the P there in the select, that is the probability coming from the network in this case.

52:44

Correct. The policy network here. Fundamentally, a simulation, just think of it as

52:54

rolling out one more node in the search process. Almost. A simulation is easy to think about

53:01

when the whole tree already exists. You just walk down the tree using the

53:06

PUCT selection criterion, and then you keep going. In AlphaGo, the data structure is such that we

53:16

begin with a tree that basically only has depth one—its only children—and you want to iteratively

53:23

build out the tree as you're also selecting actions down the tree. That's the core thing

53:28

here. Because Go is such a combinatorially complex game, you cannot afford to build

53:32

the tree in advance and then search it. You must search while building the tree.

53:38

Let me finish up with the last step, which is the backup.

53:41

Once you've scored these things, the Q value assigned to the node here for taking this action

53:50

is just the average across your evaluated values. You take a running mean over all the simulations

54:01

you've taken, averaging the values of the children nodes.

54:05

That's the backup step, and once you evaluate this, you can recursively go back up.

54:09

If you know the action value of this node, you can then take the average on its parent, and so on.

54:16

You have this four-step process: 1) You're choosing the best action you know of so far.

54:21

2) You may run into a node you haven't been to before, so you need to grow the tree a bit. 3)

54:28

You run it through the network to guess whether you're going to win or not. 4) You walk all the

54:32

way back up to the root node to update your values on what the best moves are.

54:38

As you do this iteratively, because you're always selecting according to this criterion,

54:46

you're always going to be selecting the best action you think at any given branch.

54:51

The final visit counts of how often you chose these things will reflect

54:57

your correct policy distribution as induced through this search process.

55:03

The visit count we stored in the node earlier actually becomes the vote for which action

55:08

we should finally select. As a test of understanding,

55:14

it's worth thinking a little bit about whether we could make this even simpler.

55:18

Could we actually maybe even get rid of this one and still make the thing work?

55:24

Recall that when you do an expansion and then an evaluation at this node, you are checking the

55:30

win probability of each of the child nodes. So if this one is one and these are zero,

55:37

you do know something about which action might be better to take.

55:42

Why would you still need this, right? Why not just normalize the values into

55:48

a distribution and call that your policy distribution? This is fine.

55:53

You can do this, and it probably does work. But in practice, having a single forward pass

55:59

that gives you a pretty good guess is how the breadth is pruned out. There is a duality here.

56:10

It would be weird if the policy recommended an action that disagreed with the value.

56:15

If a policy said this was very high probability, but this one said it was a low value,

56:20

then there's something fundamentally wrong between your policy head and your value head.

56:25

They are linked, and you probably could get rid of this if you came up with a different way to

56:29

recover it from just the value evaluations. Right. Just to make sure I understand, the

56:36

reason you don't do that is so you don't have to do 360 independent forward passes to say, "Okay,

56:41

here's the value of everything. Let's argmax over it." Instead, you can just do one forward

56:44

pass and get the probabilities of all of them. You can usually batch these somewhat efficiently,

56:52

so it probably isn't a huge computational burden in practice.

56:55

But yes, you would have to pass up to 361 boards into a single mini-batch update to

57:02

evaluate all the values, then normalize them. There's actually a more important reason why

57:08

we still do this, which is how Monte Carlo tree search is used to feed back on itself

57:13

and recursively improve its own predictions and search capabilities.

57:18

That's where having this as an explicit entity you're modeling, rather than an implicit

57:23

normalization over your value, is a good idea. Makes sense. Okay.

57:28

We talked about the simulations. What you end up with as you roll out the number of simulations

57:33

is a tree that looks like… I'm drawing a very low-dimensional version of this.

57:41

Of course, in the real game, it's much more high-dimensional.

57:45

You'll end up with a tree structure that has a lot of leaves that terminate and are not visited

57:54

again because their value is deemed to be too low. But along one path, there will be a set of actions

58:00

with very high visit counts that gravitate towards that one set of decisions as you increase N.

58:09

This is the mental picture of what the tree in Monte Carlo tree search looks like.

58:13

You should contrast this with an exhaustive tree like in tic-tac-toe, where there are nine actions,

58:20

then eight, then seven and six, and so it's a nine-factorial-sized tree.

58:26

The Monte Carlo tree search in Go is very sparse.

58:29

It only considers the paths that you've expanded children nodes on.

58:36

Now that we have the search algorithm that applies the value function as well as the policy function,

58:44

we can talk about how the Monte Carlo tree search algorithm can act as an improvement

58:51

operator on top of these guys here. Twenty years ago, Jane Street's data

58:55

center fit in the corner of an office. Ron Minsky, who co-leads the tech group

59:00

there, told me about how it all got started. One of our compute clusters we called The Hive,

59:04

and I remember the first version of the Hive was literally like six Dell boxes stacked on

59:08

top of each other at the end of the row. And the trading systems themselves we also

59:12

had there because we actually wanted the ability to make sure we could turn the damn thing off.

59:17

I mean there were ups and downs, like literally at some point you know,

59:20

one of the people who was cleaning the office unplugged one of the trading systems in the

59:24

middle of the day as they were vacuuming. So, you know, in the end it is in fact

59:28

better to have it all in a data center. Jane Street's data centers have come a long

59:31

way since those sixth Dells, and I got to tour one of them in Texas with Ron and Dan Pontecorvo,

59:35

who leads Jane Street's physical engineering team. You know, these cabinets, these GB300 cabinets,

59:40

consume at peak about 140 kW each. Compare that to traditional air cooled, you're talking about

59:45

10 to 40 KW? It's a lot more. We got deep into the details

59:49

of running one of these data centers. Things that I had never considered before.

59:52

It's filled with a liquid, a mix of deionized water

59:55

and propylene glycol, 25% of propylene glycol. That’s to inhibit any bacteria or algae growth.

60:02

I don't love the world where we have to worry about bacteria growing in our servers.

60:06

I got to see way more of what actually happens in a data center than I've ever seen before.

60:10

Jane Street was willing to literally pull up the floorboards and take out the racks and take me to

60:14

the back where all the chillers are. You can check all of this out at

60:18

JaneStreet.com/dwarkesh where we posted the full tour.

60:23

We now talk about the RL part of how this thing gets stronger by playing itself.

60:30

Let's say we play a game. You make a move. The AI will compute the search,

60:43

and this is the visit count distribution. Let's say this is your initial policy

60:49

recommendation at this node. After MCTS, it gets more

60:56

confident about one of these actions. Maybe the distribution looks a bit more

61:01

peaky like this based on the search. Of course, you can tune the search

61:05

process so that it ends up more diffuse, but that's probably not a good idea.

61:08

MCTS should get more confident about specific actions over others.

61:12

It might place a lot of weight on other actions initially, and then as you increase

61:17

the number of sims, it should converge to a very peaky distribution. Let's call this π.

61:25

Let's wrap this in an MCTS operator of (a | s). After applying the MCTS process, your policy

61:38

recommended distribution looks like this. It’s a bit more peaky than the previous one.

61:43

Then you take the argmax, or maybe you just sample from this.

61:47

It doesn't have to be the argmax. You make your move, throw away the

61:51

tree, and begin anew on the next move. Again, you compute a new distribution.

62:08

Initially, maybe your guess looks like this, and then you refine it through MCTS.

62:16

There should be one more X on the board, right? I'm sorry, that's correct, yes. Something that

62:22

looks like this. On every move, you have your initial guess from your policy network.

62:32

Then the search process, which combines your policy network and your value network,

62:35

arrives at a more confident action that you take, and so on and so forth.

62:40

Then the game ends, and one person wins and one person loses.

62:46

The beauty of how AlphaGo trains itself is that it can actually take this final search

62:53

process—the outcome of the search process—and tell the policy network, "Hey, instead of having

62:59

MCTS do all this legwork to arrive here, why don't you just predict that from the get-go?

63:04

Why don't you not use this guess and predict this to begin with?"

63:07

If you have this guess to begin with in your policy network, then MCTS has to do

63:11

a lot less work to get things to work. If we draw a test-time scaling plot…

63:18

Let's say this is the number of simulations. At zero simulations, your implicit win

63:26

rate is here. If you just take

63:33

this raw action, this is what your win rate is. Let’s say we increase the number of sims, maybe

63:39

you have a win rate curve that looks like this. When you search for 1,000 simulation steps,

63:49

that gets you to a policy here that gets you to here, which is great.

63:56

But if you were to distill this MCTS policy network back into your shoot-from-the-hip policy

64:03

network, then you could actually start here. Let’s say this was zero, by distillation,

64:14

if you then spend another 1,000 sim steps, you actually get to here.

64:19

It's almost as if you could amortize the first 1,000 steps into the policy network instead of

64:26

the search process, then you could begin at a much better starting point and get a much better

64:30

result for the number of sims that you play. The sigmoid type nature of test-time scaling

64:36

as the number of simulations increases, the increase in win rate gets smaller.

64:41

Is that true even for the distilled network? That is to say, is there some gain where we

64:46

start from the distilled network and get these early gains again,

64:48

or is that just inherent to the nature of MCTS? To be honest, I actually don't know the test-time

64:53

scaling behavior of MCTS simulations. I believe it might be quite sensitive to

64:58

how strong this one is in practice. I'm just drawing a monotonically

65:01

increasing function that gets to one. Don't pay too much attention to the

65:05

shape of the curve. Just know that it's

65:06

monotonic with respect to sims. The idea of MCTS is very brilliant.

65:15

We got something better by applying search. Now, on our next iteration of updating this

65:22

network, we're going to train it to approximate the outcome of 1,000 steps of search.

65:28

Instead of starting here, we get to have our neural network start here, and then the

65:32

play gets stronger once we then apply another 1,000 steps on top of it. You can keep going.

65:37

The training algorithm for AlphaGo is to basically take the games where you've applied the search on

65:43

every move that the policy encountered—whether you won or lost, and that's quite important—and just

65:49

train the model to imitate the search process. There's an analogy to robotics,

65:55

which is the DAgger algorithm. First I'm going to draw a

66:00

schematic of the states: S₀, S₁, S₂, S₃. Let's say we took a series of actions in

66:10

an MDP to get a trajectory. These actions may be suboptimal. Maybe we lost at the end of this game.

66:25

There is a family of algorithms that basically take trajectories and relabel

66:31

the actions to better trajectories. Maybe a better action here would

66:35

have been to take A₀’. A better action here would

66:39

have been A₁’, and yet another one, A₂’, A₃’. What MCTS is doing is saying: you played this

66:52

game where you eventually lost, but on every single action, I'm going to give you a strictly

66:57

better action that you should have taken instead. It does not guarantee that you are going to win,

67:02

but it does guarantee that if you take these tuples as training data and retrain your policy

67:10

network to predict these actions instead of the original ones, you're going to do better.

67:15

This is very related to DAgger in robotics and imitation learning,

67:19

where you want to collect an intervention here. Even if you're in a not-great state—for example,

67:25

a self-driving car that veers off the side of the road—there is still a valid

67:28

action that corrects you and brings you back. Pedantic question, but is there a guarantee

67:35

that MCTS must be better than the policy? For example, you could imagine early on in

67:39

training, because MCTS is informed by the value network, that when the value network hasn't been

67:47

well-trained on finished games, MCTS is worse than a randomly initialized policy.

67:54

Is it just a heuristic that MCTS is better than the policy, or is there some guarantee?

67:59

In practice, it is a heuristic. It does work in practice, but let

68:04

me illustrate an example where MCTS can give you a worse distribution than your policy network.

68:09

This can often happen if your self-play algorithm has trained to a good point,

68:13

but then somehow it collapses because it's not trained on diverse data or something.

68:20

Let's say we have a board state where the policy recommendations are very good.

68:25

So Πθ (a | s) is great. But somehow, maybe because we're

68:33

playing a lot of games where the bots just resign instead of playing all the way to

68:37

the Tromp-Taylor resolution, they forget how to evaluate those late-stage plans.

68:43

Like in the case we showed with the corner play, maybe 100% of our training data in

68:48

our replay buffer has lost examples of how to evaluate the value function at those states.

68:54

You might end up in a scenario where your terminal value is very bad.

69:01

If the terminal values of the leaves are not good, then this will propagate all the

69:06

way up and cause your PUCT selection criteria and your backups to be off.

69:12

And then you end up visiting a very different distribution than what

69:15

your policy initially recommended. Also, if your number of sims is low,

69:20

you might have a variance issue where you just don't explore enough.

69:24

It's only guaranteed to converge when you take N to infinity.

69:29

Variance in your search process as well as inaccuracies in your

69:33

evaluation can definitely screw with the quality of your policy recommendation.

69:38

That's why it's not a guaranteed improvement. That is why I suspect AlphaGo Lee had the playouts

69:45

to the end in their training algorithm, so that they could ground this thing in real playouts.

69:50

In practice, what you could also do is, for 10% of the games, prevent the bots from

69:55

resigning and just say, "Resolve it to the end." That way you get some training data in your replay

69:59

buffer to really resolve those late-stage playouts that normal human players would not play to.

70:09

If you assume that the value functions are correct,

70:12

this is why MCTS gives you a better policy. It's a very critical chain of assumptions.

70:18

Assuming this is accurate, your search process should give you a better

70:22

recommendation than your initial guess. If you have a cold-started policy—if

70:27

you have an AlphaZero-type thing—really what's happening for the first few epochs

70:31

is that the policy is kind of useless. What you're really just doing is, "Hey,

70:34

let's play full games, and once we have played full games, for the preceding moves,

70:41

we'll have labeled who won and who didn't win." The loss for AlphaZero has two components:

70:50

how good is the policy relative to MCTS, and how good is the value prediction relative to

70:55

who actually won the game from this move. You can think of this being applied to

70:59

every single action or every single move. Really what's happening at the beginning of

71:03

AlphaZero training is just that we're trying to get the value function to actually predict

71:07

who will win the game if you find yourself in this state and you're this player. Functionally,

71:14

that's all that's happening. Later on, once that's well trained, now the policy is also improving.

71:19

Correct. There’s one trick I found to be pretty useful.

71:22

This is not a peer-reviewed claim, so take it with a grain of salt.

71:26

I found it useful in my own implementation to do the following.

71:30

You want to first make sure that this is good before you invest a lot of cycles doing MCTS.

71:35

It doesn't really make a lot of sense to do search on garbage value predictions.

71:41

You want to start at a good place where this works.

71:43

AlphaGo Lee does a very good thing where it just takes human games, you train on them, and it just

71:48

works. It totally works. You can also take an open-source Go bot, play it against itself,

71:53

generate data, and that also works. If you have some offline dataset that

71:59

has realistic, good play, you can easily learn the late-stage value functions pretty well.

72:07

That's what you need to start the search process. Sorry, can you just repeat that sentence one

72:10

more time? Sure. It's

72:12

quite easy to evaluate a late-stage Go game. When almost all the pieces are on the board,

72:18

it's almost like a decidable problem because there's lower and

72:20

lower uncertainty as to the depth of the tree. So most games played to the end by reasonable

72:26

people will be good training data to train a good value function at terminal parts of the tree.

72:32

Then, as you play more games, the search will back up good

72:37

values into the intermediate nodes of the tree. As you increase the amount of data, your value

72:42

head gets a good intuition of what is a healthy board state versus a not-healthy board state.

72:48

Those are much more subtle to judge in the mid-game than at the beginning or the end.

72:52

The most difficult part to score is not the beginning, because the beginning is obviously 0.5,

72:57

and at the end it's pretty obvious who's winning. The hard part that you want to learn in the

73:01

value function is who is winning in the middle. So this is actually very analogous to TD learning.

73:06

Yes. There's a beautiful connection to TD learning that we can talk about in a bit,

73:11

contrasting it with Monte Carlo tree search. You first want to get good value functions,

73:17

and expert data can give you a quick shortcut. I recommend for practitioners, just do that first

73:22

to initialize to a good starting point. If you want to do the AlphaZero thing or

73:26

KataGo tabula rasa learning, what you can try to do is play random games on a small

73:31

board. Just take a random agent. If you play 50,000 games, you'll actually learn a pretty

73:39

good value function as well. On a 9x9 board, you can see

73:43

enough of the common patterns with random play. If you train on both 9x9 and 19x19 data—and KataGo

73:52

proposed one of these architectures—there's pretty good transfer learning from the

73:56

value head evaluated at 9x9 to 19x19. Right, because this, unlike other games,

74:02

has very much a sense… There's not like a new kind of piece introduced

74:06

when you increase the size or something. If we take it to its limit and consider a

74:10

very tiny 4x4 Go board, if you play 50,000 games, you're going to have a lot of end

74:15

states that look like human play. It's just Tic-Tac-Toe at that point.

74:18

If you broaden this a little bit to 5x5 or 9x9, it's not unrealistic to imagine that

74:24

purely random play will generate pretty reasonable-looking boards.

74:29

You can score those pretty easily. That is what gives you the bootstrapping

74:32

to then improve your policy with search. But it's very critical that MCTS has

74:37

accurate value estimates. You need to ground the value.

74:41

Ultimately, MCTS will fall apart if you don't have a grounding function for the value.

74:47

I'd be curious how much compute you save by training the value and policy on the same network.

74:53

Because they share the same representations, how much more efficient is learning?

74:59

We've just talked about how they're making similar predictions, or they

75:02

should be in line with each other. So I'd be curious if you're halving

75:07

the amount of compute you have to do by keeping them in the same network.

75:10

AlphaGo Lee, the original AlphaGo paper, had two separate networks.

75:14

In all subsequent papers, they merged them into two heads, and presumably this saves compute.

75:19

But answering that question in a rigorous scientific way is

75:23

a simple question that in practice, if you really want to chase the question down to its limit,

75:29

takes quite a bit of work to really resolve. Intuitively, yes, they share a lot of

75:34

representations. As we mentioned,

75:38

your policy network and your value network, when doing evaluations, should agree.

75:43

There really should be this consistency between them.

75:46

Tell me if this is the wrong way to think about it.

75:49

When I learn how an LLM works and how simple RLVR is as an algorithm,

75:58

I'm stunned by the kinds of things it can do. It can learn how to build very complicated code

76:04

repositories simply from getting a yes or no. Here, if you understand it more deeply, just

76:11

predicting MCTS, AlphaGo seems less impressive in retrospect the more you understand it.

76:19

You're putting in a lot of bias by telling it how it should titrate exploration as things go on.

76:26

You're building this very explicit tree search for it.

76:30

I don't know if you share that intuition where the more you understand it, the less

76:34

impressive the accomplishment in 2017 seems. I personally disagree. I think they're

76:41

profound for different reasons. I don't understand the LLM RL

76:45

enough to comment on it on your podcast. But why is AlphaGo a profound accomplishment?

76:53

It's worth stepping back a little bit. It is different from modern RL,

76:57

and we can talk a little bit about some of the algorithmic choices there.

77:01

I think the most profound thing here is that a 10-layer neural network pass, basically

77:12

10 steps of reasoning… Of course, the reasoning is not just one trail of thought.

77:16

It could be distributed representations and a lot of thoughts going on at the same time.

77:20

But by construction, let's say a 10-layer neural network can only do 10 sequential

77:24

steps of thinking. 10 steps of neural network parallelized distributed-representation thinking

77:31

is able to amortize and approximate to very high fidelity a nearly intractable search problem.

77:40

This was a breakthrough that I think most people don't even fully comprehend today,

77:46

how profound that accomplishment is. This is what also girds AlphaFold,

77:52

for example, where you have a very, very difficult physical simulation process where

77:58

you would need to roll out as so many microscale simulations, and yet 10 steps of a somewhat small

78:04

neural network can somehow capture what feels like an NP-class problem into a single problem.

78:15

It actually makes me wonder if our understanding of problems like P=NP,

78:20

or these fundamental computational hardness problems, is incomplete.

78:26

Obviously, it's not a proof of P=NP, but there's something to it that is very disturbing,

78:34

where what felt like a very hard problem can fall to a very simple macroscopic solution.

78:40

That is a very interesting insight, that a lot of problems which are proven to be NP-hard—I don't

78:47

know if Go is proven to be NP-hard, but protein folding, et cetera—neural networks can solve.

78:54

They're NP-hard in the worst case, but we're usually not concerned about the worst case.

78:58

These problems usually have a lot of structure to them.

79:01

I think the question we should be asking ourselves is about how we've been formulating solutions to

79:08

NP-hard problems in worst-case complexity. I wouldn't say this solves Go.

79:14

It doesn't give us an exact solution of the optimum, but in practice, it is extremely useful.

79:20

The same thing has been shown in AlphaTensor and AlphaFold.

79:24

Yes, there is a very hard problem that, in the worst case, seems intractable, and yet we're able

79:28

to make almost arbitrary amounts of progress. In the limit, what might this look like?

79:36

If you want to simulate something very complex like weather, or predict the future—do we live

79:42

in a simulation or not—the computing resources you need to build a very complex simulation might

79:48

be much smaller than you think, based on our ability to amortize a lot of that computation

79:54

into the forward pass of a single network. To me, AlphaGo was the first paper that

80:00

really showed this profound level of simulation being compressed into a small amount of compute.

80:07

I feel totally not qualified on the computational complexity or the math to comment on this, but I

80:14

wonder if there's an important role of chaos here. What is the problem with weather,

80:22

and why does it take 10x the amount of resources to predict weather a day out,

80:27

and continually so for every additional day out? It's because it's a chaotic system, so small

80:32

perturbations can totally change the final estimate as time goes on.

80:39

I guess you would expect that for Go and protein folding as well.

80:42

Here's an analogy to weather that might be relevant in Go. Here's our current board state.

80:55

Given what we know about both players, what is the exact board state in the future?

81:03

This is extremely sensitive to initial conditions. A single stone placed here can disrupt

81:08

the entire prediction. This is hard. Intuitively, this is the chaotic problem.

81:14

Yet somehow, we can predict who's going to win. This captures a lot of possibilities.

81:26

There's a more macroscopic quantity that we really care about, which is the

81:29

average or expectation or some sort of global macrostructure over a lot of possible futures.

81:37

In weather, it could be the same thing. We don't exactly care what the

81:42

velocity of wind 6,000 feet above a specific latitude and longitude is.

81:48

We care where the hurricane is, and things like that.

81:52

In chaos, there's the classic Lorenz attractor. If you start anywhere on the Lorenz attractor,

82:00

you don't know where you're going to end up, but you do know that the thing looks like this.

82:05

There's a kind of beauty here. Sometimes we don't necessarily

82:08

care about the microscale things. We actually care about the macroscopic

82:11

structure, and these things can be predictable. Contrast that to something like a hash function,

82:18

which is also incredibly dependent on initial conditions but doesn't have a

82:21

macrostructure, or at least hopefully doesn't, if the algorithms work.

82:24

One would hope. There's no equivalent of

82:28

a value function or broadly how the weather is going to be that's interesting there.

82:33

It's really just about what the board is going to look like exactly 100 moves from now exactly.

82:39

Intuitively, that seems correct. This is also out of my area of expertise, but I

82:46

find it interesting that the tools of cryptography and hashing have also not been able to prove

82:56

that you cannot come up with fast approximations. You cannot come up with fast approximations.

83:02

If they were able to do that, then you could prove P is not equal to NP.

83:05

In fact, we know that there's structure in many cryptographic

83:08

protocols, obviously RSA cryptography. There is structure, and that structure

83:12

is what quantum computers exploit to break them. Reiner has a very interesting blog post, which

83:19

we've talked about in the episode, where he points out that if you look at a high level at what

83:24

cryptographic protocols look like and what neural networks look like, they're extremely similar.

83:29

You have sequential layers of jumbling information together.

83:35

There's a convergent evolution in the algorithms.

83:38

In cryptography, you want the final state to be incredibly sensitive to initial conditions,

83:43

so that it comes out looking jumbled if you change anything.

83:48

In neural networks, you similarly want everything to be dependent on all the information, because

83:53

you want to process all the information and consider how it relates to itself.

83:56

You have the maximum power of a neural network at the edge of chaos.

83:59

I think there are some research papers from Jascha Sohl-Dickstein on this.

84:04

There's something quite fundamental about chaos that's not just hopeless noise.

84:05

There's something useful in chaotic systems, at least at that boundary.

84:17

But this is just thinking about it as a philosophy.

84:19

I don't know the math well enough to comment on it.

84:25

We’ll talk about LLM RL in a little bit, because there’s some connection there,

84:28

but let's go back to MCTS. What is it doing? Crucially, it is not saying that we're going to

84:36

increase the probability of winning directly. It's not going to upweight all actions that won

84:42

and downweight all actions that didn't win. Importantly, what it is doing is saying:

84:47

for every action we took, we did a pretty exhaustive search on MCTS to see if we could

84:54

do better, and we're going to make every action that we took better by having the

84:58

policy network predict that outcome instead. This is a very nice idea because you have one

85:04

supervision target for every single action. The variance of your learning signal is very

85:10

low compared to the alternative naive RL thing. Let's consider a very naive algorithm that looks

85:18

a lot more like modern LLM RL today, where we do something like take the winner of a self-play

85:25

game and encourage it to do more of that. It's worth thinking a little bit about what

85:31

alternatives we could use to train self-play agents instead of MCTS.

85:35

We use a lot of LLM-style RL these days. Is that relevant? Could we do that instead? Suppose we

85:44

have a very naive algorithm where we take a league of agents of different

85:48

checkpoints and play them against each other. For the games where a single player wins,

85:54

we reinforce those actions up and retrain the policy network to imitate those winners,

86:01

instead of the MCTS objective. Let's say you have a chain of

86:10

actions that led to a win, and you have a matchup between two agents that are basically the same.

86:19

Assume policy a (Πₐ) and policy b (Πb) are evenly matched, so their true win rate is 50%.

86:34

Let's say you play 100 games, and each game lasts 300 moves.

86:47

You're doing some sort of evolution strategy or some way to perturb these

86:52

things to get them to do different things. Or maybe you don't, and you just play them

86:55

against each other to see that occasionally one might have a better strategy than the other.

87:00

Let's say policy a wins 51 games and policy b wins 49.

87:13

This is just due to random luck, or maybe you perturbed policy a

87:16

in some way that let it do this. Just to have a very simple model,

87:19

let's pretend that for 50 of the games, they played exactly equally.

87:34

In the one game where a won, it played slightly differently.

87:37

It made one critical move that normally it would have done differently, but due to some

87:42

exploration or random noise, it happened to make a smarter move than it did previously.

87:46

So you have one true supervision signal for your policy network, and then you have 99 games

87:58

times 300 moves for which imitating those actions gives you exactly the same policy you had before.

88:07

The scale of your variance is actually very bad, because you only have one label out of

88:12

this enormous dataset of supervision actions where you want… Actually, let me clarify a little bit.

88:20

We're talking about how the good move, the out-of-distribution move, is a small fraction

88:24

of all the moves that are played across all the games on which you'd want to train.

88:29

This, of course, reminds me of how LLMs are trained with policy gradient methods.

88:36

When Karpathy was on the podcast, he called it like "sucking supervision through a straw".

88:42

It's interesting that this thing you're saying—which would be intractable and prevents

88:45

you from actually getting beyond a certain level in Go—is just by default how LLMs are trained?

88:51

Right. This is not to say it doesn't work. If you imagine increasing the number of games

88:57

to millions of samples, you actually can get some meaningful supervision samples,

89:03

so long as you find a way to mask out the supervision from these guys.

89:08

This is where things start to get pretty related to RL in terms of advantage and baselines.

89:14

Let's look at the gradient variance of a very naive approach—I'm going

89:21

to call it gradient RL—that's basically the sum of rewards.

89:52

The sum of rewards is the return. In our naive setup here,

89:56

we only have an indicator variable for the return, where either you won or lost.

90:05

In the case where you lost, your gradient is zero, so you don't train on those examples,

90:10

and when you won, you try to predict those things.

90:13

You can think about this setup as a special case of this general formula here.

90:20

The trouble here is that this is very high variance.

90:24

When you multiply these terms out to compute the variance of the gradient,

90:34

it's equal to the expectation of… Just for simplicity, we can pretend this is on average zero

90:45

or something if you're centering it at no signal. The variance here basically means that you're

90:53

taking the square of this product term. You end up with

90:58

a term that grows quadratically with T. When you have a setup like this, this thing acts

91:08

as a coupling effect on top of these terms here. Let's actually map this to an LLM case,

91:17

and we can answer why LLMs only do one-step RL instead of a multi-step RL scenario.

91:25

In LLMs, you have a decoder that might predict some words like "hello world".

91:34

In current LLM RL, they treat this entire sequence as a single action, aₜ, and big T is just one.

91:44

It's true that because of how transformers are formulated through the product of conditional

91:51

probabilities, the log probability of the whole sequence is equal to the sum of the

92:03

probabilities of individual tokens. In this case, I would say something

92:10

like log(hel) + log(lo) + log(world). This is true, and if this term were one,

92:27

then they would be the same thing. However, when you're sampling, if you

92:32

have a reward term assigned to every specific token, now you have these interaction effects

92:39

between the cross-multiplication of these terms and these terms.

92:43

The problem becomes, how do you ascribe the credit associated with

92:47

every episode to all these different terms? The thing I'm confused on is what it would

92:52

even look like to do it that way in LLMs, because you only get a reward at the end of the episode.

93:01

You could imagine a reward that says, "I'm going to give you some process supervision where you get

93:07

a reward for each of these actions on every step." Okay, so you're saying if instead of doing it

93:13

that way… Well, the way you've written it, it would be a sum at the end anyway,

93:18

so they wouldn't have to be multiplied. But you're saying instead of doing it

93:22

that way, you would just add up these process rewards at the end and then

93:26

treat that as one single reward signal? Correct, for one single log prob action.

93:29

But isn't that how it's written to begin with anyways? The sum of the rewards?

93:36

The thing that's a little bit hidden here in the math is that we’re assuming that when you

93:39

decompose the problem into a multi-step problem, you're now introducing correlations between your

93:45

actions through the computation of this guy. If you separate these things out,

93:52

this will magnify the variance of this one. In the case where you don't separate it out,

93:57

if you just have T=1, you have a single estimate of log prob and a single estimate of reward. This

94:07

term still shows up. In LLMs, the naive REINFORCE estimator looks a bit like the return of the

94:15

single action… It looks kind of like this. This is the basic form here, but it's still

94:30

a contributor to variance. You want to make sure that,

94:35

similar to how in this case we were training on a lot of neutral labels, you're sort of

94:41

penalizing the labels that don't help and only rewarding the ones that actually make you better.

94:47

Intuitively, the analogy is: can we find a term in our training objective such that

94:52

it’s actually discouraged from doing this, or these don't have any effect on the gradient,

94:58

and this has an effect on the gradient? If you applied that there, the only thing

95:03

you could do is eliminate 49 of the games. At least the way you have it written there,

95:11

it would be 51 times… Actually, the optimal case is to discard

95:17

all of these moves, and only get a gradient on that single move that you got better on.

95:23

How would you do that? This is a pretty tricky problem in practice.

95:28

This is where advantage estimation happens in reinforcement learning.

95:32

You want to subtract a term from your multiplier. Instead of an indicator function of one and zero,

95:46

you want something that behaves like a zero for all of these guys, and then a one for these ones.

95:52

You could do that if you can say, "Hey, I won this game, so this is slightly

95:58

above baseline performance." Well, you won a lot of games.

96:03

But you don't know which ones let you win because they were truly better versus winning by accident.

96:07

How would you design a baseline where it's truly better?

96:10

This is where in RL people use things like TD learning to better approximate the quality

96:16

function, the Q that we mentioned earlier. You can try to subtract that from your return.

96:23

Ideally, what you really want to do in RL is push up the actions that make you better

96:30

than average and push down the actions that make you worse than average. They call this

96:36

advantage. There are multiple ways to compute it. I highly recommend John Schulman's "Generalized

96:41

Advantage Estimation" paper as a good treatment on how to think about various ways to compute it.

96:49

At the end of the day, you want to reduce variance by trying to make this smaller,

96:54

so it doesn't magnify the variance of this one. That makes sense. This requires you to have

96:59

a very good estimate of what average performance from a state would look like.

97:03

This gets us back to the value function thing we were talking about earlier.

97:08

Keep in mind that this model-free RL setting is trying to solve a credit assignment problem

97:14

where you don't know which actions were actually good and which ones were bad.

97:18

Monte Carlo tree search is doing something very fundamentally different.

97:20

It's not trying to do credit assignment on wins.

97:24

It's trying to improve the label for any given action you took.

97:30

We can actually think about a completely different algorithm called neural fictitious self-play,

97:34

which was used to great effect in systems like AlphaStar and OpenAI's Dota.

97:41

Let me talk a little bit about how you can unify some of these RL ideas in the model-free setting

97:47

as well as the self-play setting. What happens if you don't have

97:53

the ability to easily search a tree? In Go, it's a perfectly observable game.

97:58

You can easily construct a pretty deep tree that completely captures the game state.

98:02

In a game like StarCraft where you don't have complete control over the binary,

98:06

it's a little bit hard to do this, and I'm not even sure if it's a deterministic game.

98:11

That makes this difficult from a data structures perspective.

98:16

What is done instead is that the basic idea of supervising your

98:22

actions with a better teacher is still there. We’re going to talk a little bit about how neural

98:30

fictitious self-play works. It's the same idea. We're going to come up with better labels for

98:44

each of the actions we took, just like in MCTS. But how do we derive the better labels? In MCTS,

98:57

we perform search. Assuming we have a good value function, the search will give us

99:02

a better result than our initial guess. In a game where you can't easily simulate

99:08

a search process, what they do instead is train what's known as a best response policy.

99:18

You fix your opponent. Let's say you're currently training Πₐ against a strong opponent, Πb.

99:28

In StarCraft, maybe these are the Zergs and you're playing Protoss or something.

99:34

You fix your opponent, and you treat this as a classic model-free RL algorithm,

99:39

where your goal is just to beat this guy. Here you use your standard TD-learning-style

99:44

tricks, or use PPO or any model-free RL algorithm to

99:49

try to hill climb to winning against this player. You have a reward function where the return is one

100:02

if it wins against Πb, and zero otherwise. This is no longer a self-play problem.

100:09

It's just a fixed opponent, and you're trying to maximize a score against them.

100:19

You have a fixed environment where all you care about is beating this guy.

100:25

Once you have a good policy that you train with your favorite model-free RL algorithm—PPO

100:31

or SAC or V-MPO—you now have a good policy that gives you a good label for what this

100:40

one should do when playing against that player. When you train multiple best response policies,

100:46

you can basically distill the RL algorithms into the labels for a given opponent.

100:52

You might have a best response policy against πb, and then maybe you have

100:55

a league of opponents like πb, πc, πd. You're going to take the best response

101:04

policy that you train against each of these fixed opponents, and supervise

101:09

them with the label that this one would provide. This is almost like a proxy for your MCTS teacher.

101:16

Instead of an MCTS teacher, you use a model-free RL algorithm to find the best search action

101:22

that you could do to beat your opponent. Then you're distilling the policy here

101:29

into what's known as a mixed strategy, where it's trying to average across all

101:34

possible opponents you could play against. This is what gives you something that

101:37

can do no worse than an averagely selected opponent from the league.

101:43

This gets around the problem of having to derive a teaching signal from MCTS, but fundamentally

101:48

it's still about relabeling your states with better actions so that they improve your policy.

101:54

Just to make sure I understand, if you win a game against this other policy,

101:59

you reinforce all the actions on that trajectory. Yes. Here you can use a number of algorithms like

102:05

PPO, V-MPO, or even Q-learning if you want. The specific algorithm here is usually a

102:15

model-free thing because you don't have search, but there's an interesting connection between

102:19

MCTS and Q-learning that I want to bring up. In MCTS, you have a tree, and through the

102:30

resolution of your value function at the approximate leaves of the tree,

102:36

you can back up through many sequences and obtain some sort of mean value estimate.

102:44

Your Q is derived from the average of a bunch of simulations.

102:49

In model-free algorithms, there's often a component of estimating a Q value.

102:58

Q values are often learned through TD learning, although in PPO,

103:01

the way they do advantage estimation is not necessarily through a Bellman backup.

103:05

In Q-learning, there's this very cool trick where Q(s,a) is backed up as r plus some discount factor

103:19

times the max over a, Q of your next step. Intuitively, if you have an MDP and this is

103:33

terminal, it's saying that the best action you can take at this state is equal to the

103:46

reward you get for taking this action, plus the best that you can do at the next state.

103:52

There's a recursive, dynamic-programming property of MDPs, and you can train neural

103:57

networks to basically enforce this consistency. You can say, "Well, once I know the Q value of

104:04

this action, I can then use that to compute something about the Q value for the next."

104:09

Earlier I was like, "Hey, why are we training policy?

104:11

Why don't we just train the value alone?" That is what this is.

104:15

This is an algorithm for recovering value estimates of intermediate steps when you

104:20

don't have the ability to do forward search. You must collect a trajectory first of n steps

104:27

before you're able to do this trick. But the intuition is kind of the same.

104:32

Knowing something about the Q value here can tell you something about the Q value here.

104:36

And indeed, you can recover a policy from a Q value.

104:40

You don't need to explicitly model the policy distribution.

104:43

You can recover the policy distribution by doing argmax over your Q values.

104:50

Q-learning, or approximate dynamic programming, propagates what you

104:56

know about the future Qs backward like this. You can see that there's a similar structure

105:01

going on here, where in one case you're planning over trajectories your agent hasn't been to yet,

105:07

whereas in the other you're planning over trajectories your agent has visited.

105:13

Importantly, why was Q-learning a big deal? It's because historically we just haven't had the

105:19

ability to do search on fairly high-dimensional problems like robotics or whatever.

105:24

For a long time, we made the assumption that if we can't model the dynamics with a world model,

105:30

we're going to instead just collect trajectories and then

105:32

plan with respect to the only number that really matters, which is reward.

105:36

This is very interesting. To unify this with our discussion of LLMs, you don't have Q values,

105:44

but you're doing this backwards learning where you find the trajectories that pass

105:50

some unit test in some coding environment and then reinforce those trajectories.

105:54

There's a huge difference between that and this forward approach with MCTS.

105:59

The reason it's much more preferable to do MCTS is because you can do it per move and make each move

106:05

better, rather than having to learn per trajectory and hope, as Karpathy said, to learn this…

106:11

Through a straw. Right, you get supervision through a straw.

106:14

You basically just upgrade all the tokens in a trajectory that might or might not have been

106:17

relevant to getting the answer right. The reason you can do this much more

106:21

sample-efficient, favorable thing with Go is that because MCTS works in Go, you basically

106:30

know that if you just do the search locally here—and this search is truncated at the end

106:36

by a value function that works even if you haven't unfolded your whole trajectory—you can just say,

106:43

"This is my new policy," and improve in a more iterative, local way, rather than

106:50

having to unfold all these trajectories. There was some research from Google in

106:57

2023 or 2024 where they tried to apply tree structures to reasoning.

107:04

The jury is still out as to whether this can ever work.

107:11

We probably will see a revisiting of this idea of forward search in the future.

107:18

But there are two things that make MCTS very simple for Go. Value estimation is concrete.

107:24

You can determine it for real, and then you can use it to truncate depth, as you said. The breadth

107:31

is also determined. What's critical is that the action selection algorithm, where you iteratively

107:37

visit and grow the tree, is well suited for the size and depth of problem that Go is.

107:45

But for something like LLM reasoning, PUCT might not be a good enough heuristic.

107:49

It might be too greedy with local tokens, and it might only give you obvious thoughts that are

107:57

correct but don't really solve your final problem. I would say the jury is probably still out

108:02

on what the final instantiation of reasoning for LLMs will look like.

108:06

I wouldn't rule out that this stuff could come back, but it's a bit hard.

108:09

Don't LLMs natively learn to do MCTS, where they'll try an approach and be like, "Oh,

108:16

that doesn't work. Let's back up. Let's try this other thing," and then go in the

108:20

direction that proves to be more fruitful? Certainly, LLMs manage to do something that

108:25

looks like real human reasoning without having to do an explicit tree structure.

108:30

That being said, I think the idea of doing forward search and simulation to get a better sense of

108:35

what is valuable might make a comeback, even if not in exactly the same instantiation as AlphaGo.

108:43

Just to make sure I understand the crux of it: the breadth comes from the number of legal actions

108:50

being wider, and the depth comes from not being able to train a value function as easily, because…

108:57

Here's an example where LLMs break down. The CPUCT rule involves the √N over 1+Nₐ.

109:03

In an LLM, you're most likely never going to sample the same child more than once.

109:10

If you have multiple steps of thinking, because language is so broad and open-ended, a

109:17

discrete set of actions is not really an appropriate choice for an LLM.

109:21

Even though they're discrete tokens, it's just such a large number that this type

109:25

of exploration heuristic is probably not the right thing to guide how to search down a tree.

109:31

I guess the crux comes down to the fact that in Go, you know that MCTS is almost certainly

109:38

better than your current policy, even though you haven't explored the end of any trajectory.

109:45

In normal reasoning for LLMs or robotics, there's no way to just locally evaluate

109:50

and improve your next move in a way that's independent of actually solving the problem.

109:58

"No way" is a strong word. Lots of people have thought

110:00

about how to apply MCTS or its successors like MuZero to continuous control spaces,

110:06

and I'm sure very cool research work is still ongoing to try to crack that problem.

110:11

But yes, the apparent challenge right now is that most problems in much higher-dimensional

110:19

action spaces, or something combinatorially much bigger like language, don't seem as amenable

110:25

to the discrete action-selection heuristics and game-evaluation type stuff that Go has.

110:33

That's not to say the idea of thinking into the future along

110:37

multiple parallel tracks might not give you some information about which way to search.

110:41

If you think about mathematics, it often occupies more of a logical search procedure where you can

110:48

back up and see which paths seem good or not. There's more of a rigid structure there,

110:53

whereas in a business negotiation it's less of a tree and maybe something a bit different.

111:03

We're now seated, so I can ask you some more questions about

111:05

AlphaGo and AI research more generally. In 2021, Andy Jones had a paper called

111:12

"Scaling Scaling Laws with Board Games", and he basically anticipated inference compute,

111:17

or inference scaling, by showing that you can trade off test-time compute and training compute.

111:22

That is to say, you can spend more compute searching through MCTS, and if you do that,

111:28

you can get the equivalent performance to having spent more time training the model.

111:34

If you see this pattern, you might think that with LLMs you might do something similar in the future,

111:38

and in fact that's what ended up happening. What is a fun exploration one could do now to

111:46

explore other axes of scaling in toy settings, which will be important to understanding what

111:52

AI development might be like in a few years? Test-time scaling and reasoning, and how they

112:01

interact with model size, are quite profound when it comes to how much needs to actually be done as

112:08

explicit search versus how much can be packed into the forward pass of a neural network.

112:14

How does a forward pass of a neural network learn to do something that should be sequential and

112:20

recursive? That's quite interesting. The Andy Jones "Scaling Scaling Laws with Board Games"

112:24

paper is quite cool. There was another

112:26

really nice result from that paper. Not only can you predict scaling laws of

112:32

the LLM variety—where as you increase parameters you can decrease the amount of compute for search

112:38

or vice versa—but he also showed that you can predict how much compute is needed to solve a

112:45

larger version of the board game. For example with Go, which can scale from

112:51

3x3 to an infinitely sized Go board, you might be able to revisit this question and

112:57

try to reproduce whether this shows up. I actually started this project with the

113:02

motivation of asking: does the "Bitter Lesson", or our knowledge of scaling laws, allow us to

113:06

execute a lot better on a compute-optimal Go bot? Can we build a strong Go bot without all of the

113:12

KataGo tricks, just by really focusing on the Bitter Lesson and scaling laws?

113:16

I have not been successful so far, but usually when you want scaling laws to work, you want

113:23

to be in the regime where the recipe already works and the data sets are good, rather than

113:28

trying to figure out how to do scaling while also trying to figure out what the right data sets are.

113:34

The scientific understanding component in research often follows a step where

113:38

you get something to work first. Then you use that system to collect

113:43

data that then helps you build a mental model of how things work, such as scaling laws.

113:47

Usually, if you want to build a strong Go bot using scaling laws, you actually have to make

113:51

a strong Go bot first and then use the scaling laws to extrapolate a bit further into the future.

113:56

Say more. Just so I understand, are you saying scaling laws did not work?

114:01

There was no scaling-laws pattern that you could see in your Go bot?

114:05

A mistake I made initially, when I had some bugs around how MCTS labeling was working,

114:09

was that I would collect a bunch of data with an expert policy and then treat it

114:14

as a supervised learning problem and try to identify scaling laws with expert data sets.

114:19

You can indeed plot things that look like this, but if you're in a regime where

114:24

your policy is not working well, you might just be studying scaling laws on bad data.

114:29

One important implementation detail is that if you want to study a scaling-laws problem,

114:33

you have to have a problem for which the data is good, the architecture is good, and there

114:38

are no bugs, and then you solve it from there. Ex ante, I wasn't able to apply scaling laws

114:44

to direct what to look at until I had the rest of the system working. This sounds obvious to

114:51

researchers. Of course you want a working, bug-free system before you study scaling.

114:55

But as advice for practitioners, on where I tripped up when I started this project,

115:00

you don't necessarily want to jump into the science of studying your man-made

115:04

artifact before your man-made artifact is interesting enough to be studied.

115:08

Speaking of compute, you can look at these charts of compute used to train the best AI model in the

115:16

world over time going back 10 years. It's a very smooth line in log space

115:22

that grows exponentially year over year. Except there's this huge aberration, and that

115:27

aberration is AlphaGo Zero, which was trained on way more compute than any other AI model at

115:33

the time. It was 3E23 FLOPS. It's comparable to a frontier LLM, orders of magnitude off, but still.

115:44

The question is, especially with you being able to get something

115:47

off the ground, did you train it on your own? I got a donation from Prime

115:50

Intellect for about $10K. I spent maybe the first $4K doing exploratory

115:58

research, and then about $3K on the final run. Some of it remained for serving the model.

116:06

Is your sense that they did a bad job training it, if you can do it in $10K now?

116:11

The compute required to be the first to do something is always much larger

116:15

than the compute it takes to catch up. It's the same story playing out in LLMs.

116:18

Once someone else has done it, you can use tricks like distillation.

116:22

You can use all sorts of crutches to bootstrap your way to success.

116:27

With my own bot that I've hosted online, I actually used best-response training

116:32

against the KataGo models to get a strong level of performance.

116:37

As of the time of recording, I'm validating whether I can do that first step, which is

116:42

to do the tabula rasa play. Importantly for research,

116:45

you often want to start from a good init. The simple thing I did first was train

116:48

best-response agents against KataGo. The AlphaZero team did not have any

116:54

policy they could train against, because they were trying to do everything tabula rasa.

116:58

Being the first to do it means you're prioritizing getting the thing working rather than the most

117:02

compute-efficient possible implementation. This plays out in robotics as well.

117:07

If you look at the frontier of large models trained for robotics, the scatter plot is

117:12

all over the place, and there isn't a very clean line the way there is for frontier LLMs.

117:17

That's because the folks training these models often are not at the

117:22

scale where every FLOP counts and they need to squeeze out the performance of every single

117:27

FLOP as the deciding factor in pre-training. Instead, their focus is more, "We want a certain

117:33

capability to show up, so we optimize the training setup to make it easy to derive that capability."

117:39

Once you have that capability… Invariably if you scale up the compute, you are forced to make it

117:44

compute-efficient, because this is hundreds of millions of dollars we're talking about.

117:48

But in the past, when compute for experiments was more plentiful—or not accounted for in a way the

117:57

researcher was really responsible for—you end up with people optimizing for things besides being on

118:01

the compute-optimal Pareto frontier. I see, like speed or something?

118:04

Yeah, time to result, or just getting it to work. The first AlphaGo probably had lots of compute,

118:10

and they didn't need to worry too much about making it the most compute-optimal thing.

118:15

How much of the improvements to compute efficiency are methods that did not exist

118:19

as of 2017 versus things which they could have done in 2017 but didn't?

118:24

Great question. Going into this project, I knew in the back of my mind that things

118:28

always get easier to do over time. I wanted to see where Go was at,

118:32

given that there hadn't seemed to be any major open-source strong bot after KataGo in 2020.

118:39

Reading the KataGo paper, there were a lot of clever ideas.

118:42

I was thinking, "Let’s see if the Bitter Lesson had happened, where a lot of these tricks just go

118:48

away because Nvidia made faster GPUs. Roughly, where are we on that?"

118:53

Again, this is not a peer-reviewed claim. It's just my preliminary vibe guess based on

118:59

what I’ve seen with my own experiments. It seems like architecture choices

119:05

don't matter that much. Transformer versus ResNet…

119:09

We're at the speed of GPU where the size of the model is not so big that this really matters.

119:15

You can simplify the setup quite a lot. Instead of doing a distributed asynchronous

119:20

RL setup with replay buffers and pushers and collectors, you can do a dumb synchronous thing

119:25

where you collect, train a supervised learning model, and then collect again.

119:30

There are opportunities to simplify infrastructure.

119:33

Nvidia GPUs have indeed gotten faster. Whereas KataGo was trained on V100s,

119:37

you can train on half the number of desktop Blackwell GPUs and it still works.

119:45

Some of the auxiliary supervision objectives that KataGo developed aren't really necessary

119:50

if you have a strong initialization. If you're initializing best-response

119:55

training against KataGo itself, your own model needs none of the tricks that KataGo needs.

120:00

The core thing is getting as quickly as possible to some strong opponents.

120:05

That matters a lot more than the specific architectural innovations.

120:08

But there are still some nice compute multipliers.

120:11

I found that training on 9x9 boards was very nice for resolving endgame value functions.

120:16

If you can co-train that on an architecture that can transfer between 9x9 and 19x19,

120:21

you can really cut down the warm-start time to learn from scratch.

120:25

AlphaGo Zero's plot showed the first 30 hours or so spent basically catching up

120:29

to the supervised learning baseline. You can cut down that time a lot

120:34

by pre-training on a small board and then warm-starting that into your 19x19 board play.

120:42

There was some other stuff, like varying the number of sims between episodes.

120:46

This turns out to be not that sensitive. You can fix it or increase it. It doesn't

120:52

matter too much. Anyway, it's nice from a scientific perspective, revisiting an old

120:57

paper and seeing what really matters. This is a tangential question,

121:00

but why is it okay to have a buffer in AlphaGo? Every time I talk to an AI researcher, they're

121:05

telling me about how bad it is to be off-policy. But the way a naive implementation of AlphaGo Zero

121:10

would work is that most of the moves in a given backward step, or in a batch of backward steps,

121:18

would not be among the ones made by the most recently trained model. So why is that okay?

121:26

Great question. This gets into the fundamental off-policy versus on-policy

121:32

reinforcement learning questions. As you'll recall, in MCTS, you take

121:37

actions that you took, and you relabel them to take different actions on the same states.

121:44

The off-policy part comes in where, what if you're relabeling states that your new policy

121:49

would never visit? What's the point? You're wasting capacity. In the extreme limit, imagine

121:54

the distribution of states in your training buffer are all states you would never visit.

121:58

Then you're supervising them to take good actions on states you would never achieve.

122:05

Therefore, your policy can get really bad. This is where off-policy can really hurt AlphaGo.

122:12

However, if you interpret this from the DAgger perspective—which is basically a way to correct

122:17

yourself back to the optimal trajectory given some data—what you want in an algorithm like

122:24

this is to have mostly states that you would visit, but then a small or reasonable percentage

122:30

of states in this high-dimensional tube around your optimal trajectories.

122:36

Any of those states are given a supervision target to funnel you back into your optimal trajectory.

122:43

Maybe I can just draw real quickly here. In a DAgger-style setup, your optimal

122:50

training data distribution is your optimal states and actions.

122:56

You want to be in this state, you want to be in this state, you want to be

122:59

in this state, and then you win here. These are your optimal policy actions.

123:05

These are the things you definitely want to train on.

123:08

But to make it robust to disturbances, you want to make sure that if you happen to

123:12

drift off into some other states, you can funnel yourself back in.

123:17

But why isn't this a fully general argument for off-policy training?

123:20

This is actually why you want to do off-policy training sometimes.

123:24

You don't want a compounding error where, if you make a mistake, you don't have the data

123:28

for how to return to your optimal distribution. Optimal control doesn't really say too much about

123:37

how to not accidentally get here, because it's making the assumption that once you learn the

123:42

policy, you're going to get to here. But in applications like robotics,

123:47

a gust of wind blows you slightly off and now you need to correct.

123:51

Or the friction on one of your tires is slightly lower than the other wheel, and now your car is

123:55

drifting and you have to correct it. These kinds of things often happen in

123:59

more real environments. There was a funny quote about chess and Go.

124:07

The problem with Go and chess is that the other player is always trying to do some shit.

124:12

Things can drift off, and you always want to be able to correct back to your winning condition.

124:19

Your replay buffer really should have the states your policy would visit, plus some

124:25

distribution of states you might drift to, and then how to return back to your optimal states.

124:31

Now, if you take this to the extreme and you say, "We don't have any of this data,

124:41

and we're just going to be labeling with MCTS states that are so far away from our optimal

124:49

behavior, like this bag of states over here…" Well, now each of them gets an MCTS label,

124:58

and your policy learns how to take the best possible action here, but you never get here.

125:02

You're training your model on states you would never reach.

125:09

This is where off-policy can really hurt. As part of this project, I did try an experiment

125:15

where I took a bunch of trajectories, and to try to saturate the GPU as much as possible,

125:21

I took random states from the dataset and reran MCTS on just those states.

125:28

Instead of playing a whole game where I'm doing MCTS on every move, I ignore the causality of

125:33

moves and pick random board states, and I label those with my current network.

125:38

I might revisit old states I've labeled before and relabel them again with my current network.

125:43

In practice, this actually does work. You can take some states that are reasonable and

125:49

constantly be relabeling them while training. This starts to converge on a very

125:56

robotics-like setup, which is very common. You have your dataset of trajectories, and then

126:02

you have something like a replay buffer pusher. These are off-policy offline trajectories. Your

126:13

replay buffer pusher pushes transition tuples to the replay buffer, and then you have a job that's

126:27

continuously replanning what the best action you should have done instead of the action you took.

126:37

In robotics, it's very common to minimize TD error.

126:41

Your Bellman updater is constantly pulling things from here and trying to satisfy Q(s,a).

127:01

From here you have your trainer, which is trying to fit the Q to the Qᵗᵃʳᵍᵉᵗ.

127:12

You can think about this as a sort of planner. You visit old states you've been to,

127:17

and you take your current model and rethink. What could I have done better if I visited this?

127:25

This is how off-policy robotic learning systems are usually trained.

127:29

These days there's a simpler recipe, but in the Google QT-Opt days, we did things like this.

127:35

So what is the trainer? The trainer is where you try

127:39

to minimize Q(s,a) and Qᵗᵃʳᵍᵉᵗ. Can you explain the whole setup

127:47

again at a high level? You have your off-policy

127:50

data that came from various policies. You're constantly pushing transitions

127:54

you saw before to a replay buffer. Then you've got this thing called

127:58

a Bellman updater, which basically replans: instead of this action, what action should I

128:02

have taken at s to have a better value? The way you enforce that is by trying

128:07

to minimize the TD error. Given this, you have s’. You compute Q(s’,a) and you find

128:16

the action that should go with s’ that makes this Q value as high as possible.

128:21

Then you add that to the reward here, and that gives you your actual target.

128:24

For this current s and a, your Qᵗᵃʳᵍᵉᵗ is this. Now you send back the Qᵗᵃʳᵍᵉᵗ to this transition.

128:36

With this tuple, you pair a Qᵗᵃʳᵍᵉᵗ. Then on the trainer side, you simply

128:43

use supervised learning and minimize your current network's Q(s, a) with its target.

128:48

Got it. So in the background you're basically thinking through how valuable

128:54

all these actions actually were. Yeah. In a more optimal policy

128:57

where you're trying to maximize this, what is the Qᵗᵃʳᵍᵉᵗ of this transition?

129:01

It's sort of like daydreaming. Exactly. You can think about

129:03

it like you're going back in hindsight. Given what I've seen in the historical buffer,

129:11

was there a better action I could have taken? The connection to Go here—I tried and it was

129:16

moderately successful, but too complex to open source—was replacing this with an MCTS relabeler.

129:27

Instead of doing this target network computation, you run MCTS on your transition.

129:34

In this case, you have your state, your action, and whether you won or not at the game.

129:40

Actually, you can just toss those two. You don’t care about these ones.

129:42

You just take your state and plan MCTS to get your best policy, Π, on your current network.

129:54

Not the network that took this action, but your current best policy network.

130:00

You rerun your search offline on these transitions.

130:04

If these are transitions your policy can get to, this actually acts as

130:08

a very nice stabilizing effect. Another benefit is that you can

130:11

fully saturate your GPU better because you're not blocking on the Go game to give you board states.

130:18

You simply search across all board states at any depth in parallel.

130:24

Here the trainer would just predict the MCTS label as best as possible.

130:28

This kind of works, and it is quite relevant in robotics where you have a lot of offline

130:33

data and you can't simulate things like MCTS. But in practice, it does run into the problem

130:38

that if the current model is looking at states it would never reach, it's wasting capacity.

130:45

You have to be a little bit careful here. Much of RL has converged to a much more

130:51

on-policy setup where they don't really try to directly train on off-policy data.

130:55

At best, they use off-policy data as a way to reduce variance,

130:58

but not to directly influence the objective. Why have they converged to that?

131:03

It's just more stable. You might use the off-policy Q as a way to do advantage

131:10

computation, like Q minus the sum of Q. If there are N actions, this is your value,

131:26

and these are your current Q values. Your advantage for that action is

131:29

the average value minus your current one. People can try to estimate Q in an off-policy

131:34

way and then just use advantage here. If there's a problem in these dynamics,

131:40

it doesn't blow up your loss as much. In robotics, there's a convergence

131:45

toward using off-policy data to shape your rewards but not actually be directly here.

131:51

I'm reminded of our earlier conversation about why MCTS is so favorable compared to

131:57

the REINFORCE or policy gradient thing LLMs do. This might be totally wrong, but I wrote a blog

132:03

post a few months ago about how policy gradient RL is even more inefficient than you might think.

132:10

The inefficiency one thinks about naively is the fact that you have to roll out a whole trajectory

132:16

to get any learning signal at all. As these trajectories become longer

132:20

and longer—as an agent has to do two days' worth of work to figure out if

132:31

it did a project correctly, rather than just completing the next word in a sentence—the

132:34

amount of information per FLOP decreases. You have to unroll two days' worth of thinking to

132:41

see if you implemented a feature correctly, so the amount of samples per FLOP has been decreasing.

132:50

You're trying to maximize as you’re learning, bits per FLOP.

132:58

You can think of bits per FLOP as samples per FLOP times bits per sample.

133:14

What I mentioned a second ago is that the samples per FLOP go down as RL becomes more long-horizon.

133:21

But this kind of naive RL is also terrible from a bits-per-sample perspective,

133:28

at least compared to supervised learning. Early on in training, let's say you have a

133:34

vocabulary size for an LLM that is 100K long, so there are 100K

133:39

possible tokens that one could answer. You have a totally untrained model and

133:43

a prompt like, "The sky is…" With supervised learning, the model would have some probability

133:56

distribution over all the things it could say. There's a label that says the term here is "blue,"

134:01

and it would learn through cross-entropy loss exactly how far its distribution

134:05

is from correctly saying "blue." If you were doing this through RL,

134:12

the model would try, "The sky is halycon." Nope, that's wrong. "The sky is told." Nope,

134:20

that's wrong. This is a totally untrained model. You would have to do this on the order of 100K

134:25

times just to stumble on "blue" and get some learning signal.

134:31

In the supervised learning regime, you have your distribution of probabilities,

134:35

you get told that it's "blue," and you figure out how far off you were.

134:39

The amount you learn is a function of your pass rate.

134:44

The further away you are from "blue," the more you learn to

134:47

go toward "blue" using cross-entropy loss. You can think of it as like your pass rate,

134:52

your prior probability of having said "blue." As a function of that, in supervised learning,

134:58

through cross-entropy loss, you would learn negative log(p), p being pass rate,

135:05

bits, once you get this label. Whereas in RL, if you're just randomly

135:12

guessing and seeing if it works, that's basically just the entropy of a binary random variable.

135:25

What's also tough here is that the distribution you're sampling

135:28

under is your policy's distribution. If your policy has no chance of sampling

135:34

"blue," then you will never get a signal. Exactly. That's modeled by the fact that your

135:41

probability of sampling "blue" is extremely low. If you sample it, you do learn as much as

135:46

you would have in supervised learning. In all other cases—99.99% of the time in

135:51

an untrained model—you're learning incredibly little from seeing that

135:57

"halycon" or "told" is not the correct word. That's what happens most of the time. You

136:02

learn very little. If you put your pass rate on the X-axis and the bits you're learning

136:17

from a sample on the Y-axis, with 0%, 50%, and 100%, so at the end of training you're here.

136:28

If you have supervised learning, the negative log pass rate would look something like this.

136:36

The entropy of a binary random variable would look like this,

136:46

depending on whether you're doing nats or bits. If you do bits it's one here at the peak.

136:52

This is like a coin flip. You learn the most from a coin

136:54

flip. This is supervised learning. This is RL. However, the problem is you spend most of training

137:06

in this regime, in the low pass rate regime. How fast you're learning is a function of how

137:14

many bits per sample you're getting, and you're getting very little signal here.

137:20

If you chart the pass rate on a log scale—where at the beginning of training with a vocab size of

137:29

100K the pass rate is 1/100,000, then 1/10,000, 1/1,000, 1/100—what this graph looks like here,

137:45

supervised learning would look like this, and then RL would look like that.

137:59

And arguably you spend all your time here, potentially never even getting a single success.

138:06

It's a depressing plot in the sense that once you're here,

138:10

it's not at all obvious how you get to there. Once you're here you have something,

138:14

but you actually spend all the time here in many RL problems.

138:19

There's a question of how you initialize so you're at least at a non-zero pass rate.

138:25

One more thing I'd like to add about bits per sample that's very relevant to

138:30

any machine learning problem is that there's a connection to soft targets and distillation.

138:37

If you have access to the logits, not just the one-hot token answer, if you

138:44

have access to the soft targets the entropy of this distribution is far higher than the one-hot.

138:53

There's way more information in bits per sample in a soft label.

138:58

That's why distillation is so effective per sample.

139:01

It's giving you way more information per sample. I wonder what the equation would be,

139:06

but obviously it's… It would just be the

139:07

entropy of this distribution. The entropy of this is zero.

139:11

The entropy of this is the entropy equation. This is also why AlphaGo is quite beautiful.

139:16

In AlphaGo, you don't train the policy network to imitate the MCTS action.

139:22

You train it to imitate the MCTS distribution. Both of these are valid, and if you wanted to do

139:27

a scientific experiment of how important this soft label is, dark knowledge distillation,

139:33

you can run an experiment where you retrain the policy network on the action

139:37

MCTS selected rather than the soft target. Earlier I was stumbling around. Intuitively,

139:44

why is this ability to do iterative search—where you don't necessarily need to be able to win the

139:54

game in the beginning, you just need to be able to improve your current

139:56

policy—so powerful a capability in learning compared to how LLMs currently learn RL?

140:04

It's exactly this thing of considering your pass rate of the entire trajectory.

140:10

I just don't know a formal way to think about this.

140:12

Maybe you should help me out here. Why is AlphaGo an elegant RL algorithm?

140:18

The major reason is that you never have to initialize at a zero percent success

140:24

rate and solve the exploration problem of how to get to a non-zero success rate.

140:28

This is what allows you to hill-climb this beautiful supervised learning signal.

140:32

If you look at the actual implementation of AlphaGo, every step of the way,

140:36

there's actually no TD error learning or dynamic programming, at least explicitly.

140:43

It's just supervised learning on a value classification as

140:47

well as a policy KL minimization. It's just a supervised learning problem

140:53

on improved labels. The training is very stable. You can train as big of a network as you want.

140:58

You can retrain this on the data set. Everything will just go stably. The infrastructure

141:02

is very simple to implement as well. You don't need a complex distributed

141:05

system to keep everything on policy. At the end of the day, you're just saying,

141:10

"I have some improved labels. Let's retrain my supervised

141:13

model on these targets." You're always in this

141:16

beautiful regime where you're just trying to improve the policy, rather than escape local

141:23

minima where every signal is flat all around you. One way to draw the curve is if you draw the win

141:29

rate of an MCTS policy versus the raw network… Let's say this dotted line is the raw network.

141:35

The MCTS policy looks like this. Every step of the way,

141:40

this supervision signal is very clean. You're never in a situation where the

141:46

MCTS is giving you no signal, unless your MCTS distribution converges to exactly

141:52

what your policy network predicts. That's a great way to explain it.

142:02

Maybe we sit down and I ask some questions about automated research.

142:06

One thing I really wanted to talk to you about is that you did a bunch

142:09

of the research for this project through this automated LLM coding assistant loop.

142:21

There's an idea that if you fully automated AI research, you could have some sort of singularity.

142:25

Obviously we're not there yet, but to the extent that we have early indications of what

142:28

this process might look like, I'm curious about your observations about what the AI is good at,

142:35

what it's not good at, what you think about this scenario's likelihood eventually, and what

142:41

thoughts you have about this in general. I think automated scientific research is

142:45

one of the most exciting skills that the frontier labs are developing right now.

142:50

It's important for everyone who's doing any kind of research to get a good intuition of what it

142:55

can do now and what it can't, and how the science process might work in the future once we have AIs

143:00

automating a lot of this investigation. In brief, I mostly used Opus 4.6

143:06

and 4.7 while working on this. What works is that the models can do

143:13

a very good job of hyperparameter optimization. In the past, people would come up with a search

143:18

space of hyperparameters like learning rate, weight decay,

143:21

and maybe how many layers are in your network. They would do a grid search or a Bayesian

143:26

hyperparameter optimization approach, and it would find some tuned parameters.

143:31

The really cool thing that automated coding can do now is search a much

143:38

more open-ended set of problems. It can say, "I've identified that

143:43

the gradients are small in this layer, so let me change it up here.

143:45

Let me rewrite the code so the data loader has a new augmentation I came up with.

143:50

Let's try to find the best way to fit the constraints of the optimization problem."

143:55

You end up with this much more flexible, high-level, almost grad-student-like

144:00

ability to just grind a performance metric. This can squeeze out quite a lot of performance.

144:06

On a fixed data set with a fixed time budget, you can improve perplexity by quite a lot on a

144:11

classification problem like LLMs or Go. It is also fantastic now at basically

144:18

executing any experiment. I have a Claude Skill that

144:21

I wrote called Experiment where I give it a description of what I want it to plot.

144:26

I just describe, "Here's the x-axis I want, here's the y-axis. Answer this question for

144:30

me." It'll run off and do all the experiments, compile the plot, make a report, and suggest

144:35

what might have caused it and so forth. That's what works quite well today,

144:40

and I think we can expect these abilities to get better in the future.

144:42

But it's also useful to know what it's not doing so well today.

144:47

In the blog version of this tutorial, I have a plot of all the experiments I did grouped in a

144:54

tree, where every node represents a failed, successful, or mixed experimental result.

145:00

From there, it branches off into a child representing the follow-on experiment.

145:03

Occasionally, I'll rabbit-hole down a track like off-policy MCTS relabeling, do a few experiments,

145:09

and then realize it's probably not worth it. So then I'll jump to a completely different track.

145:13

I call these things rows. What I find is that the current closed models the public can access today

145:22

don't seem to be that great at selecting what the next experiment should be in a given track.

145:27

They don't seem to be able to step back and do the lateral thinking of, "Wait a minute,

145:32

this track doesn't really make sense. Let's go back to first principles and

145:37

think about what the bottleneck might be, or what we are trying to achieve."

145:41

Often I had to catch infra bugs myself by prompting the right question to Claude to

145:46

investigate what's causing the discrepancy, and then it'll answer the question.

145:50

With Mythos-class models or Mythos++ models coming online, maybe this just completely changes and

145:56

these problems fall to improved scaling. But at the same time, I think there's a

146:02

rich opportunity to develop RL environments that might incentivize this kind of lateral thinking.

146:08

One of the motivations for setting up this Go environment was that Go captures a lot

146:13

of very interesting research problems, often overlapping with LLMs or robotics.

146:18

Yet it's very quick to verify. The outer loop is ultimately:

146:22

does the agent do what I think it does? You can check the outcome of a Go game

146:26

quite easily. The inner loop

146:28

involves all this research engineering around distributed systems, predicting whether an

146:33

idea is going to work or not, and predicting the difference a particular modification to

146:38

your training algorithm might make. I think there's a rich library of

146:42

subtasks and sub-environments that you can train an automated scientist to work on,

146:46

with Go as a sort of outer verification loop. Once you acquire these skills, maybe you can

146:51

apply them to other domains like biosciences or robotics.

146:55

Or automating AI research. Or automating AI research.

146:58

Which is the real crux, the scary/incredible thing of making AIs make future versions of AIs.

147:06

You're suggesting the outer loop here could just be your win rate against KataGo?

147:10

That's one of them. I think there are a lot of deeper questions that one could tackle.

147:15

For example, let's say you have an idea on how to improve a scaling-law compute multiplier.

147:23

The outcome isn't necessarily "achieve the best Go bot ever".

147:28

The outcome might just be, "Can I predict what the win rate of my Go bot will be?"

147:33

Or, "Can I predict the scaling-law plots that emerge from my idea?"

147:36

But then you can verify that you haven't reward-hacked anything by using a very

147:39

verifiable game like Go on the outer loop. I think there are a couple of interesting

147:43

follow-on questions. There are questions

147:45

on the inner loop and the outer loop. On the inner loop, there's a question

147:49

of how locally verifiable any modification you might make is.

147:54

That is to say, would you know whether some idea you try out is

147:58

actually an improvement or a degradation? Would you know if something isn't working

148:02

as a result of a bug, or is it the result of the idea itself being wrong?

148:08

Ilya was talking about how one of the things he thinks makes him a good researcher is that he has

148:21

a strong belief in what the correct idea is. He's able to persevere through bugs and know

148:30

which things are bugs versus mistakes in the fundamental idea, based on his high-level belief

148:34

that "this idea should work, so therefore there has to be a bug", versus the other way around.

148:38

Why don't we start with that question? How locally verifiable are things

148:42

which are good ideas? As in the case of the

148:46

success story for deep learning, you can think about this as a decades-long idea

148:50

that took a lot of faith to get it to work. This presents a very challenging long-horizon

148:57

RL problem where every step of the way you have a committee telling you that this is a bad idea,

149:03

and then ultimately you break through. How do you design RL environments that

149:09

maybe give you some feedback earlier? I think this is a very tough open

149:15

question that I don't have an answer to. Ultimately, to play a very strong Go bot,

149:20

you probably did need to discover deep learning. Having a challenging game that cannot be cheated

149:30

easily on the outer loop could be used as an outer-loop signal for something like

149:35

discovering the principles of deep learning. Now, of course, to make it tractable—and this

149:39

is where research taste really matters—you have to come up with ways to initialize

149:44

your problem so that you don't try to solve a very intractable problem.

149:49

Maybe you can leverage LLMs as a universal grammar in the middle to

149:54

give you some sort of local feedback. The fact that LLMs are a universal

149:58

grammar means that they can move at almost any level of the stack.

150:02

They can think very locally, as well as step back and think in very broad steps.

150:07

I think that's where a lot of the lateral thinking ability of humans comes from:

150:12

knowing when the track you're pursuing or the objective you're pursuing is not right,

150:15

and you should be asking a different question. The other question is how stackable local

150:20

improvements are in the attempt to get to a better result on the outer loop.

150:25

I've heard rumors that at some AI labs, the thing that has gone wrong is that

150:29

people will individually pursue good ideas, but those don't end up stacking well, and so

150:34

the training run fails because of some weird interaction between two seemingly good ideas.

150:40

Having a single top-down vision of how things should work is very important.

150:44

Having worked at different AI labs and also played around with parallel agents

150:50

trying different ideas, what's your sense of how parallelizable AI innovation is?

150:55

Great question. I think the research taste for executing well on the Bitter Lesson is

151:03

that you need to know how much the Bitter Lesson can buy you and how much is too much

151:09

to ask for, at any given moment. Of course, in the fullness of time

151:13

compute is the single most important determinant of how things work.

151:20

It's almost inevitable that as you scale up energy and compute and parameters,

151:24

intelligence will just fall out of that. That's super beautiful, super profound. No algorithmic

151:29

detail really matters beyond that. But in the present day, we don't

151:33

have infinite compute and parameters and arbitrarily good initialization, so we

151:37

have to come up with heuristics that give us that. These heuristics are probably somewhat redundant.

151:44

That's probably why you see this effect where a lot of these compute

151:47

multipliers don't necessarily stack. They might have some correlated benefit.

151:52

And then three years down the line, when the Nvidia GPUs have gotten even

151:56

stronger, maybe they stack even less well. Maybe at any given point in time, the benefit

152:03

of any given compute multiplier is transitory, which is what I suspected with the KataGo paper.

152:08

There were many algorithmic ideas applied, and then you can see that with modern Blackwell GPUs

152:14

and Ada-class GPUs—which are much better than the V100-grade GPUs that that paper used—some of these

152:21

algorithmic tricks to speed up convergence just don't matter so much compared to something else.

152:26

I think that's a matter of taste in the present time.

152:30

How about the outer loop? How verifiable for making AI smarter?

152:37

With Go, you do have this outer loop of win rate against the best open-source model out there.

152:43

Even there, as you were saying, there are other outer loops of whether you discovered a new

152:48

phenomenon, which is actually very hard to…. If you didn't know scaling laws were

152:53

important… When were Chinchilla or Kaplan scaling laws released?

152:56

2018. So if you're back in 2015, there's

153:00

not an automated procedure one can easily imagine for knowing which paper is the scaling laws paper

153:07

versus which is just another random plot. Even in the Go case, it's a hard-to-verify

153:12

outer loop, and the whole idea of an outer loop is to have some backstop on improvement,

153:21

let alone for general AGI, where of course we have a bunch of these benchmarks. But there's

153:27

a problem. We know the things we can measure, and we improve on the things we can measure.

153:32

We care about this broader ability to do economically useful work,

153:35

which is not super easy to measure, at least until you automate everything.

153:42

There's a question of how good the outer verification loop is

153:47

for AI self-improvement, and does that matter? I'm going to give a non-rigorous argument,

153:53

but one that I intuitively believe. DeepMind started with a focus on games.

154:00

They used games as their outer loop, and then their researchers learned from the experience

154:06

of solving games, and now they're working on LLMs. Presumably, there was some positive transfer from

154:11

their time working on games and Atari and Go and StarCraft that now helps them make good LLMs.

154:19

I assume that there's positive transfer in some regard, whether it's coding or

154:22

general research ability or project management. All these things probably help them do well.

154:31

If that's the case, why wouldn't it also be true for automated AI researchers?

154:36

They should be able to positively transfer experience tackling quick-to-verify,

154:41

quick-to-iterate environments to something more ambitious and economically useful,

154:45

like automating drug discovery. I don't know. Isn't the issue…

154:51

Historically, until Gemini 3 or whatever a couple years ago, people were saying, "Look,

154:58

Google isn't catching up in LLMs because they're too tied to the old approach."

155:06

There are gains, but there are also ways in which it actively hinders you. It's not obvious to me.

155:14

The jury's still out. Let's say currently Google's doing quite well.

155:20

Who knows if the initialization of training on games is ultimately going to hobble their ability

155:25

to be the winner in the long term. It's hard to say for sure.

155:32

Likewise, who knows if the seeming late start was really just them pre-training

155:37

for longer on how to scale up TPUs. They invested all their tech tree in

155:42

getting TPUs to be good, which seemed not that useful in the short term but in the long term

155:46

it becomes… So it's even hard for humans to reason about what the optimal research strategy

155:51

should be, even with the data we have today. Okay, we should let people know how they can

155:59

find out more about this project, whether to fork it themselves, or check out your blog post where

156:04

you do an excellent job explaining many of these ideas. Where do people go next?

156:09

My website is evjang.com. There's a blog post that links to an interactive version of this tutorial.

156:16

On my GitHub, which is the username ericjang, there's an AutoGo repo that people can fork

156:22

and reproduce the training results. I also highly recommend people check

156:25

out this blog post, "As Rocks May Think." We touched on some of the ideas in this

156:28

conversation, but it's this grander thesis of what happens when you have thinking as a primitive in—

156:38

Computer science. Exactly. I highly recommend

156:41

people check out that blog post as well. I encourage the audience to think about

156:45

the relationship between thinking and Go via MCTS and search, and how it relates to LLMs.

156:51

I think there's something quite profound there, and probably underexplored just because Go has

156:56

been relatively underexplored compared to the boom in LLMs.

156:59

It's not to say that I think we should have trees in our LLMs, but there is

157:04

some very interesting duality between them. You can actually do a lot of research on Go,

157:09

MCTS, and reasoning with very small budgets. So that's very exciting.

157:14

Awesome, Eric. Thanks for doing this. Thank you. It's an honor to be on the podcast.

Interactive Summary

Ask follow-up questions or revisit key timestamps.

Eric Jang, formerly VP of AI at 1X Technologies and a senior research scientist at Google DeepMind Robotics, spent his sabbatical rebuilding AlphaGo. He explains AlphaGo's breakthrough in solving the complex game of Go, which was previously deemed intractable for traditional search, by integrating deep learning with Monte Carlo Tree Search (MCTS). The system uses a value network to predict win/loss probabilities and a policy network to suggest optimal moves, with MCTS iteratively refining these networks through self-play. This training method offers stable and sample-efficient learning, contrasting with the high-variance, end-of-trajectory reward common in LLM reinforcement learning. Jang also touches upon the implications for automated AI research, highlighting LLMs' strengths in hyperparameter optimization and experiment execution, while noting their current limitations in lateral scientific thinking. He concludes by discussing the substantial reduction in compute costs for replicating AlphaGo's capabilities today compared to its initial development.

Recently Distilled

Videos recently processed by our community

How The Federal Reserve Could Shrink Trillions From Its Balance Sheet | Darrell Duffie

Jul 18, 2026

by The Monetary Matters Network

Breaking Down the Multi-Manager Playbook: How This $19B CIO Thinks About Alpha | Sean McGould

Jul 18, 2026

by The Monetary Matters Network

Semiconductors Are Gushing Cash… Here’s What’s Next in The AI Trade | Ben Pouladian

Jul 18, 2026

by The Monetary Matters Network

Why Capturing The Market’s Biggest Trends Means Embracing High Volatility | Takahe Capital

Jul 18, 2026

by The Monetary Matters Network

Meet Moonshot, China's latest Al challenger

Jul 18, 2026

by Yahoo Finance

Bank Earnings Just Gave the Market a Much Needed Confidence Boost | The Weekly Wrap

Jul 18, 2026

by Steve Eisman

The Only TWO Ways to Make Money in This World | Steve Harvey

Jul 18, 2026

by The Official Steve Harvey