RL Environments at Scale – Will Brown, Prime Intellect

Transcript

0:13

[music]

0:20

Today we're talking about RL

0:22

environments and how to scale them. But

0:25

the title is a little bit of a red

0:27

herring. We'll talk a bit about the

0:28

engineering pieces and like running

0:30

these with thousands of parallel

0:31

rollouts and sandboxes on hundreds of

0:33

GPUs, but I'm mostly going to focus on a

0:35

different notion of scale. Uh, and what

0:39

I mean by scaling here is, there's a

0:42

number of different ways we talk about

0:43

scaling in the context of AI and

0:45

research. We know about scaling laws and

0:47

we talk about how much data you need,

0:49

compute and parameters and that if you

0:51

pour in more data and compute and

0:53

parameters or inference time. All of

0:55

these things make models smarter or more

0:57

performant. But there's also a fuzzier

0:59

side of scaling which is sometimes

1:01

referred to as unhobbling or algorithmic

1:03

tricks or talent. But where does this

1:06

come from? It's not just pouring in

1:08

resources, but it's something that is

1:10

more intangible, harder to put a finger

1:12

on, but really it comes from a community

1:15

of people, a company, an organization,

1:17

universities, the world, the internet,

1:20

talking about ideas and sharing them and

1:23

working on different applications,

1:24

having these applications inspire ideas,

1:26

using these ideas as test beds for

1:29

different techniques, and building on

1:31

top of these to increase the

1:32

accessibility for other people in the

1:34

future to not have to reinvent the wheel

1:35

and to be able to build from uh what has

1:39

been done by those before them to uh do

1:42

more effective research and accelerate

1:44

the pace of innovation.

1:46

And so why do we have this talent

1:48

bottleneck? There's a big issue that we

1:49

hear all about with AI labs trying to

1:51

like find more talent and salaries are

1:53

going through the roof and everyone

1:55

wants to hire the best and brightest AI

1:57

researchers. But one other approach

1:59

besides trying to just pay the most is

2:02

increase the pool. Uh and so how do we

2:04

increase the pool of AI researchers? How

2:06

do we make doing AI research more

2:08

accessible? And I want to talk a bit

2:10

about who we are at Prime Intellect. If

2:11

you haven't heard of us, we are a bunch

2:13

of things. We're a research lab. We are

2:15

a compute provider. We're a platform

2:17

company. And we are an open source

2:18

ecosystem. We do a lot of things and

2:20

they all fit together in a way that I'm

2:22

going to try to explain in this talk.

2:24

But we see these as all different pieces

2:26

of how we can build a business around

2:28

doing exactly this, which is increasing

2:30

the accessibility of AI research and

2:32

making doing research more of a toolkit

2:35

available to people at organizations

2:36

around the world without needing to be

2:38

inside of a large lab or without needing

2:40

to spend crazy amounts on massive

2:42

clusters or go do a PhD. We think that

2:44

there's versions of doing AI research

2:46

that really should be part of the

2:48

bread-and-butter workflows of AI engineers

2:51

around the world as we build

2:52

applications and try to improve our

2:54

systems and models and products.

2:57

And I think a thing people are kind of

2:59

iffy about in terms of AI is whether

3:02

open source models are going to work.

3:03

And in my mind, that's not quite the

3:05

right analogy to draw. And so when we're

3:07

comparing like AI to traditional

3:08

software, there's lots of like great

3:11

examples of open source software

3:12

ecosystems that have been thriving in

3:14

the past, things like Linux and Node and

3:16

Apache. But in my mind, the analogy in

3:19

AI is not models as kind of these fixed

3:22

checkpoints, but it's about research as

3:23

a practice and research as a set of

3:25

ideas. And it's one that's more

3:27

intangible, but there's a lot of

3:29

parallels in terms of the goals of the

3:32

best practices of growing a research

3:33

ecosystem as well as a software

3:34

ecosystem where you want to uh compound

3:37

abstractions and best practices and have

3:38

better tooling and iteration efficiency

3:40

and have these gains over time allow uh

3:44

more advanced powerful complex things to

3:46

be built by uh decreasing barriers to

3:49

entry for any given application and

3:51

allowing this to become more accessible.

3:54

And so a term we'll

3:56

use to describe some of what we're

3:58

building at Prime Intellect is, we like this

3:59

phrase called the open super

4:00

intelligence stack. One because it's a

4:02

fun acronym but also I think the idea of

4:04

the stack of all the pieces of the

4:06

puzzle to build the engine to go do

4:08

research. Uh there's a lot of layers to

4:10

it. You need compute uh you need

4:12

orchestration you need libraries for

4:14

doing uh training and evaluation and you

4:16

need platforms to support things like

4:19

code execution and eval inference and

4:21

fine-tuning and we're doing all these

4:22

things. Uh but really the goal of this

4:24

is to give people the tools to be able

4:27

to go train models. We want people more

4:29

people in the world. And we think I'll

4:30

explain why in a bit. There's a lot of

4:32

reasons why uh the best products are

4:35

going to be the ones that are not just

4:38

kind of taking the thing out of a box of

4:40

an API and putting a thin wrapper around

4:42

it. There's ways you can kind of improve

4:44

around APIs. But I think in many cases

4:46

people are realizing that winning

4:48

products are going to be the kinds of

4:49

things that whether it's a part of the

4:52

model, a part of the stack, the part of

4:53

the product or the whole thing, the

4:55

ability to do research and have at least

4:57

the option of deciding where in your

4:59

product you might want to customize a

5:01

model or improve a model gives you a lot

5:02

more flexibility to really make the

5:05

best user experience. Um,

5:09

and so we have heard the phrase in the

5:11

past that the model is the product. And

5:13

I think we're starting to see now this

5:15

change a little bit to a lot of winning

5:16

applications have the product kind of be

5:18

the model. And I think the two notable

5:20

examples of this that I'm big fans of

5:22

and heavy users of are Cursor's new

5:24

Composer model as well as uh OpenAI's

5:26

Codex. And I think these are both good

5:28

examples of models that really are where

5:31

the product kind of is the model very

5:33

directly where the the model was trained

5:35

to be the model for that product and the

5:37

experience of using the model is the

5:39

experience of using the product. And the

5:41

way that this is done is

5:42

by taking a harness that represents the

5:44

product and training the model in the

5:46

harness in essentially an environment,

5:48

an RL environment. And environments

5:50

really are just a harness with a

5:52

collection of tasks and rewards. But

5:55

they also have many other parallels

5:57

throughout the ecosystem. Environments

5:58

are not just for RL. Environments are

6:00

also essentially the same thing as

6:02

evals. Environments can also be engines

6:04

for synthetic data which then you can

6:05

use for SFT or distillation. You can do

6:08

RL in them directly. But also the agents

6:10

we're actually deploying and monitoring

6:11

out in the world. These are

6:12

environments. The product of these

6:14

things, the tasks, the harness, and the

6:16

rewards, whether this is a data set

6:18

offline or the stream of user tasks

6:20

coming in to a product is an

6:22

environment. And so this as an

6:24

abstraction I think is a very useful way

6:26

of framing what it might look like to

6:28

start having uh research become more of

6:31

a a practice that is adopted more

6:33

broadly beyond just large AI labs. And I

6:37

also think that there's a sense in which

6:38

they're a really accessible entry point.
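
A minimal sketch of the abstraction described above: an environment as a harness plus a collection of tasks and rewards, reused for both evaluation and synthetic data. Every name here (`Task`, `Environment`, `toy_policy`, `exact_match`) is hypothetical and chosen for illustration; this is not the API of any Prime Intellect library, and the "model" is a hard-coded stand-in.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical names for illustration only; not a real library API.


@dataclass
class Task:
    prompt: str
    answer: str


# A policy maps a prompt to a completion (in practice, a model call).
Policy = Callable[[str], str]
# A reward function scores a completion against its task.
RewardFn = Callable[[Task, str], float]


def exact_match(task: Task, completion: str) -> float:
    """Reward 1.0 when the completion equals the reference answer."""
    return 1.0 if completion.strip() == task.answer else 0.0


@dataclass
class Environment:
    tasks: list[Task]
    reward_fn: RewardFn

    def harness(self, policy: Policy, task: Task) -> str:
        """Single-turn harness: show the prompt, collect one completion."""
        return policy(task.prompt)

    def rollout(self, policy: Policy) -> list[dict]:
        """Run every task through the harness and score the result."""
        records = []
        for task in self.tasks:
            completion = self.harness(policy, task)
            records.append({
                "prompt": task.prompt,
                "completion": completion,
                "reward": self.reward_fn(task, completion),
            })
        return records


def toy_policy(prompt: str) -> str:
    """Stand-in for a model: knows one answer, gives up otherwise."""
    return "4" if prompt == "2 + 2 = ?" else "unknown"


env = Environment(
    tasks=[Task("2 + 2 = ?", "4"), Task("Capital of France?", "Paris")],
    reward_fn=exact_match,
)
records = env.rollout(toy_policy)

# Used as an eval: mean reward over the task set.
mean_reward = sum(r["reward"] for r in records) / len(records)

# Used as a synthetic data engine: keep only fully rewarded rollouts for SFT.
sft_data = [(r["prompt"], r["completion"]) for r in records if r["reward"] == 1.0]
```

The same `records` list serves both uses, which is the sense in which an eval and a data engine are one object here; an RL trainer would consume the per-rollout rewards instead of filtering on them.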

6:39

Uh and so I like the analogy of

6:41

environments as kind of like the web

6:42

apps of AI research. And what I mean by

6:44

this is that they're very simple.

6:46

They're self-contained. They can they

6:47

start simple but they can also get quite

6:49

complex. They can get very elaborate

6:51

representing the full complexity of a

6:52

large product. They're also pedagogical

6:54

in nature and that you can start simple

6:56

and as you build complexity, you start

6:59

bumping into these walls where you have

7:00

to start learning new concepts,

7:02

understanding more about scaling the

7:03

system side, understanding more about

7:05

the hyperparameters and the algorithms

7:06

and they kind of open this door where

7:08

you can by playing around with them

7:11

start entering into a world of research

7:14

without needing to kind of build a whole

7:15

training infrastructure system from

7:17

scratch. Um, and they also require

7:20

experimentation. And so I think the key

7:21

uh differentiation between

7:24

just an agent harness and an agent

7:26

environment is that the environment

7:28

forces you to also have your tasks and

7:30

your rewards predefined to be able to do

7:32

this experimentation. It's a proper

7:33

eval. And what this means is that you

7:36

can't just vibe check it. You can't just

7:39

like build it and test it out a bit and

7:40

say, "Hey, it's good. We're going to

7:41

ship it." It forces you to say, "Okay,

7:44

let's think about this a little more

7:45

scientifically. Let's do some

7:46

experiments. Let's try out different

7:48

models, try different hyperparameters."
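
As a rough illustration of treating this as an experiment rather than a vibe check, the sketch below scores several configurations against one fixed task set with a predefined reward. The "models" are simulated: `make_policy`, the `MODELS` error rates, and the way temperature scales the error rate are all assumptions made up for this example, not measurements of any real system.

```python
import itertools
import random

# Fixed task set with a predefined reward: (question, reference answer).
TASKS = [(f"{a} + {b}", str(a + b)) for a in range(5) for b in range(5)]


def make_policy(error_rate: float, rng: random.Random):
    """Hypothetical stand-in for a model: wrong with probability error_rate."""
    def policy(question: str) -> str:
        a, b = (int(x) for x in question.split(" + "))
        off_by_one = 1 if rng.random() < error_rate else 0
        return str(a + b + off_by_one)
    return policy


def evaluate(policy) -> float:
    """Mean exact-match reward over the whole task set."""
    return sum(policy(q) == ans for q, ans in TASKS) / len(TASKS)


# Made-up candidates: a base error rate stands in for model quality, and
# temperature scales it as a stand-in for a sampling hyperparameter.
MODELS = {"small": 0.4, "large": 0.1}
TEMPERATURES = [0.0, 1.0]

results = {}
for (name, base_error), temp in itertools.product(MODELS.items(), TEMPERATURES):
    rng = random.Random(0)  # same seed for every configuration
    results[(name, temp)] = evaluate(make_policy(base_error * temp, rng))

# One row per configuration, all measured on the same tasks.
for config, score in sorted(results.items()):
    print(config, f"{score:.2f}")
```

Because the tasks, the reward, and the seed are held fixed, any difference between rows comes from the configuration being compared.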

7:49

Uh, and it also gets you to the point

7:51

where you can start doing more advanced

7:54

research in terms of RL training or

7:55

distillation or fine-tuning. And uh, so

7:58

to really facilitate this, we wanted to

8:01

make the environment as an entry point

8:03

much more accessible. A few months back,

8:04

we launched what we called the

8:06

Environments Hub, which is an open source

8:08

community platform for creating,

8:09

discovering, and sharing RL environments

8:12

and evals. And so far, we've had a lot

8:14

of fun kind of seeing everyone build

8:15

here. We've had hundreds of builders and

8:17

environments come create either their

8:19

own ideas or re-implement papers. Uh

8:22

there's a bunch of examples here I can

8:24

show you, but really it's just a bunch

8:25

of people who have wanted to do research

8:28

and found this as an entry point to

8:30

start digging a little deeper. Whether

8:32

this was investigating some benchmark

8:33

and figuring out how to reimplement it

8:35

or modify it to be appropriate for an RL

8:37

context in terms of like new data or new

8:39

examples or whether this is

8:41

some game that they'd been thinking

8:42

about or some other task. But having

8:45

this as an abstraction for encapsulating

8:47

the thing you want a model to do is

8:50

a way of allowing yourself to start

8:52

experimenting with ways of improving it

8:53

without needing to have the answers. So

8:55

I think people talk a lot about how

8:57

fine-tuning never really took off in the

8:58

SFT regime. And I think a big part of

9:00

this is that getting data was really

9:02

hard of the actual like solutions. I

9:04

think having labeled examples of what

9:06

you want the model to do is a very

9:08

difficult thing to ask someone to go

9:10

create. But if you can just think about

9:12

the settings it might be in without

9:14

having the answers up front, if you can

9:16

measure the answers, now you kind of can

9:18

start creating data on the fly. And this

9:20

engine is really what the environment is

9:22

about unlocking. Um, and so actually 9

9:26

months ago, I was right here in this

9:27

room. I had just released a library

9:29

called verifiers, which I'm still

9:31

working on today. Um, it's come a long

9:33

way, but it's a toolkit for building

9:35

these things. And it's been a lot of fun

9:38

over this past year just playing with it

9:40

and extending it to support more

9:42

features and kinds of environments. But

9:44

the idea with verifiers is to give

9:45

people a toolkit that is uh essentially

9:48

a bunch of components that you can mix

9:49

and match and compose to do things like

9:51

from simple evals or QA or games to

9:54

things like tool use or using sandboxes

9:55

or agent frameworks or uh like CLI

9:58

coding agents or math problems. There's

10:00

all sorts of things you might want

10:01

models to do or agents to do. And it's a

10:04

toolkit for building environments that

10:05

is then uh ready to be automatically

10:07

trained with reinforcement learning. And

10:10

the way we thought about this design,

10:12

it's been a lot of fun and also a big

10:14

challenge to think like okay, how do you

10:16

make a toolkit for this stuff that

10:18

actually covers all the bases? And I

10:19

think there's a lot of different

10:21

approaches I've seen people go about and

10:23

I I think they all make sense depending

10:24

on what sorts of things you're wanting

10:26

to work on. But we took a very kind of a

10:29

general approach where we tried to say

10:31

we are not going to know all the answers

10:33

right away. There are going to be lots

10:34

of patterns. There's going to be lots of

10:36

special cases. There's going to be

10:38

hierarchies of complexity. There's going

10:39

to be patterns. And we really wanted to

10:41

prioritize extensibility. So we think

10:43

about these things hierarchically where

10:45

let's say you want to do a coding

10:46

agent environment for Terminal-Bench,

10:49

which is an instance of the Harbor

10:51

framework, which is an example of a CLI

10:54

agent which is a multi-turn environment

10:56

which is an environment. Similarly for

10:58

TextArena and Wordle, or for search

11:00

with MCP, or for giving a model a Python

11:02

REPL in a sandbox. And so thinking of

11:04

these things hierarchically allows us to

11:06

kind of really determine like what are

11:08

the foundational pieces what is generic

11:09

across all environments and then how do

11:11

you build up the stack towards

11:13

applications
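
One way to picture this layering is a small class hierarchy: a generic base, a multi-turn layer that owns the turn loop, and an application on top. This is a sketch under assumptions, not the library's actual classes: `Environment`, `MultiTurnEnv`, `GuessNumberEnv`, and `BinarySearchPolicy` are hypothetical names, and a number guessing game stands in for a Wordle-style task.

```python
from abc import ABC, abstractmethod

# Hypothetical class names that mirror the layering described in the talk.


class Environment(ABC):
    """Generic across all environments: produce a rollout and score it."""

    @abstractmethod
    def rollout(self, policy, task): ...

    @abstractmethod
    def reward(self, task, transcript): ...


class MultiTurnEnv(Environment):
    """Adds the turn loop; subclasses define how the world responds."""

    max_turns = 8

    @abstractmethod
    def respond(self, task, action):
        """Return (observation, done) for the agent's latest action."""

    def rollout(self, policy, task):
        transcript, observation = [], "start"
        for _ in range(self.max_turns):
            action = policy(observation)
            observation, done = self.respond(task, action)
            transcript += [action, observation]
            if done:
                break
        return transcript


class GuessNumberEnv(MultiTurnEnv):
    """Application layer: guess a secret integer from higher/lower hints."""

    def respond(self, task, action):
        guess = int(action)
        if guess == task:
            return "correct", True
        return ("higher" if guess < task else "lower"), False

    def reward(self, task, transcript):
        # Solving in fewer turns earns more; an unsolved game earns nothing.
        if not transcript or transcript[-1] != "correct":
            return 0.0
        turns = len(transcript) // 2
        return 1.0 - (turns - 1) / self.max_turns


class BinarySearchPolicy:
    """Stand-in for a model: narrows a range using the hints it observes."""

    def __init__(self, low=0, high=63):
        self.low, self.high, self.last = low, high, None

    def __call__(self, observation):
        if observation == "higher":
            self.low = self.last + 1
        elif observation == "lower":
            self.high = self.last - 1
        self.last = (self.low + self.high) // 2
        return str(self.last)


env = GuessNumberEnv()
transcript = env.rollout(BinarySearchPolicy(), task=37)
reward = env.reward(37, transcript)
```

Only `respond` and `reward` are specific to the game; the turn loop lives one level up, so another multi-turn application could reuse it unchanged.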

11:14

and so for one like example of this that

11:16

I'll kind of walk through the whole

11:17

process end to end. And we call this one

11:19

wiki search, but it's basically a simple

11:21

search setting where we give an agent

11:23

the ability to uh call some tools to

11:26

search over Wikipedia pages and find

11:28

some answers. And so here is the

11:29

Environments Hub page. So the

11:30

Environments Hub is a kind of full-stack

11:32

uh code management package registry. So

11:36

every environment is a Python project

11:37

where you can have dependencies and

11:39

versions and uploading your evals and

11:41

whatnot. Um but the environments are

11:43

very simple. They start simple. They can

11:45

get really complicated, but this one's

11:46

pretty simple where we just kind of

11:48

define our tools as async Python

11:49

functions. We have our data set and we

11:52

have what we call a rubric. And so a

11:53

rubric is the abstraction for managing

11:55

the different pieces of your rewards

11:57

where you can kind of compose different

11:58

things. You can also have metrics that

12:00

are just a zero reward but are for uh

12:03

observability of what's going on. And

12:05

then the other piece of doing training

12:07

will be a config. And so the config here

12:08

is for our prime-rl trainer, which is

12:10

our kind of large scale training stack,

12:12

which has been our uh culmination of all

12:15

the best practices from the research

12:16

literature for large scale asynchronous

12:18

RL training. Um, but the config files

12:20

are intended to expose kind of the

12:21

pieces that people need to think about

12:23

in ways that are starting to get you

12:25

more into the algorithm, but are also

12:27

still designed to be pretty high level,

12:30

pretty self-contained, and with

12:31

defaults that we think are going to be

12:32

sensible for a lot of people. And so

12:35

running this is just kind of running a

12:37

command line with uh you specify the

12:39

environment and if it's in the

12:40

Environments Hub, it'll automatically

12:42

install it and start your training run

12:43

and then you can if you're lucky see

12:45

your reward curve just shoot right up.
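
The pieces named in this walkthrough (tools as async Python functions, a dataset, and a rubric that composes rewards with observability-only metrics) can be sketched in plain Python. This does not use the verifiers API: `ToyRubric`, `search_pages`, `read_page`, and the two-page corpus are hypothetical stand-ins, and reading "zero reward" as "a function with weight 0.0" is an assumption. The training config and command line are omitted because the talk does not spell them out.

```python
import asyncio

# Toy corpus standing in for Wikipedia; every name here is hypothetical.
PAGES = {
    "Ada Lovelace": "Ada Lovelace wrote the first published algorithm "
                    "for the Analytical Engine.",
    "Alan Turing": "Alan Turing proposed the Turing machine in 1936.",
}


async def search_pages(query: str) -> list[str]:
    """Tool: titles of pages whose text mentions the query."""
    return [t for t, text in PAGES.items() if query.lower() in text.lower()]


async def read_page(title: str) -> str:
    """Tool: the text of a page, or an empty string if it is missing."""
    return PAGES.get(title, "")


DATASET = [
    {"question": "Who proposed the Turing machine?", "answer": "Alan Turing"},
]


class ToyRubric:
    """Weighted sum of scoring functions; weight 0.0 means metric only."""

    def __init__(self, funcs, weights):
        self.funcs, self.weights = funcs, weights

    def score(self, example, rollout):
        metrics = {f.__name__: f(example, rollout) for f in self.funcs}
        reward = sum(
            w * metrics[f.__name__] for f, w in zip(self.funcs, self.weights)
        )
        return reward, metrics


def correct_answer(example, rollout) -> float:
    return float(rollout["answer"] == example["answer"])


def num_tool_calls(example, rollout) -> float:
    return float(len(rollout["tool_calls"]))


# num_tool_calls is logged for observability but does not affect the reward.
rubric = ToyRubric([correct_answer, num_tool_calls], [1.0, 0.0])


async def scripted_agent(example) -> dict:
    """Stand-in for a model: one search, one read, answer with the top hit."""
    calls = []
    titles = await search_pages("Turing machine")
    calls.append("search_pages")
    text = await read_page(titles[0])
    calls.append("read_page")
    return {"answer": titles[0] if text else "", "tool_calls": calls}


rollout = asyncio.run(scripted_agent(DATASET[0]))
reward, metrics = rubric.score(DATASET[0], rollout)
```

Keeping the tool-call count at weight 0.0 lets it be tracked alongside the reward curve without the optimizer being pushed toward more or fewer calls.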

12:47

Um and sometimes it doesn't go this

12:50

nicely but the process of doing this is

12:52

iterating on your environment on your

12:54

rewards and your data and your tasks to

12:57

understand what makes this task

12:59

holistically actually tangible in

13:01

practice. How do you tune the

13:02

parameters? How do you look at your

13:03

data? How do you define your rewards?

13:05

Uh, and if you do this right, you can

13:06

get really good improvements, especially

13:08

from really small models, but also for

13:09

much larger models. And so in this

13:11

example for the wiki search one, we

13:13

started with a Qwen3 4B model, which

13:15

was about 55%. And after training, it

13:17

was at 89% on par with uh much larger

13:20

models like GPT-4.1 as well as reasoning

13:22

models like uh GPT-5 mini. And so I think

13:25

this practice of taking small models and

13:27

being able to make them much better is a

13:30

big win for a lot of applications where

13:31

you either you want a really fast model,

13:33

you want a really cheap model, you want

13:34

a really really powerful model because

13:35

the best models out there just aren't

13:37

quite good enough. These are all the

13:38

different things you can do with model

13:40

customization. And this practice of

13:42

creating environments isn't

13:44

only for customization, but it gives you

13:46

this option. And so if you need to do

13:48

evals anyway, it's useful to think of

13:50

them as environments because the

13:51

environment opens a lot of doors for

13:55

whether this is prompt tuning or whether

13:56

it's model selection or whether it's

13:59

just getting a better sense of how your

14:01

system could work at scale with many

14:03

many users in parallel. It's a design

14:05

process that really forces you to kind

14:06

of pin down what is the thing I care

14:08

about? What is my agent? What is my

14:10

product? What is my harness? What am I

14:12

optimizing for? Um, and so to kind of

14:15

fully stress test this, we've been

14:16

training a large model which will be out

14:18

into the world quite soon called

14:19

INTELLECT-3 with our full prime-rl

14:21

stack. And this has been us really kind

14:23

of validating the efficiency and

14:25

performance at a very large scale. So

14:27

this is a 100B plus model trained on 500

14:29

GPUs where we've kind of done the

14:31

end-to-end uh post-train of SFT and RL,

14:34

which the prime-rl stack also has SFT if

14:36

people want to do that. But it's also

14:38

been about just understanding all the

14:39

best practices. We love reading papers

14:41

and we try to kind of try out all the

14:43

tricks and see which ones work and see

14:45

which ones don't and then distill this

14:47

into a library with prime-rl that can

14:49

then be kind of consumed by the end user

14:51

without needing to do all this uh

14:53

implementation themselves. And so for us

14:56

it being open is very important. So

14:57

prime-rl is on GitHub. You can go find

14:59

it. Verifiers is on GitHub if you want

15:01

to check it out. And for us, this is

15:03

really about opening the door for more

15:05

people to start learning about these

15:07

things and for incorporating it into

15:08

their workflows for optimizing their

15:10

models and their products. Um, and

15:13

the only way to do this,

15:14

or what we see as the best way

15:16

to do this is through growing community.

15:17

And so for us, it's been really

15:18

important to really think about getting

15:21

good feedback loops from the people who

15:23

are building with this and understanding

15:25

what they want, understanding what's

15:26

going well, understanding what's

15:28

painful, and addressing those problems.

15:30

And so we've done a number of community

15:31

programs in terms of sponsoring

15:33

different kind of small tasks to uh a

15:35

research residency program with uh grad

15:37

students around the world uh and

15:39

collecting like uh a smaller subset of

15:41

the Environments Hub ones where we'll

15:42

actually review them manually. And so

15:44

this repo here, the prime-environments

15:45

repo is the ones where we are doing

15:47

these directly where we're kind of

15:49

offering to look over someone's kind of

15:51

example. And so we've had hundreds of

15:54

these come in and there will be hundreds

15:55

more. And uh it's been a great learning

15:57

process because it's forced us to fix a

15:59

lot of things. We kind of understand the

16:00

rough edges. We understand what we need

16:02

to add. And we're kind of then

16:05

distilling all of these

16:06

learnings into what will be our kind of

16:08

upcoming uh platform product which we're

16:10

calling Lab. And the idea of Lab is to

16:13

give people an interface, a platform

16:15

where they can browse environments, they

16:17

can run their evals, they can do their

16:18

inference, they can do their fine-tuning

16:20

and they can have research be more

16:23

accessible in a way that it hasn't been

16:25

historically because I think a lot of

16:27

people find infrastructure very painful.

16:29

They find dealing with torch versions

16:32

painful, FlashAttention and vLLM and

16:34

getting all these things to work. We are

16:36

happy to do that, but we understand that

16:38

a lot of people may not want to. Um, and

16:40

so the idea with this is that if you

16:42

want to go read the code, you can go

16:44

read the code, but you don't have to run

16:46

it. We can run it for you. Um, and so

16:48

this has been our version, which will be

16:50

kind of out into the world in the near

16:52

future of trying to allow people to

16:55

really focus on the environment where

16:56

the entry point to Lab will be the

16:59

environment. If you want to do synthetic

17:00

data and SFT, let's build an

17:02

environment. If you want to do your

17:03

evals, you build that as an environment.

17:04

If you want to do RL, you build an

17:05

environment. And I think building an

17:08

environment is the kind of thing that

17:11

I imagine a lot more people are going to

17:13

want to be doing as we start really

17:17

seeing where models are headed. In some

17:19

cases, this will be we're going to use

17:20

fine-tuning services from the labs

17:21

because they're going to offer this

17:23

because people want it. In some cases,

17:24

this will be we really care about the

17:26

smallest model we can run on prem at the

17:28

lowest latency and we're really just

17:30

going to optimize for our one thing. Or

17:32

it could just be research for the sake

17:34

of research and advancing our kind of

17:36

collective understanding of how this

17:37

stuff all works. And I think that's

17:38

really our goal is to have a world where

17:41

there's going to be a lot of AI and

17:43

where we can all kind of talk about it

17:45

and understand it and look at it and

17:46

poke at it and tweak it and have a

17:50

better sense of what we're actually

17:51

building because I think there's a lot

17:52

of times when it feels like we're just

17:53

kind of the model is a black box and

17:56

digging into the research and going

17:58

under the hood and changing things and

18:00

breaking things tells you a lot about

18:02

how these models work. It tells you a

18:03

lot about understanding where they came

18:05

from, where they could be going, where

18:06

they might be headed, and preparing for

18:09

that future. Thanks.

18:11

[applause]

18:13

[music]

Interactive Summary

The video discusses the importance of scaling AI research not just through compute and data, but by fostering a community and increasing accessibility through an 'open super intelligence stack.' The speaker introduces 'environments' as a central, accessible abstraction for AI research—comparable to web apps for software development—that allow engineers to define tasks, metrics, and rewards for training and evaluation. Through the Prime Intellect platform, they aim to lower the barrier to entry for model customization, fine-tuning, and RL, enabling developers to build specialized models efficiently without needing massive infrastructure or a PhD.