Building Cursor Composer – Lee Robinson, Cursor

Transcript

0:00

[Music]

0:13

[Music]

0:20

It's great to be back in New York and

0:22

I'm very excited to be here and talk on

0:24

behalf of all of our engineering and

0:26

research teams at Cursor about building

0:28

Cursor Composer, our first agent model

0:30

and my colleague Sasha actually gave a

0:32

version of this talk recently. So I'm

0:34

excited to give my own take on

0:36

it. So Cursor Composer is a model

0:39

designed for real-world

0:41

software engineering and it tries to be

0:43

both fast and smart. So as we've

0:46

measured it against our own benchmarks,

0:48

it's better than the best open-source

0:49

models. It's up against recent

0:51

frontier models, but slightly

0:53

below the latest frontier, with Sonnet 4.5 and

0:56

GPT-5.1 Codex. But where it really

0:58

shines is it's about four times more

1:01

efficient at token generation than

1:03

models at a similar level of

1:04

intelligence. So we're trying to mesh

1:06

speed as well as intelligence. So why

1:10

did we build this model? I mean

1:12

obviously Cursor has an IDE. Why are we

1:14

getting into the model space? Why do we

1:15

care about this? Well, our research and

1:17

product teams have been building a model

1:19

called Tab, which you can use for

1:20

autocomplete. Maybe some of you use that

1:22

inside of Cursor. And we wanted to take

1:24

that same approach for a very low

1:26

latency model and apply it to coding

1:28

with agents. But honestly, we weren't

1:31

really sure if it would work. So, we

1:33

started prototyping some early versions

1:34

of what this model could look like,

1:36

started to put it out and get some

1:37

feedback from users. And we were pretty

1:40

surprised that, under the Cheetah slug we

1:41

released for this model, people actually

1:43

really liked it. They really liked the

1:45

speed, but the feedback we got was it's

1:48

not really smart enough yet to be a

1:50

daily driver for a lot of their coding.

1:52

So we needed it to be smart and fast.

1:55

Definitely needed to be smart. So we

1:56

really worked on making this internal

1:58

benchmark that represented our usage on

2:00

our own repos and how we actually built

2:02

software. Like if we had a model that

2:05

was both fast and smart and a checkpoint

2:07

that our developers would use every

2:08

single day to build the product and to

2:10

build all of our software, then we knew

2:12

that we would be on to something. And

2:14

for example, one big change here that

2:16

helped actually push this towards a

2:17

level where we had a checkpoint where

2:19

people would use it was being able to

2:20

call tools in parallel and being able to

2:22

very effectively use our semantic search

2:24

tool. And we'll talk about that a little

2:26

bit more here later. So if you haven't

2:29

seen it, here's Cursor, in Cursor 2.0,

2:31

in our new view, and we're going to use

2:34

the Composer 1 model, and you'll notice

2:36

that it is doing a lot of things very

2:38

quickly. It's calling a bunch of tools

2:40

in parallel, like grep, so reading a lot of

2:42

files. It's running shell commands.

2:44

It's making file edits. It's writing and

2:47

managing a list of to-dos. And you

2:50

can kind of very quickly work through

2:52

tasks in the foreground here. Uh in this

2:54

case, I'm investigating an issue in an

2:56

open source repo. And I don't know about

2:58

y'all, but this has been a quite

3:00

different programming experience for me.

3:02

I've been working with coding agents for

3:04

a little bit of time now, and versus

3:06

firing off an agent and waiting, let's

3:08

call it 20 minutes, for it to complete,

3:09

where you can kind of context switch

3:11

away, this really does help keep you in

3:12

the flow and is kind of a different

3:14

style of programming I think. So I want

3:16

to talk about how we did this in a way

3:18

that's hopefully accessible for you all.

3:20

I'm not a machine learning researcher

3:21

but I do really enjoy this stuff. I'll cover

3:24

what we learned, some of the

3:25

infrastructure challenges, and then a

3:26

little bit on where we're going

3:28

moving forward. So in Cursor, a user kind

3:31

of submits a query to our backend. The

3:33

agent reads that query and then decides

3:35

to make a series of tool calls. And our

3:37

agent has about 10 tools give or take,

3:40

but we're going to focus on five here.

3:41

So reading files, editing files,

3:43

searching your codebase, looking at

3:45

lints, and then also running terminal or

3:47

shell commands. And the agent then is

3:49

able to autonomously decide, do we call

3:51

these serially or do we run these in

3:53

parallel? And our goal with

3:56

reinforcement learning here is to try to

3:57

mirror the Cursor production environment

4:00

as closely as we possibly can. So this

4:02

data that we have in training, we want

4:03

to kind of pretend like we're actually

4:05

calling real Cursor queries. So to do

4:08

that, we are running a series of

4:09

rollouts. For example, in this

4:11

rollout, we're calling a series of tools

4:13

like reading files and editing files.

4:16

And when we run more rollouts, we can

4:18

start from that same initial starting

4:20

point, but we might call a completely

4:21

different set of tools. So in this one,

4:23

we're also doing codebase search. So we

4:26

score the output, we decide which one is

4:28

better and then we update the parameters

4:31

of our model based on that change. So

4:33

conceptually a pretty simple idea. The

4:36

challenges come from when you take the

4:37

simple idea and then you try to scale it

4:39

up to a very large scale. So there are

4:40

three challenges. The first one

4:42

is trying to match the training and

4:44

inference environment, that is, when the model

4:46

is actually being used in the product.

4:48

In this case with Composer, we're

4:50

training a large mixture of experts

4:51

model and it's being parallelized across

4:54

thousands of GPUs and if we don't speed

4:56

that up, it's going to take forever to

4:58

train the thing. So, we want to make it

4:59

really fast and match the training and

5:02

kind of sampling version to be as close

5:04

as possible. The second challenge is

5:06

that the rollouts can get pretty complex

5:08

when you start to look at real world

5:09

data here. So, models are going to use

5:11

hundreds of thousands to millions of

5:13

tokens. They're going to make hundreds

5:15

of different tool calls. And each of

5:16

these rollouts could take a

5:18

pretty different amount of time. One

5:20

might make a lot of tool calls, one

5:22

might not make as many, and they'll

5:23

complete at different times. So, we have

5:24

to figure out how to deal with that

5:26

challenge. And finally, there's this

5:28

challenge of consistency. If we want to

5:30

mimic the production Cursor environment

5:32

as closely as possible, we need to use

5:34

exactly the same tool format and

5:36

tool responses. But in training, we have

5:39

this really bursty amount of compute.

5:41

Basically, we're doing all of this

5:42

training all at once, which is different

5:44

from production. So, it is really an

5:47

infrastructure challenge. We have these

5:50

three machine learning challenges and

5:52

all of the solutions coincidentally are

5:54

actually infrastructure problems. So,

5:56

let's talk through a few of these

5:57

problems and how we solved them at the

5:59

infrastructure layer. So, our

6:01

architecture is probably familiar for

6:03

some of you who have been involved in

6:05

this space a little bit, but I still

6:06

think it's really interesting to talk

6:07

about at kind of a high level. Uh, we

6:09

have three different servers. We have a

6:11

training server, which is kind of the

6:12

standard ML stack with PyTorch. We have

6:15

an inference server. So the rollouts

6:17

that I just talked about, that's where

6:18

we use Ray. And then we have environment

6:21

servers. And these are the ones where

6:22

we're simulating that Cursor

6:23

environment that I talked about. And all

6:26

these servers talk to each other. So for

6:28

example, the inference server can

6:30

basically send these advantages back to

6:32

the trainer, which is like nudging the model up

6:34

or down based on the rollout, and

6:36

then updating the model and getting new

6:39

parameters.
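
The "advantages" mentioned here can be sketched in a few lines. The talk does not say how Composer's advantages are computed, so this is only an illustrative assumption: each rollout from the same starting point gets a score, and its advantage is that score relative to the group mean, scaled by the group's spread. The function name `group_advantages` and the example scores are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(scores, eps=1e-8):
    """Turn raw rollout scores into advantages (assumed scheme, not Cursor's).

    Rollouts that scored above the group mean get a positive advantage
    (nudge up); rollouts below the mean get a negative one (nudge down).
    """
    baseline = mean(scores)
    spread = pstdev(scores)
    # eps avoids division by zero when every rollout scored the same.
    return [(s - baseline) / (spread + eps) for s in scores]


# Three hypothetical rollouts from the same starting point.
scores = [0.2, 0.9, 0.5]
advantages = group_advantages(scores)
for score, adv in zip(scores, advantages):
    print(f"score={score:.1f} advantage={adv:+.2f}")
```

In a setup like the one described, values of this kind are what the inference side would hand back to the trainer so that it can update the parameters.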

6:40

So this one is a bit more on the ML

6:42

side, but we're trying to train a

6:44

model that's very, very large and to do

6:46

it as fast as possible. And one way that

6:48

our team was able to do this on the

6:50

research side was to develop a library

6:52

of custom kernels that allowed for very

6:54

low precision training. And basically

6:56

this allows us to just speed up the

6:58

training process in a big way and also

6:59

make it much easier to ship to our

7:02

inference server. So, if you're the type

7:03

of person who loves this, we wrote a

7:05

blog post going way in depth on all of

7:07

this that talks about our custom

7:08

kernels. If you're interested, the

7:10

TL;DR here is that we found the mixture

7:12

of experts layer was about three and a

7:14

half times faster on

7:17

Nvidia Blackwell chips. So, it made a

7:19

pretty significant impact on our

7:21

training runs. So, once we update the

7:24

weights, we need to send them back over

7:26

to the inference server during this

7:28

training process. And the inference

7:29

server is the one that's doing all the

7:31

rollouts that I talked about, calling the

7:32

tools and managing what we

7:34

sent. The challenge here is that the rollouts

7:37

all complete at different times. So in

7:39

a naive version of this, there will be

7:41

a lot of wasted time. So what we were

7:44

able to do is do load balancing across

7:46

the different threads and processes to

7:48

basically shift the work around and

7:50

not have a bunch of idle time. So if one

7:52

rollout, for example, makes a ton of tool

7:54

calls, maybe it installs some packages,

7:56

installs some library, we're not just

7:58

sitting there waiting for all of the

7:59

other ones to finish. The inference

8:02

server is spending all this time going

8:03

back and forth making the tool calls to

8:05

the environment and getting the tool

8:07

results back. So again, communicating

8:09

between these servers and we want that

8:11

environment to be as close as possible

8:13

to the Cursor product. One thing that's

8:15

nice about having both the coding agent,

8:18

the IDE, as well as what we're doing

8:20

with the model research and training our

8:21

own models is we can kind of co-design

8:23

these things together. So, as we were

8:25

building out a lot of our RL work for

8:27

this model, we were also building our

8:29

cloud agents product. This is how

8:31

you can run a Cursor agent kind of

8:33

offline. You can run it from your phone

8:35

or on the web or kick it off from Slack,

8:37

for example. And to do this, we spin up

8:39

virtual machines in the cloud. So each

8:41

one of these VMs loads up the user's

8:43

code. It allows the agent to

8:45

make file changes, run tools, and

8:48

edit code in a secure sandbox. And

8:50

coincidentally, this is the perfect

8:52

infra for RL and our use in training. So

8:55

we have this fleet of cloud VMs and

8:58

we have an environment that very closely

9:01

matches the production Cursor

9:02

environment and we can then use that for

9:04

training. This does still have some

9:06

challenges though. I kind of talked

9:07

about how the training workload is very

9:09

spiky and it's different than the kind

9:11

of standard inference when you're

9:13

running the cloud agents product. So we

9:15

needed to build infrastructure to

9:16

support all of these VMs and

9:19

orchestrate between them. So, you know,

9:21

we have many different clusters,

9:23

hundreds of thousands of VMs here and

9:25

you can see behind me one of the

9:26

internal dashboards we built with

9:28

Composer, actually, to visualize all of

9:30

the different VMs in the fleet.

9:33

So why spend all this time trying to

9:36

match the environment to be as close as

9:38

possible to Cursor production? I've kind

9:40

of mentioned that a few times. We could

9:42

mock it, we could simulate it out. Um,

9:44

but one of the really nice benefits is

9:46

we get to give the model specific

9:48

tools that we think are very valuable

9:50

inside of the agent. So, one of those is

9:53

that we've trained our own embedding

9:54

model that allows you to do semantic

9:56

search. So when you use Cursor, we go

9:58

and index your codebase and then it

10:00

allows the agent to make

10:03

natural language queries to find files

10:05

that it might want to edit. And we did

10:08

some research on this recently. We found

10:10

that semantic search not only helped

10:12

basically every single model inside of

10:13

the Cursor agent harness, but it was

10:15

particularly helpful with Composer,

10:18

which kind of makes sense when you think

10:19

about it. We trained Composer in

10:21

the exact same environment that we're

10:23

using at inference time. And so the

10:25

model kind of becomes a power user of

10:27

this tool which is really effective.
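
The semantic search idea described above (index the codebase once, then rank files against a natural language query) can be sketched as follows. This is not Cursor's embedding model or index: `embed` is a toy bag-of-words stand-in for a learned embedding, and the file paths and contents are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(files):
    """Embed every file once, up front (the indexing step)."""
    return {path: embed(text) for path, text in files.items()}


def search(index, query, top_k=2):
    """Rank indexed files by similarity to a natural language query."""
    q = embed(query)
    ranked = sorted(index, key=lambda p: cosine(q, index[p]), reverse=True)
    return ranked[:top_k]


# Hypothetical codebase summaries, keyed by path.
files = {
    "auth/login.py": "validate user password and create login session",
    "billing/invoice.py": "compute invoice total with tax and discount",
    "utils/strings.py": "helper functions to trim and pad strings",
}
index = build_index(files)
results = search(index, "where do we check the user password", top_k=1)
print(results)
```

A real embedding model matches on meaning rather than shared words, which is the property that makes the tool useful to an agent; the word-overlap vector here only keeps the sketch self-contained.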

10:30

So let's talk about how the release

10:32

has been going and kind of where we're

10:34

going next. As we were doing the

10:37

training process we kind of knew that RL

10:40

was working when we were able to

10:42

continuously improve the model and start

10:44

to see more and more improvements after

10:46

more and more rollouts. So we started

10:48

at about the same performance as

10:50

the best open model and then as we

10:52

trained and kind of threw more compute

10:54

at it the performance continued to

10:56

increase, to a point today where

10:57

we're close to the frontier in terms of

10:59

kind of the best coding agents that are

11:01

available and personally I think this is

11:03

a great sign just for being able to take

11:05

and scale RL and apply it to these very

11:08

hard specialized tasks like in our

11:10

example coding but it could be applied

11:12

to other domains as well. RL also

11:15

allowed us to kind of change properties

11:18

of the model in a way that was very

11:20

useful for the Cursor product. We wanted

11:22

the model to be fast both at

11:24

generating tokens and in the end-to-end

11:26

experience of getting a result that's

11:28

helpful. So for example, instead of

11:30

reading files one by one, you can read

11:32

10 files in parallel with tool calling.

11:35

And as you saw in the demo earlier, it

11:37

makes Composer feel much faster when you

11:39

have that. And we think this is kind of

11:40

just the start. There's a lot more we

11:42

can do in this area to speed up the

11:43

model. And the second one is that the

11:45

model learned how to behave better as an

11:48

agent. So, in the beginning, the model

11:50

was making too many edits.

11:52

Sometimes the edits were made

11:54

unnecessarily, but as we trained more

11:56

and more, the model actually got

11:58

surprisingly better at learning to

11:59

search and read files more. So, it would

12:02

go and find the right thing before it

12:03

tried to make edits. Overall, just

12:05

being, you know, a bit more effective.
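
The parallel tool calling described above (reading 10 files at once instead of one by one) can be sketched with `asyncio`. This is an illustration, not the Cursor agent's tool interface: `read_file` is a hypothetical tool whose `asyncio.sleep` stands in for I/O latency, and the file paths are made up.

```python
import asyncio
import time


async def read_file(path, delay=0.05):
    """Hypothetical read tool; the sleep stands in for I/O latency."""
    await asyncio.sleep(delay)
    return f"contents of {path}"


async def read_serial(paths):
    """Issue one tool call at a time, waiting for each to finish."""
    return [await read_file(p) for p in paths]


async def read_parallel(paths):
    """Issue all tool calls at once and wait for them together."""
    return await asyncio.gather(*(read_file(p) for p in paths))


def timed(coro):
    """Run a coroutine to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


paths = [f"src/file_{i}.py" for i in range(10)]
serial_result, serial_s = timed(read_serial(paths))
parallel_result, parallel_s = timed(read_parallel(paths))
print(f"serial: {serial_s:.2f}s, parallel: {parallel_s:.2f}s")
```

Both versions return the same contents in the same order; the parallel one takes roughly the time of the slowest single call rather than the sum of all of them.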

12:08

So, we released Composer last month in

12:10

Cursor 2.0, and so far it seems like

12:13

people like it. Has anyone here

12:15

tried the model by chance? Okay, that's

12:18

pretty great. That's more than I

12:18

expected. So, that's great to hear. I

12:20

think from my perspective using this

12:22

model and using coding agents for some

12:24

time, I kind of describe this problem as

12:26

being like airplane Wi-Fi. So, when you're on

12:29

airplane Wi-Fi, it works, but it's

12:32

kind of frustrating. You really want to

12:34

do whatever you're trying to do, but

12:35

it's just a little slow, almost to

12:36

where sometimes you wish that you just

12:38

didn't have Wi-Fi at all. And I think

12:40

for some of us who adopted coding agents

12:42

very early, it kind of feels like

12:44

airplane Wi-Fi sometimes, because if it's

12:46

taking 10 or 20 minutes, you're in this

12:48

weird, I think swyx called it, semi-async

12:51

valley of death where you either want

12:52

something that's really fast or you want

12:54

the most powerful, most intelligent model

12:56

that can run for, you know, a

12:58

significantly long amount of time, maybe

13:00

in the background, maybe, you know, 30

13:01

minutes, or days. And I think when you're

13:04

stuck in the middle, that's very,

13:05

very painful. So for me, and I

13:08

think for other people, Composer has brought a lot of

13:10

joy back to coding with agents. It feels

13:12

more like when you were writing code by

13:14

hand, where you're very in the loop, very

13:16

synchronous. So I'm excited to see more

13:18

people exploring this space as well. For

13:20

me, daily, I'm writing a lot of plans

13:23

with the latest model,

13:25

the highest frontier. So GPT-5.1

13:27

Codex is really great for plans.

13:29

And then I'm using Composer to actually

13:31

take that plan, kind of like what Dex

13:32

talked about like take the context

13:34

engineering work and then actually go

13:36

and build the thing with it. So, a few

13:39

reflections from our research and

13:41

product teams on building Composer. The

13:44

first is that RL can work surprisingly

13:46

well for training very specific models,

13:50

when you give them high-quality

13:52

data and a decent amount of compute. You

13:55

know, at Cursor we're not trying to build

13:57

general intelligence. We're not trying

13:58

to build AGI. We're trying to build very

14:01

good coding models, and RL has worked

14:04

surprisingly well for that. The second

14:06

one is how much AI tools like

14:09

Cursor (it doesn't have to be Cursor, but

14:11

tools like Cursor) really help speed up

14:13

research and development. You know, of

14:15

course, our entire team uses Cursor to

14:17

help them write code and debug code more

14:20

efficiently, but that speedup, that

14:22

increase really compounds across all of

14:24

our engineering efforts. So we're able

14:26

to try more ideas, ship product faster,

14:29

try new research. So it's been

14:31

really, really helpful there. And the

14:32

last one, which is personally

14:34

pretty interesting for me, is

14:36

how much of

14:38

the ML work and the training process was

14:40

actually also an infrastructure problem.

14:42

They were very correlated. And going

14:45

back to my time at Vercel, we saw a

14:47

very similar thing where a lot of the

14:49

magic moments that you can have in

14:51

working in frameworks in the JavaScript

14:52

or Python space, you also need to think

14:54

a little bit about the infrastructure of

14:56

where they're actually deployed. So

14:57

these things are more related than

14:59

people might think. So those are some of

15:01

our reflections. Sounds like some of

15:03

you have tried it out. If this is

15:04

something that you're interested in and

15:05

working on, we're hiring pretty much

15:07

across the board at Cursor right now. We

15:09

just opened up an office in New York if

15:11

you're based here in New York, and we'd

15:12

love to talk to you about building the

15:13

best coding models in the world. Thank

15:15

you.

15:18

[Music]

Summary

This talk explores the development of Cursor Composer, an AI agent model designed by the engineering team at Cursor for efficient real-world software engineering. The speaker discusses the challenges of creating a model that is both fast and smart, highlighting the use of reinforcement learning (RL) and infrastructure optimizations. Key topics include enabling parallel tool execution, utilizing custom kernels for faster training on Nvidia hardware, and the importance of creating a training environment that closely mirrors production. The talk also emphasizes how integrating coding agents into the development workflow significantly improves productivity, effectively solving the 'semi-async' latency issues often found in early AI coding tools.