
How we improved Claude Code and Cline with Prompt Learning – Aparna Dhinakaran, Arize

Transcript

0:13

[music]

0:20

Hi everyone. Thanks so much for coming.

0:23

Well, today I'm excited. We're going to talk a little bit about prompt learning and how to use it with evals. If any of you are spending a lot of time thinking about the frontier coding models, there's so much attention on them. But what's not so obvious is how much time those building these coding agents actually spend on the system prompts. Here's a look: this is a tweet that went viral about the whole system prompt of Claude that was leaked. I'm sure they've changed it since then. But you can see Claude, there's Cursor, there's Cline, and just the length of the actual system prompt for each one of these. And I think what's not as obvious is that these aren't just static. They are repeatedly iterated on, and they're such an important piece of context in making these coding agents the most successful agents out there.

1:20

It's not just us talking about it. Karpathy talks about it a lot, and this was a viral tweet that he posted: there's this paradigm around iterating on these prompts that he's coined system prompt learning. What he said is that it almost feels like human learning, because humans take back English feedback and use it to work out what they should do differently the next time. He wrote that it's almost like that movie Memento, where the guy forgets what he learns, so he starts writing it down and then uses those notes to get through his next day. So this is a little bit of the concept behind system prompt learning.

2:04

What we wanted to do was show you a little bit of how that works and then put it to the test on two of the most popular coding agents today, Claude Code and Cline.

2:13

So first off, how does prompt learning actually work? For those of you who are familiar with RL, I thought we'd do a little analogy and compare how RL works versus system prompt learning. For RL, take the analogy of a student who's trying to improve their exam scores. They take an exam, somebody grades it, and you have a scalar reward: they got a 70%, an 80%, a 90%. Then they have to figure out, almost blindly, with just that score, how to improve on the next exam. And I think this is actually one of the flaws. RL works, don't get me wrong, amazingly well in so many contexts and domains, but it can be a long path to figuring out what the right solution is. Some of the things we've noticed are that it can be sample inefficient; it takes a lot of data to get what you want. It's time intensive, it's data hungry, and you need a whole data science team to do it. It just might be overkill for teams who are trying to build agents, because LLMs are already so good. So if you're a team that's actually trying to build an agent, prompt learning might be a slightly more interesting paradigm for you.

3:27

In this scenario, same analogy: you have a student taking an exam, and there's some exam score, except in this case what gets output isn't just the score, that they got a 70 or an 80. You also get back some kind of English feedback. Why did they get this answer right? What did they mess up on? Here are the concepts they missed; what do they need to go study? And then they use this information to prepare for what to do next time, to get a better score.

3:56

This is basically the concept we applied to coding agents, and we ran this kind of test on both Claude Code and Cline. Both of these, as you know, start off with some kind of system prompt; in Claude Code, this is a snippet of it. And they both come with something that you can append rules to. Cline has rules, Claude Code has the CLAUDE.md file, and it starts off empty. You can go in and add whatever is important for your repo.
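
As a small illustration of that step, appending learned rules to those files can be as simple as the sketch below. The file locations used here (a CLAUDE.md at the repo root for Claude Code, a .clinerules file for Cline) are common conventions but should be treated as assumptions; check each tool's docs for where it actually reads rules from.

```python
from pathlib import Path

def append_rules(repo_root: str, rules: str) -> None:
    """Append learned rules to the files each agent reads for extra context.

    CLAUDE.md (Claude Code) and .clinerules (Cline) are the commonly cited
    locations; verify against each tool's current documentation.
    """
    for name in ("CLAUDE.md", ".clinerules"):
        path = Path(repo_root) / name
        existing = path.read_text() if path.exists() else ""
        path.write_text(existing.rstrip() + "\n\n" + rules.strip() + "\n")
```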

4:25

So what we did was benchmark both Cline and Claude Code on SWE-bench. I'm going to run through this entire example on SWE-bench, but we also ran the whole thing on BBH and a ton of other software engineering datasets. You can see here, on vanilla Cline and vanilla Claude Code, with nothing added to the CLAUDE.md or the Cline rules, they had, I think with Cline on Claude Sonnet 4.5, about 30% of the GitHub issues actually resolved, and with Claude Code about 40% of the GitHub issues resolved.

5:01

So we took this as our starting benchmark, and the thesis was: could we actually use prompt learning to improve the system prompt, and would the new system prompt give us a better score on these benchmarks? We didn't do any fine-tuning, and we didn't change the models or anything like that. It was focused purely on the system prompt.

5:24

This is the process that we went through. We took the coding agent and had it write some code. We ran unit tests, and then we passed the results through to a model doing the LLM-as-a-judge evals. I'll show you what that looks like, but the LLM-as-a-judge eval actually gave back why it failed: did it fail because of this? Can you give some examples of common scenarios it didn't do well on? We then used those evals, fed them back into a meta prompt, and came back with the system prompt rules that we were going to append.
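
To make that loop concrete, here is a minimal sketch of how one prompt learning iteration could be wired together. The helper names (run_agent, run_unit_tests, judge_with_llm, rewrite_rules_with_meta_prompt) and the single-pass structure are illustrative assumptions, not the actual implementation described in the talk.

```python
# Sketch of one prompt-learning iteration over a batch of benchmark examples,
# assuming hypothetical helpers for the agent run, the unit tests, the
# LLM-as-a-judge eval, and the meta prompt rewrite.

def prompt_learning_iteration(examples, system_prompt, rules):
    feedback = []
    for ex in examples:
        # 1. The coding agent attempts the task with the current prompt + rules.
        patch = run_agent(system_prompt, rules, ex["problem_statement"])

        # 2. Run the repo's unit tests against the generated patch.
        test_result = run_unit_tests(ex["repo"], patch, ex["tests"])

        # 3. An LLM-as-a-judge explains, in English, why the attempt passed or failed.
        verdict = judge_with_llm(
            problem=ex["problem_statement"],
            patch=patch,
            test_log=test_result["log"],
        )
        feedback.append({
            "input": ex["problem_statement"],
            "label": verdict["label"],
            "explanation": verdict["explanation"],
        })

    # 4. A meta prompt turns the collected English feedback into updated rules
    #    that get appended to CLAUDE.md or the Cline rules file.
    return rewrite_rules_with_meta_prompt(system_prompt, rules, feedback)
```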

6:03

So let's talk through the process. First we had the SWE-bench dataset; SWE-bench in this scenario is just 150 examples. We did this for both Cline and Claude Code, where we took the original prompt, which had no rules, gave it the software engineering problem, it generated some kind of patch to solve it, and then we ran the generated solution through the unit tests.
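
As a rough illustration of that unit test step, the sketch below applies a generated patch to a checked-out repo and runs the instance's tests with pytest. The pytest-based harness and the git-apply step are assumptions for illustration; SWE-bench ships its own evaluation harness, and this is not that code.

```python
import subprocess
from pathlib import Path

def run_unit_tests(repo_dir: str, patch: str, test_files: list[str]) -> dict:
    """Apply a model-generated patch and run the relevant tests.

    Hypothetical stand-in for a SWE-bench-style evaluation step; the real
    harness pins per-instance environments and is more involved than this.
    """
    repo = Path(repo_dir)

    # Apply the unified diff produced by the coding agent.
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=repo, input=patch,
        text=True, capture_output=True,
    )
    if apply.returncode != 0:
        return {"passed": False, "stage": "patch", "log": apply.stderr}

    # Run the tests that the benchmark instance cares about.
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_files],
        cwd=repo, text=True, capture_output=True,
    )
    return {"passed": tests.returncode == 0, "stage": "tests",
            "log": tests.stdout + tests.stderr}
```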

6:27

Then, whatever the unit tests came back with, whether it was right or wrong, we passed that into an LLM-as-a-judge eval. And this is kind of the most important part, because this is what generated the explanation for us. We passed in the problem statement, what the coding agent's solution was, the unit tests, and the actual solution it came up with. What you're looking at in the center here is the LLM-as-a-judge eval. Eval engineering is a whole concept that we spend a lot of time on, and writing really good evals is, I think, how you get the best insight into what you could do to improve your agents. So in this scenario, we wrote a good LLM-as-a-judge eval prompt. It output whether the attempt failed or passed, and then, and this is the key part, we actually asked for an explanation: why did it mess up? For specific libraries in the SWE-bench Lite test it was parsing errors, or it was not handling certain cases; there are all sorts of different categories of errors. We went through and looked at the explanation of what went wrong in each scenario.
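
To give a feel for the shape of such an eval, here is a minimal sketch of an LLM-as-a-judge prompt that returns a pass/fail label plus an English explanation. The exact wording, the JSON output format, and the call_llm helper are assumptions made for illustration; the actual eval prompt from the talk is not reproduced here.

```python
import json

JUDGE_PROMPT = """You are evaluating a coding agent's attempt at a GitHub issue.

Problem statement:
{problem}

Agent's patch:
{patch}

Unit test results:
{test_log}

First decide whether the attempt PASSED or FAILED. Then explain, in a few
sentences, why: which concept, library behavior, or edge case the agent
handled well or missed, and what it should do differently next time.

Respond as JSON: {{"label": "pass or fail", "explanation": "..."}}"""


def judge_with_llm(problem: str, patch: str, test_log: str) -> dict:
    # call_llm is a hypothetical helper that sends the prompt to a model
    # and returns its raw text completion.
    raw = call_llm(JUDGE_PROMPT.format(problem=problem, patch=patch,
                                       test_log=test_log))
    return json.loads(raw)
```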

7:43

We then passed all of that into a huge meta prompt. This is what actually helps us iterate on our system prompt. We passed in the original Claude Code or Cline system prompt, we passed in the original rules, which for us started off empty, and then we passed in, for each example: here was the input, here was the LLM-as-a-judge eval, and here was the actual explanation from that eval.
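
Here is a correspondingly minimal sketch of that meta prompt step: it packs the base system prompt, the current rules, and the per-example feedback into one prompt and asks the model for an updated rules block. The template wording and the call_llm helper are again illustrative assumptions, not the actual meta prompt used in the experiments.

```python
META_PROMPT = """You maintain the rules appended to a coding agent's system prompt.

Base system prompt (context only, do not modify):
{system_prompt}

Current rules:
{rules}

Below are tasks the agent attempted, the judge's verdict, and an English
explanation of what went wrong or right:
{feedback}

Rewrite the rules so the agent avoids the recurring mistakes above. Keep them
concise, general, and tied to observed failure modes. Return only the new rules."""


def rewrite_rules_with_meta_prompt(system_prompt: str, rules: str,
                                   feedback: list[dict]) -> str:
    feedback_text = "\n\n".join(
        f"Task: {f['input']}\nVerdict: {f['label']}\nExplanation: {f['explanation']}"
        for f in feedback
    )
    # call_llm is the same hypothetical completion helper as in the judge sketch.
    return call_llm(META_PROMPT.format(system_prompt=system_prompt,
                                       rules=rules or "(empty)",
                                       feedback=feedback_text))
```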

8:04

We passed that all into the meta prompt and then did a diff comparing the old world and the new. Just to remind you, the old world had the original Claude system prompt with no rules added or appended to it, and the new world had this entire set of generated rules about what to avoid, essentially what it had learned from all the mistakes it had actually made.

8:31

Then we ran this on the entire SWE-bench Lite set again. What we saw was that, on 150 examples, we were able to get Claude Code up by 5% more GitHub issues resolved, and Cline up by about 15%. And I think the key thing is that this was literally 150 examples of training data, used on the most powerful coding agents out there. So just think about the impact that could have for your agents.

9:05

Many of you in this room might be thinking: okay, prompt learning is cool, but how does it compare to GEPA? If you're familiar with DSPy you've probably seen it; I don't know if it's pronounced "GEH-pa" or "JEE-pa," I've heard both. You might be asking how this is different. GEPA, just in case you aren't familiar, is a prompt optimizer from DSPy that is essentially very similar to what we're talking about: taking English feedback and using that English feedback inside of the actual prompt. What we did was run a side-by-side benchmark against GEPA, where we compared our prompt learning against it. What we saw was that GEPA required many, many loops and rollouts compared to a fraction of that for our approach. And I think the key difference, given that the underlying approach of using English feedback is the same, was that we spent a lot of time developing and iterating on the evals, and the eval prompts really mattered for making sure you gave really good explanations back to the agent.

10:15

And so evals: this was super critical for us to be able to get this to work. If you're curious about learning more and reading more about what we do, check out our blog. We write a lot about evals and prompt optimization, and we're actively hiring, so come check us out. Awesome.

[music]

Interactive Summary

The presentation discusses prompt learning and its application with evals for coding agents. It highlights the critical, iterative nature of system prompts for coding agents and introduces "system prompt learning" as a paradigm where models improve by receiving English feedback on their failures, contrasting this with the scalar rewards of reinforcement learning (RL). The method has coding agents generate solutions, runs unit tests, and then uses an LLM as a judge to provide detailed English explanations for errors, which are in turn used to refine the system prompt rules. This approach measurably improved the performance of Claude Code and Cline on the SWE-bench dataset and proved more efficient than other prompt optimization methods such as GEPA, largely due to the emphasis on well-crafted evaluation prompts.
