How we improved Claude Code and Cline with Prompt Learning – Aparna Dhinakaran, Arize
[music]
Hi everyone, thanks so much for coming. Today I'm excited to talk a little bit about prompt learning and how to use it with evals. If you spend a lot of time thinking about the frontier coding models, there's so much attention on the models themselves. What's not so obvious is how much time actually goes into the system prompts for the teams building these coding agents. Here's a look: this is a tweet that went viral about Claude's system prompt after it was leaked. I'm sure they've changed it since then, but you can see Claude, Cursor, Cline, and just the length of the system prompt for each one of them. What's not as obvious is that these aren't static. They are iterated on repeatedly, and they're such an important piece of context for making these coding agents the most successful agents out there.
It's not just us talking about it. Karpathy talks about this a lot. He posted a viral tweet about this paradigm of iterating on prompts, which he's coined system prompt learning. What he said is that it almost feels like human learning, because the model takes back English feedback and uses it to iterate on what it should do differently the next time. He wrote that it's almost like the movie Memento, where the main character forgets what he learns, so he writes it down and then uses those notes to get through his next day. That's a little bit of the concept behind system prompt learning.
What we wanted to do was show you a little bit of how that works and then put it to the test on two of the most popular coding agents today: Claude Code and Cline.

First off, how does prompt learning actually work? For those of you familiar with RL, let's do a little analogy comparing how RL works versus system prompt learning. For RL, take the analogy of a student trying to improve their exam scores. They take an exam, somebody grades it, and they get a scalar reward: a 70%, an 80%, a 90%. Then they have to figure out, almost blindly, from that score alone, how to improve on the next exam. RL works, don't get me wrong, and it's amazing in many domains, but it can be a long path to the right solution. Some of the things we've noticed are that it can be sample-inefficient: it takes a lot of data to get what you want. It's time-intensive, it's data-hungry, you need a whole data science team to do it, and it might just be overkill for teams trying to build agents, because LLMs are already so good. So if you're a team that's actually trying to build an agent, prompt learning might be a slightly more interesting paradigm for you.

In this scenario it's the same analogy: a student takes an exam and gets a score, except what comes back isn't just the score. They got a 70, they got an 80, but they also get back English feedback. Why did they get this answer right? What did they mess up? Which concepts did they miss, and what do they need to go study? They then use that information to prepare for the next exam and get a better score. This is the concept we applied to coding agents, and we ran this test on both Claude Code and Cline.
Both of these, as you know, start off with some kind of system prompt; for Claude Code, this is a snippet of it. Both also come with something you can append rules to: Cline has its rules file, Claude Code has the CLAUDE.md file, and it starts off empty. You can go in and add whatever is important for your repo.

So what we did was benchmark both Cline and Claude Code on SWE-bench. I'm going to run through this entire example on SWE-bench, but we also ran the whole thing on BBH and a ton of other software engineering datasets. You can see here, with vanilla Cline and vanilla Claude Code, nothing added to CLAUDE.md or the Cline rules, Cline on Claude Sonnet 4.5 resolved about 30% of the GitHub issues, and Claude Code resolved about 40%. We took this as our starting benchmark, and the thesis was: could we use prompt learning to improve the system prompt, and would the new system prompt actually give us a better score on these benchmarks? We didn't do any fine-tuning, we didn't change the models, anything like that. It was focused purely on the system prompt.
Here's the process we went through. We took the coding agent and had it write some code. We ran the unit tests, and then we passed all of that through to a model doing LLM-as-a-judge evals; I'll show you what that looks like. The LLM-as-a-judge eval gave back why it failed: did it fail because of this, and what were common scenarios it didn't do well on? We then used those evals in a meta prompt to come back with the system prompt rules we were going to append.

Let's talk through the process step by step.
First, we had the SWE-bench dataset; in this run it's just 150 examples, and we did this for both Cline and Claude Code. We took the original prompt, which had no rules, gave the agent the software engineering problem, it generated a patch to solve it, and we ran the generated solution through the task's unit tests, roughly as sketched below.
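To make that first step concrete, here is a minimal Python sketch. The helpers run_agent and run_unit_tests are hypothetical stand-ins for invoking Claude Code or Cline under a given system prompt plus rules and for running the SWE-bench test harness; they are not APIs from the talk.

from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    problem: str        # the SWE-bench issue / problem statement
    patch: str          # the patch the coding agent produced
    tests_passed: bool  # outcome of the task's unit tests

def run_agent(system_prompt: str, rules: str, problem: str) -> str:
    """Hypothetical stand-in: invoke Claude Code or Cline with the system
    prompt plus appended rules (CLAUDE.md / Cline rules) on this problem
    and return the generated patch."""
    raise NotImplementedError

def run_unit_tests(task_id: str, patch: str) -> bool:
    """Hypothetical stand-in: apply the patch to the task's repo and run
    its unit tests, returning pass/fail."""
    raise NotImplementedError

def collect_results(tasks, system_prompt: str, rules: str = "") -> list:
    """Run the agent over the benchmark tasks (e.g., 150 SWE-bench Lite
    examples) under one fixed system prompt + rules configuration."""
    results = []
    for task in tasks:
        patch = run_agent(system_prompt, rules, task["problem"])
        passed = run_unit_tests(task["id"], patch)
        results.append(TaskResult(task["id"], task["problem"], patch, passed))
    return results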
Then, whatever the unit tests came back with, right or wrong, we passed it into an LLM-as-a-judge eval. This is the most important part, because this is what generated the explanation for us. We passed in the problem statement, what the coding agent's solution was, the unit tests, and the actual solution. What you're looking at in the center here is the LLM-as-a-judge eval. Eval engineering is a whole concept we spend a lot of time on, and writing really good evals is how you get the best insight into what you could do to improve your agents. In this scenario we wrote a good LLM-as-a-judge eval prompt. It output whether the attempt passed or failed, and then, and this is the key part, we asked for an explanation: why did it actually mess up? For specific libraries in the SWE-bench Lite tests it was parsing errors, or it was failing to handle something; there are all sorts of different categories of errors. We went through and looked at the explanation of what went wrong in each scenario. The shape of that judge call looks something like the sketch below.
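This is not the actual eval prompt from the talk, just a hedged sketch of the shape described: the judge sees the problem, the agent's patch, the test outcome, and the solution, and returns a pass/fail label plus an English explanation. The llm argument is a hypothetical callable wrapping whatever judge model you use.

import json

JUDGE_PROMPT = """You are reviewing a coding agent's attempt at a software engineering task.

Problem statement:
{problem}

Agent's patch:
{patch}

Unit test outcome:
{test_outcome}

Reference solution:
{reference}

Return JSON with two keys:
  "label": "pass" or "fail"
  "explanation": why the patch succeeded or failed, which library behaviors
  or concepts were misunderstood, and what the agent should do differently.
"""

def judge(llm, result, reference_solution: str) -> dict:
    """`llm` is a hypothetical callable: prompt string in, completion text out."""
    prompt = JUDGE_PROMPT.format(
        problem=result.problem,
        patch=result.patch,
        test_outcome="passed" if result.tests_passed else "failed",
        reference=reference_solution,
    )
    return json.loads(llm(prompt))  # e.g. {"label": "fail", "explanation": "..."}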
We then passed everything into a huge meta prompt; this is what actually helps us iterate on the system prompt. We passed in the original Claude Code or Cline system prompt, we passed in the original rules, which for us started off empty, and then we passed in the input, the LLM-as-a-judge eval, and the explanation from that eval. A sketch of that meta-prompt step follows.
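Again a hedged sketch rather than the real Arize meta prompt: it combines the original system prompt, the current rules, and the judge's explanations, and asks the model for an updated rules block to append. It reuses the TaskResult objects and judge output from the sketches above.

META_PROMPT = """You maintain the rules file appended to a coding agent's system prompt
(for example CLAUDE.md or a Cline rules file).

Original system prompt:
{system_prompt}

Current rules (may be empty):
{rules}

For each task below you are given the judge's verdict and its explanation
of what went wrong or right:
{feedback}

Write an updated rules section: concrete, generalizable instructions that
would help the agent avoid these mistakes on future tasks. Return only the
new rules text.
"""

def build_feedback(results, judgments) -> str:
    """Flatten the judge's verdicts and explanations into the meta prompt."""
    lines = []
    for result, judgment in zip(results, judgments):
        lines.append(
            f"- Task {result.task_id}: {judgment['label']}\n"
            f"  Explanation: {judgment['explanation']}"
        )
    return "\n".join(lines)

def propose_rules(llm, system_prompt: str, rules: str, results, judgments) -> str:
    prompt = META_PROMPT.format(
        system_prompt=system_prompt,
        rules=rules or "(empty)",
        feedback=build_feedback(results, judgments),
    )
    return llm(prompt)  # candidate rules to append to CLAUDE.md / Cline rules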
With all of that passed into the meta prompt, we then did a diff comparing the old world and the new world. Just to remember, the old world had the original Claude Code system prompt with no rules added or appended. The new world had this entire set of generated rules: what to avoid, and what the agent had essentially learned from all the mistakes it had made. End to end, one round looks roughly like the loop sketched below.
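Putting the pieces together, one round of this loop, under the same hypothetical helpers as above, might look like the following; the re-run with the learned rules is what gets compared against the baseline.

def prompt_learning_round(llm, tasks, references, system_prompt, rules=""):
    """One round of the loop: run the benchmark, judge every attempt, then
    ask the meta prompt for an updated rules block."""
    results = collect_results(tasks, system_prompt, rules)
    judgments = [judge(llm, r, references[r.task_id]) for r in results]
    new_rules = propose_rules(llm, system_prompt, rules, results, judgments)
    return results, new_rules

def resolved_rate(results) -> float:
    """Fraction of benchmark tasks whose unit tests passed."""
    return sum(r.tests_passed for r in results) / len(results)

# Usage sketch: compare the "old world" (empty rules) with the "new world"
# (rules learned from the judge's explanations), as described in the talk.
# baseline, learned_rules = prompt_learning_round(llm, tasks, references, SYSTEM_PROMPT)
# rerun = collect_results(tasks, SYSTEM_PROMPT, learned_rules)
# print(resolved_rate(baseline), resolved_rate(rerun))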
We then ran this again on the entire SWE-bench Lite, and what we saw was that on 150 examples we were able to get Claude Code up by 5% more GitHub issues resolved, and Cline up by about 15%. The key thing is that this was literally 150 examples of training data, used on the most powerful coding agents out there. Just think about the impact that could have for your own agents.
Many of you in this room might be thinking: okay, prompt learning is cool, but how does it compare to GEPA? If you're familiar with DSPy, you've probably seen it; I don't know if it's pronounced "GEH-pa" or "JEE-pa," I've heard both. You might be asking how this is different. In case you aren't familiar, GEPA is a prompt optimizer from DSPy that is essentially very similar to what we're talking about: taking English feedback and using that feedback inside the actual prompt. We ran a side-by-side benchmark comparing our prompt learning against GEPA, and what we saw was that GEPA required many, many loops and rollouts, compared to a fraction of that with our approach. The underlying idea of using English feedback is the same, but the key difference was that we spent a lot of time developing and iterating on the evals, and those eval prompts really mattered for giving really good explanations back to the agent.
The evals were super critical for us to get this to work. If you're curious about learning more and reading more about what we do, check out our blog; we write a lot about evals and prompt optimization. And we're actively hiring, so come check us out. Awesome.
The presentation discusses prompt learning and its application with evals for coding agents. It highlights the critical, iterative nature of system prompts for coding agents and introduces "system prompt learning" as a paradigm where models improve by receiving English feedback on their failures, contrasting this with the scalar rewards of reinforcement learning (RL). The method has coding agents generate solutions, runs unit tests, and then uses an LLM as a judge to provide detailed English explanations for errors, which are used to refine the system prompt rules. This approach improved the performance of Claude Code and Cline on the SWE-bench dataset and proved more efficient than other prompt optimization methods like GEPA, largely due to the emphasis on well-crafted evaluation prompts.