How Claude Code Works - Jared Zoneraich, PromptLayer
[music]
So, welcome to the last workshop. You made it. Congrats. Out of like 800 people, you're the last ones standing, the very, very dedicated engineers. So this one's a weird one. I got in trouble with Anthropic on this one, obviously because of the title. I actually gave him the title and I was like, do you want to change it? He was like, no, just roll with it. It's kind of funny. So yeah, this is not officially endorsed by Anthropic, but we're hackers, right? And Jared is super dedicated. The other thing I really enjoy is featuring notable New York AI people, so don't take this as the only thing that Jared does. He has a whole startup that you should definitely ask him about. But, you know, I'm just really excited to feature more content for local people. So yeah, Jared, take it away.
>> Thank you very much. And what an amazing conference. Very sad we're ending it, but hopefully it'll be a good ending here. My name is Jared. This will be a talk on how Claude Code works. Again, not affiliated with Anthropic. They don't pay me. I would take money, but they don't. But we're going to talk about a few other coding agents as well. And the high-level goal I'll go into: me personally, I'm a big user of all the coding agents, as is everyone here. They kind of exploded recently, and as a developer I was curious what changed, what finally made coding agents good. So let's get started. I'll start with a bit about me. I'm Jared.
You can find me, I'm Jared Z on X, on Twitter, whatever. I'm building the workbench for AI engineering. My company is called PromptLayer. We're based in New York. You can kind of see our office here; it's a little building, blocked by a few of the other buildings. We're a small team. We launched the product 3 years ago, so long for AI but small for everything else. Our core thesis is that we believe in rigorous prompt engineering and rigorous agent development, and we believe the product team should be involved alongside the engineering team. We believe if you're building AI lawyers, you should have lawyers involved as well as engineers. So that's what we do, processing millions of LLM requests a day. A lot of the insights in this talk come from conversations we have with our customers on how to build coding agents and things like that. Also, feel free throughout the talk: we can make this casual, so if there's anything I say that you have a question about, feel free to just throw it in.
I spend a lot of my time dogfooding the product. The job of a founder is kind of weird these days, because it's half kicking off agents and half using my own product to build agents. It feels weird, but it's kind of fun. The last thing I'll add here is that I'm a big enthusiast. We literally rebuilt our engineering org around Claude Code. I think the hard part about building a platform is that you have to deal with all these edge cases: oh, we're uploading data sets here and it doesn't work. You can die a death by a thousand cuts. So we made a rule for our engineering organization: if you can complete something in less than an hour using Claude Code, just do it. Don't prioritize it. We're a small team on purpose, but it's helped us a lot, and I think it's really taken us to the next level. So I'm a big fan. Let's dive into how these things work.
So, as I was saying, this is the goal of this talk. First, why have these things exploded? What was the innovation, the invention, that made coding agents finally work? If you've been around this field for a little bit, you know that a lot of these autonomous coding agents sucked at the beginning, and we all tried to use them. But it's night and day now. We'll dive into the internals, and lastly, everything in this talk is oriented around how you build your own agents and how you use this to do AI engineering yourself. So let's talk about history for a second. How did we get here?
Everybody knows it started with the workflow of copying and pasting your code back and forth from ChatGPT, and that was great; it was kind of revolutionary when it happened. Step two, when Cursor came out, if we all remember, it was not great software at the beginning. It was just the VS Code fork with the Command-K, and we all loved it. But now we're not doing Command-K anymore. Then we got the Cursor assistant, that little agent back and forth, and then Claude Code. And honestly, in the last few days since I made this slide, maybe there's a new version we could talk about here. At the end I'll talk about what's next. But this is how we got here. And Claude Code is really this headless, new workflow of not even touching code, and for that it has to be really good. So why is it so good? What was the big breakthrough here? Let's try to figure that out. And I'll throw this in one more time: these are all my opinions about what the breakthrough is.
Maybe there are other things, but: simple architecture. A lot of things were simplified in how the agent was designed. And then better models, better models, and better models. I think a lot of the breakthrough is kind of boring, in that it's just Anthropic releasing a better model that works better for these kinds of tool calls. But the simple architecture relates to that, so we can dive into it. This is our little Prompt Wrangler, our company's mascot; we made a lot of graphics for these slides. Basically, give it tools and then get out of the way: that's a one-liner for the architecture today. If you've been building on top of LLMs for a little bit, you know this has not always been true. Tool calls haven't always existed; tool calls are kind of this new abstraction over JSON formatting, if you remember GitHub libraries like Jsonformer in the olden days. But give it tools, get out of the way. The models are built for these things and are being trained to get better at tool calling. The more you want to overoptimize, and every engineer, especially myself, loves to overoptimize: when you first have an idea of how to build the agent, you're going to sit down and say, oh, I'm going to prevent this hallucination with this prompt, and then this prompt, and then this prompt. Don't do that. Just a simple loop, and get out of the way, and delete scaffolding. Less scaffolding, more model is kind of the tagline here. And this is the leaderboard from this week.
Obviously, these models are getting better and better. We could have a whole conversation, and I'm sure there have been many, about whether it's slowing down or plateauing. It doesn't really matter for this talk. We know they're getting better, they're getting better at tool calling, and they're getting better optimized for running autonomously. I think Anthropic calls this the AGI-pilled way to think about it: don't try to overengineer around model flaws today, because a lot of them will just get better and you'll be wasting your time. So here's the philosophy of Claude Code, the way I see it: ignore embeddings, ignore classifiers, ignore pattern matching.
We had this whole RAG thing. Actually, Cursor is bringing back a little bit of RAG in how they're doing it, mixing and matching. But I think the genius with Claude Code is that they scratched all this and said: we don't need all these fancy paradigms to get around how the model's bad. Let's just make a better model and then let it cook. They lean on these tool calls, and on simplifying the tool calls, which is a very important part. Instead of having a workflow where the master prompt can break into three different branches and then four different branches, there are really just a few simple tool calls, including grep instead of RAG. And that's what the model is trained on, so these are very optimized tool-calling models.
This is the Zen of Python, if you're familiar with it, if you do `import this` in Python. I love this philosophy when it comes to building systems, and I think it's really apt for how Claude Code was built. Simple is better than complex, complex is better than complicated, flat is better than nested. This is the whole talk. This is all you need to know about how Claude Code works and why it works: we're going back to engineering principles, such that simple design is better design. I think this is true whether you're building a database schema, but it's also true when you're building these autonomous coding agents. So I'm going to break down the specific parts of this coding agent and why I think they're interesting.
The first is the constitution. A lot of this stuff we take for granted now, even though they only started doing it a few months ago. This is the CLAUDE.md; Codex and others use AGENTS.md. I assume most of you know what it is: it's where you put the instructions for your repo. But the interesting thing about this is it's basically the team saying we don't need to overengineer a system where the model first researches the repo. Cursor 1.0, as you know, made a vector DB locally to understand the repo and did all this research. They're just saying: ah, just put a markdown file. Let the user change stuff when they need to. Let the agent change stuff when it needs to. Very simple, and it kind of goes back to prompt engineering, which I'm a little biased towards because PromptLayer is a prompt engineering platform, but everything's prompt engineering at the end of the day, or context engineering. Everything is: how do you adapt these general-purpose models for your usage? And the simplest answer is the best one here, I think.
So this is the core of the system. It's just a simple master loop, and that's actually kind of revolutionary considering how we used to build agents. Everything in Claude Code, and in all the coding agents today, Codex and the new Cursor and AMP and all that, is just one while loop with tool calls: run the master while loop, call the tools, go back to the master while loop. It's basically four lines. I think they call it nO internally, at least based on my research: while there are tool calls, run the tool, give the tool results to the model, and do it again until there are no tool calls, then ask the user what to do. The first time I used tool calls, it was very shocking to me that the models are so good at knowing when to keep calling the tool and when to fix their mistakes. I think that's one of the most interesting things about LLMs: they're really good at fixing mistakes and being flexible. And going back to the theme, the more you lean on the model to explore and figure it out, the better and more robust your system is going to be as models get better.
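To make this concrete, here's a minimal sketch of that master loop in Python. The client, the message shapes, and the tool registry are my stand-ins, not Anthropic's actual internals, but the control flow is the whole idea:

```python
# Minimal agent-loop sketch. `llm` and the tool registry are hypothetical
# stand-ins for whatever client and tools you use; the loop is the point.
def run_agent(llm, tools: dict, messages: list) -> str:
    while True:
        response = llm.complete(messages)            # one model call per turn
        tool_calls = response.get("tool_calls", [])
        if not tool_calls:                           # no tools requested:
            return response["text"]                  # hand control back to the user
        for call in tool_calls:
            result = tools[call["name"]](**call["args"])   # run the tool
            messages.append({                        # feed the result back in
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(result),
            })
```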
So, these are the core tools in Claude Code today. To be honest, these change every day; they're doing new releases every few days. But these are the ones I found most interesting to talk about. There could be 15 tomorrow, or it could be down to five tomorrow. First of all, Read. Yeah, they could just do a cat, but what's interesting is that Read has token limits. If you've used Claude Code a lot, you've seen that sometimes it'll say this file's too big or something like that. That's why it's worth building this Read tool; here's a sketch of the idea.
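A rough sketch of why Read earns its keep over a plain cat: the tool enforces a budget and fails loudly instead of silently flooding the context. The caps and error strings here are made up for illustration, not Claude Code's real ones:

```python
from pathlib import Path

MAX_LINES = 2000        # illustrative caps, not Claude Code's actual limits
MAX_LINE_CHARS = 2000

def read_tool(path: str, offset: int = 0, limit: int = MAX_LINES) -> str:
    lines = Path(path).read_text(errors="replace").splitlines()
    if offset >= len(lines):
        return f"<error: offset {offset} is past the end of a {len(lines)}-line file>"
    # Truncate pathological lines so one minified file can't eat the context.
    window = [l[:MAX_LINE_CHARS] for l in lines[offset:offset + limit]]
    body = "\n".join(f"{offset + i + 1}\t{l}" for i, l in enumerate(window))
    remaining = len(lines) - (offset + len(window))
    if remaining > 0:
        body += f"\n<truncated: {remaining} more lines; call again with an offset>"
    return body
```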
Grep and glob. This one's very interesting, too, because it goes against a lot of the wisdom at the time of using RAG and vectors. And I'm not saying RAG has no place, by the way. But in these general-purpose agents, grep is good, and grep is how users would do it. And I think that's actually a high-level point here: as I'm talking about these tools, remember these are all human tasks. We're not making up a brand-new tool for the model to use. We're mimicking the human actions, what you and I would do if we were at a terminal trying to fix a problem.

Edit. Edit makes sense. The interesting thing to note about Edit is that it's using diffs; it's not rewriting files most of the time. Way faster, way less context used, but also way fewer issues. If I gave you these slides and asked you to review them, and you had to write out all the slides again with your revisions, versus just crossing things out on the paper, the crossing out is way easier. Diff is kind of a natural way to prevent mistakes.
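A minimal sketch of the diff-style edit pattern: the model supplies an exact snippet to find and its replacement, and the tool refuses ambiguous edits. The signature is my paraphrase of the pattern, not the literal tool:

```python
from pathlib import Path

def edit_tool(path: str, old_string: str, new_string: str) -> str:
    """Replace exactly one occurrence of old_string with new_string."""
    text = Path(path).read_text()
    count = text.count(old_string)
    if count == 0:
        return "<error: old_string not found; read the file first>"
    if count > 1:
        return f"<error: old_string matches {count} places; include more context>"
    Path(path).write_text(text.replace(old_string, new_string, 1))
    return f"Edited {path}"
```

The uniqueness check is also part of why "read before editing" matters: the model needs to have seen the real text to quote it exactly.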
Bash. Bash is the core thing here. I think you could probably get rid of all these tools and only have bash. The first time I saw it, when you run something in Claude Code and Claude Code creates a Python file, runs the Python file, then deletes the Python file: that's the beauty of why this thing works. So bash is the most important.

Web search, web fetch. The interesting thing about these is that they move the work to a cheaper and faster model. So for example, if you're building some sort of agent on your platform and it needs to connect to some list of endpoints, it might be worth bringing that into a kind of sub-tier, as opposed to the master while loop. That's why this is its own tool.

To-dos. We've all seen to-dos; I'll talk about them a little more later, but they keep the model on track: steerability.

And then tasks. Tasks are very interesting. They're context management: how do we run this long process, read this whole file, without cluttering the context? Because the biggest enemy here is that when your context is full, the model gets stupid, for lack of a better word. So basically, bash is all you need. This is the one thing I want to drill down on.
There are two amazing things about bash for coding agents. The first is that it's simple and it does everything; it's very robust. The second, equally important, is that there's so much training data on it, because it's what we all use. It's the same reason models are not as good at Rust or less common programming languages: there are just fewer people writing them.
So it's really the universal adapter. Thousands of tools; you can do anything. This is that Python example I gave. I always find it so cool when it does the Python-script thing, or creates tests (and I always have to tell it not to). All the shell tools are in it. I find myself using Claude Code to spin up local environments where normally I'd have five commands written down in some file somewhere that gets out of date. It's really good at figuring this stuff out and running the stuff you'd want to run. And it specifically lets the model try things.
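A bash tool is barely more than subprocess with a timeout and an output cap, which is sort of the point. A sketch, with illustrative limits:

```python
import subprocess

def bash_tool(command: str, timeout_s: int = 120, max_chars: int = 30_000) -> str:
    """Run a shell command; return exit code plus truncated combined output."""
    try:
        proc = subprocess.run(command, shell=True, capture_output=True,
                              text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return f"<error: command timed out after {timeout_s}s>"
    output = (proc.stdout + proc.stderr)[:max_chars]  # cap it to protect context
    return f"exit={proc.returncode}\n{output}"
```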
The other suggestions here on tool usage: I think there's a bit of system prompt that tells it which tool to use and when to prefer one over another, and this changes a lot, but these are the edge cases and corners you find the model getting stuck in. Reading before editing: they actually make you do that. Using the Grep tool instead of bash: if you look at the tool list, there's a special Grep tool. There could be a lot of reasons for that; I think security is a big one, and sandboxing, but also just that token-limit thing. Running independent operations in parallel, kind of pushing the model to do that more. And then trivial things like quoting paths with spaces. It's just the common stuff. I'm sure they're dogfooding a lot at Anthropic, and when they find something, they're like, all right, we'll throw it in the system prompt.
Okay, so let's talk about to-do lists. Again, a very common thing now, but it was not a common thing before. This is actually, I think, a to-do list from some of my research for this slide deck. The really interesting thing about to-do lists is that they're structured but not structurally enforced. Here are the rules: one task at a time. Mark them completed. This is the kind of stuff you would expect. Keep working on the in-progress one if there are blocks or errors, and break up the tasks into different instructions. But the most interesting thing to me is that it's not enforced deterministically. It's purely prompt-based. It's purely in the system prompt. It's purely because our models are just good at instruction following now. This would not have worked a year ago. This would not have worked two years ago. There are tool descriptions at the top of the system prompt, and we're kind of injecting the to-dos into the system prompt, but it's not enforced in actual code. Maybe there are other agents that take the opposite path; I just found it pretty interesting that this, at least as a user, makes a big difference, and it seems like it was very simple to implement, almost a weekend project someone did, and it worked. I could be wrong about that as well. So yeah, it's literally a function call. The first time you ask something, the reasoning exports this to-do block, and I'll show you the structure on the next slide. There are IDs there. There's some kind of structured schema and determinism, but it's just injected there.
So here's an example of what it could look like. You get a version, you get your ID, a title for the to-do, and then it can actually inject evidence: seemingly arbitrary blobs of data it can use. The IDs are hashes that it can then refer to; the title is something human-readable. This is just another way to structure the data. In the same way that you're going to organize your desk when you work, this is how we're trying to organize the model.
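Reconstructed from the slide, a to-do entry is roughly a record like this. The field names are my paraphrase of the structure, not a verbatim schema:

```python
todo = {
    "version": 1,
    "id": "a3f9c2",                       # hash the model can refer back to
    "title": "Wire Read tool into loop",  # human-readable, shown in the UI
    "status": "in_progress",              # e.g. pending | in_progress | completed
    "evidence": "grep hit in src/tools/read.py:42",  # arbitrary supporting blob
}
```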
So I think these are the four benefits we're getting. We're forcing it to plan. We get to resume after crashes, because Claude Code fails. I think UX is a big part of this: as a user, you know how it's going. It's not just running off in a loop for 40 minutes without any signal to you. So UX is non-negligible; even though UX might not make it a better coding agent, it might make it better for all of us to use. And then there's steerability. So here are two other parts under the hood. The async buffer, which they called H2A: it's the I/O process and how to decouple it from reasoning, and how to manage context so you're not just stuffing everything you see in the terminal back into the model. Again, context is our biggest enemy here; it's going to make the model stupider. So we need to be a little smart about that, about how we do compaction and summarization. Here you see that when it reaches capacity, it drops the middle and summarizes the head and tail. That's the context compressor. What's the limit? 92%, it seems, something like that. And how does it save long-term storage? That's actually another advantage of bash, in my opinion, and of having a sandbox. I'd even make a prediction here that all your ChatGPT windows, all your Claude windows, are going to come with a sandbox in the near future. It's just so much better, because you can store that long-term memory. I do this all the time: I have Claude Code skills for deep research and things like that, and I'm always instructing it to save markdown files, because the shorter the context, the quicker it is and the smarter it is.
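The compaction shape is easy to sketch: once usage crosses a threshold (something like 92%, per the slide), keep the head and tail verbatim and collapse the middle into one summary message. The summarizer callable, the threshold, and the keep counts below are all assumptions:

```python
def compact(messages: list, summarize, used_tokens: int, limit_tokens: int,
            threshold: float = 0.92, keep_head: int = 2, keep_tail: int = 10) -> list:
    """Drop the middle of the transcript, summarizing it into one message."""
    if used_tokens < threshold * limit_tokens or len(messages) <= keep_head + keep_tail:
        return messages                   # enough room (or too short): do nothing
    head = messages[:keep_head]
    middle = messages[keep_head:-keep_tail]
    tail = messages[-keep_tail:]
    summary = {"role": "system",
               "content": "Summary of earlier work: " + summarize(middle)}
    return head + [summary] + tail
```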
So this is what I'm most excited about. We don't need DAGs like this. I'll give you a real example. Some users at PromptLayer build different agents, like customer support agents, and basically everybody was building DAGs like this for the last two, two and a half years. And it was crazy: hundreds of nodes of, okay, if this user wants a refund, route them to this prompt; if they want this, that prompt; and a lot of classifying prompts. The advantage of this is you can kind of guarantee there won't be hallucinations, or guarantee there won't be refunds to people who shouldn't be getting refunds. It also solves the prompt injection problem, because if you're in a prompt that purely classifies as X or Y, injecting doesn't really matter, especially if you throw out the context. Now we've kind of brought back that attack vector, but the major benefit is we don't have to deal with this web of engineering madness. It's 10x easier to develop these things, 10x more maintainable, and it actually works way better, because our models are just good now.
So this is kind of the takeaway: rely on the model. When in doubt, don't try to think through every edge case and every if statement. Just rely on the model to explore and figure it out. Actually, two days ago, or yesterday, sometime this week, I was doing an experiment on our dashboard, trying these browser agents. I wanted to see if adding little titles to all our buttons would help the agent navigate our website automatically. And it actually made it worse, surprisingly. Maybe I could run it again, and maybe I did something wrong with the test, but it made the agent navigate PromptLayer worse, because it was getting distracted. I was telling it: you have to click this button, then you have to click this button, and then it didn't know what to do. So it's better to rely on exploration. You have a question?
>> Yeah, I'll push back a little bit.
>> Please.
>> I'll admit that any scaffolding we create today to resolve the idiosyncrasies of current limitations will be obsolete in 3 to 6 months. But even if that's the case, it helps a little bit today. How do you balance that? The wasted engineering to solve a problem we'll only have for three months?
>> It's a great question. So just to repeat: the question is basically, what is the trade-off between solving the actual problems we have today versus relying on the model that can't do it yet but will be able to in three months, right? It's case by case. It depends what you're building. If you're building a chatbot for a bank, you probably do want to be a little more careful. To me, the happy middle ground is to use this agent paradigm of a master while loop and tool calls, but make your tool calls very rigorous. I think it's okay to have a tool call that looks like this, or looks like half of this, in the same way that Claude Code uses Read as a tool call or Grep as a tool call. So for the edge cases, throw it in a structured tool that you can then eval and version and so on. I'll talk a little more about that later. But for everything else, for the exploration phase, leave it to the model, or throw in some system prompt. It's a trade-off, and it's very use-case dependent, but I think it's a good question. Thank you. So yeah, back to Claude Code.
We're getting rid of all this stuff. We're saying we don't want ML-based intent detection. We don't want ReAct; I mean, it uses ReAct a little bit, but we don't want ReAct baked in. We don't want classifiers. For a long time we actually built a product at PromptLayer, never released, only a prototype, for using an ML-based, non-LLM classifier in your prompt pipeline instead of LLMs. A lot of people have had a lot of success with that, but it feels more and more like it's not going to be that helpful unless cost is a huge concern for you. And even then, the cost of the smaller models matters less and less as the financial engineering between all these companies pays for our tokens. Claude also does this smart thing, I think, with the trigger phrases.
You have think, think hard, think harder, and ultrathink, which is my favorite. This lets us use the reasoning budget, the reasoning-token budget, as another parameter. The model can adjust this, but this is how we force it to adjust. As an alternative, you could make a tool call for hard planning, and there are actually some coding agents that do this; or you can let the user specify it and change it on the fly.
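Mechanically, this can be as simple as scanning the message for the strongest trigger phrase and mapping it to a reasoning-token budget. The numbers below are placeholders for illustration, not Anthropic's actual values:

```python
# Ordered strongest first so "think harder" wins over plain "think".
# Budgets are placeholder numbers, illustration only.
THINKING_BUDGETS = [
    ("ultrathink", 32_000),
    ("think harder", 16_000),
    ("think hard", 8_000),
    ("think", 4_000),
]

def thinking_budget(user_message: str) -> int:
    msg = user_message.lower()
    for phrase, budget in THINKING_BUDGETS:
        if phrase in msg:
            return budget
    return 0   # no extended thinking requested
```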
So this is one of the biggest topics here: sandboxing and permissions. I'm going to be completely honest, it's the most boring part of this to me, because I just run it in YOLO mode half the time. Some people on our team actually dropped all their local databases, so you do have to be careful. And we don't YOLO-mode with our enterprise customers, obviously. I think this stuff feels like it's going to be solved, but we do need to know how it works a little bit. There's a big issue of prompt injection from the internet: if you're connecting this agent that has shell access and you're doing web fetch, that's a pretty big attack vector. So there's some containerization of that. There's blocking URLs; you can see Claude Code's pretty annoying about "can I fetch from this URL, can I do this?", and it kind of puts it into a sub-agent. Most of the complex code here is in this sandboxing and permission set.
I think there's this whole pipeline to gate bash commands: depending on the prefix, it decides how the command goes through the sandboxing environment. A lot of the other agents work differently here, but this is how Claude Code does it; I'll explain the other ones at the end.
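A toy version of that prefix gate: classify each command into allow, deny, or ask-the-user by its prefix. The lists are examples of the idea, not Claude Code's actual policy:

```python
ALLOW_PREFIXES = ("ls", "cat ", "grep ", "git status", "git diff")  # example policy
DENY_PREFIXES = ("rm -rf", "sudo ", "curl ")                        # example policy

def gate_bash(command: str) -> str:
    cmd = command.strip()
    if any(cmd.startswith(p) for p in DENY_PREFIXES):
        return "deny"
    if any(cmd.startswith(p) for p in ALLOW_PREFIXES):
        return "allow"
    return "ask_user"   # anything unrecognized needs an explicit permission prompt
```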
The next topic of relevance here is sub-agents. This goes back to context management and the problem we keep returning to: the longer the context, the stupider our agent is. This is an answer to it: use sub-agents for specific tasks. The key with a sub-agent is that it has its own context and it feeds back only the results; this is how you avoid clutter. These are just four examples: researcher, docs reader, test runner, code reviewer. In that example I was talking about earlier, when I added all the tags to our website to let the agent navigate it better, I obviously used a coding agent to do that, and I said, read our docs first and then do it. It's going to do that in a sub-agent and feed back the information. The key thing here is the forking of the agent and how we aggregate results back into our main context.
So here's an example; I think this is actually very interesting, and I want to call out a thing or two. Task is what a sub-agent is. We give Task two things: a description and a prompt. The description is what the user is going to see. So you're going to say: task, find default chat context instantiation, or something. And for the prompt, you're going to give a long string, which is really interesting, because now we have the coding agent prompting its own agents. I've actually used this paradigm in agents I've built for our product. The agent can stuff as much information as it wants into this string. And going back to relying on the model: if this task returns an error, stuff in even more information and let it solve the problem. It's better to be flexible rather than rigid.
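Here's the shape of that call as I read it; the payload below is a paraphrased example, not a captured one. The description is the short label the user sees, and the prompt is the main agent writing marching orders for a fresh-context sub-agent:

```python
task_call = {
    "name": "Task",
    "args": {
        # Short label shown to the user.
        "description": "Find default chat context instantiation",
        # Long free-form brief. Only the sub-agent sees this, and only its
        # final answer flows back into the main context.
        "prompt": (
            "Search the repo for where the default chat context is "
            "instantiated. Start by grepping for 'ChatContext'. Report the "
            "file, the line, and a two-sentence summary of the call site."
        ),
    },
}
```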
If I were building this, I would consider switching the string to an object, depending on what you're building, and maybe let it return more structured data.
>> Yes. I can see this prompt has quite a few sentences. Is that in the main agent? Is that taking the context of the main agent, or is there some intermediate step where the sub-agent reads over what the main agent is doing and then generates it?
>> Right. So the question is: does the task just get the prompt here, or does it also get your chat history? Is that the question?
>> The question is: I have my main agent. Is all of this in the system prompt of the main agent, to inform how it prompts the sub-agent?
>> No, it's not in the system prompt. It's in the whole context.
>> Is all of this in the context of the main agent, the task it calls? Or are you saying the structure for the task, this whole JSON?
>> Yes. So this is a tool call. The tool-call structure of what a task is lives in the main agent, and then these are generated on the fly. As it wants to run a task, it generates the description and the prompt. Task is a tool call. They can be run in parallel, and then they return the results. Hopefully that helps.
So we can go back to the system prompt. There are some leaks of the Claude Code system prompt; that's what I'm basing this on. You can find it online. Here are some things I noted from it. Concise outputs: obviously, don't give anything too long. No "here is" or "I will"; just do the task the user wants. Pushing it to use tools instead of text explanations: we've all built coding agents, and when we do, the model usually says, "Hey, I want to run this SQL." No, push it to use the tool. Matching the existing code. Not adding comments (this one does not work for me). Running commands in parallel extensively, and then the to-dos and so on. There's a lot you can nudge it to do with the system prompt. But as you see, I think there's a really interesting point related to the earlier question about the trade-off between DAGs and loops.
A lot of these things feel like they came from someone using Claude Code and saying, "Oh, if only it did this a little less," or "if only it did this a little more." That's where prompting comes in, because it's so easy to iterate, and it's not a hard requirement: if only it said "here is" a little less; it's okay for it to say it sometimes. All right, skills. Skills are great.
It's slightly newer; I honestly got convinced of it only recently. So good. I built these slides with skills. In the context of this talk about architecture, let's think of a skill as an extendable system prompt. In the same way that we don't want to clutter the context, there are a lot of different types of tasks where you want a lot more context. This is how we give Claude Code a few options for tapping into more information. Here are some examples. I have a skill for docs updates that tells it my writing style and my product; if I want to do a docs update, I say, use that skill, load it in. Editing Microsoft Office, Word and Excel: I don't use this, but I've seen a lot of people using it; it kind of decompiles the file. It's really cool, and it lets Claude Code do it. Design style guide: this is a common one. Deep research: the other day I threw in an article, or a GitHub repo, on how deep research works, and I said, rebuild this as a Claude Code skill. It works so well; it's amazing.
So, unified diffing. I think this is worth its own slide, even though it's fairly obvious. It makes everything so much better: it keeps you under the token limit, it's faster, and it's less prone to mistakes, like that example I gave of rewriting an essay versus marking it up with a red line. I highly recommend using diffing in any agents you're building. Unified diff is a standard. When I looked into a lot of these coding agents, some actually built their own variations on unified diff, because you don't always need the line numbers, but unified diff works. A quick sketch of producing one is below.
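If you want unified diffs in your own agent, Python ships them in the standard library; no need to invent a format:

```python
import difflib

old = ["def greet():", "    print('hello')"]
new = ["def greet(name):", "    print(f'hello {name}')"]

diff = difflib.unified_diff(old, new, fromfile="a/greet.py",
                            tofile="b/greet.py", lineterm="")
print("\n".join(diff))
# --- a/greet.py
# +++ b/greet.py
# @@ -1,2 +1,2 @@
# -def greet():
# -    print('hello')
# +def greet(name):
# +    print(f'hello {name}')
```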
You had a question?
>> To go back to skills. I don't know if anyone's seen it, but Claude Code warns you in yellow text if your CLAUDE.md is greater than 40k characters. So I was like, okay, let me break this down into skills. I spent some time on that, and then Claude ignored all of my skills. So, what am I missing? I don't know. Skills [clears throat] feel globally misunderstood, or I'm missing something. Help me understand. [laughter]
>> Yeah. So the question was: the Claude Code system, the CLAUDE.md, tells you when it's too long, so you move things into skills, and then it's not recognizing the skills and not picking them up when needed.
>> Yeah.
>> Take that up with the Anthropic team, I'd say. But that's also a good example of maybe the system prompt...
>> Maybe that was the intention: skills, you need to invoke them, and the agent itself shouldn't just call them all the time.
>> Right. It does give a description of each skill to the model, or it should: here's a one-liner about each skill. So theoretically, in a perfect world, it would pick up the right skills at the right times. But you're right, I generally have to call the skill myself, manually. I think this is a good tie-back into when prompting is the right solution, or when the DAG is the right solution, or maybe this is a model-training problem. Maybe they need to do a little more in post-training to get the model to call skills the way it calls tools; you have to know when to call it. So maybe this is just a functionality that's not that good yet. I think the paradigm is very interesting, but it's not perfect, as we're learning.
Diffing we just talked about. So, what's next? This is more opinion-based: where I see these things going, and where the next innovations might be. I think there are two schools of thought here. A lot of people think we're going to have one master loop with hundreds of tool calls, and tool calling is just going to get much better. That's highly likely. I take the alternate view: I think we need to reduce the tool calls as much as possible and go back to just bash, maybe even putting scripts in the local directory. I'm a proponent of one mega tool call instead of many tool calls. Maybe not literally one; I actually think that slide I showed you before is probably a good list. But a lot of people think we need hundreds of tool calls, and I just don't think it's going there.

Adaptive budgets, adjusting reasoning: we do this a little bit with thinking and ultrathink and so on, but I think reasoning models as a tool makes a lot of sense as a paradigm. A lot of us would make the trade-off of a 20-times-quicker model with slightly stupider results, plus the ability to make a tool call to a very good model. That's a trade-off we'd make in a lot of cases. Maybe not for our planner; maybe we go to the planner first with GPT-5.1 Codex, or Opus, or whatever, when the new Opus comes out. But I think there's a lot of mixing and matching we can do, and that's the next frontier.

And the last frontier: I think there's a lot we can learn from to-do lists, and new first-class paradigms we can build. Skills are another example of a first-class paradigm we can try to build in; maybe it doesn't work perfectly, but I think there are a lot of new discoveries to be made there, in my opinion. Do I have them? I don't know.
So now, for the latter part of this talk, I want to talk about the other frontier agents and the design philosophies they've chosen. We all have the benefit that we can mix and match: when we're building our own agents, we can do whatever we want and learn from the best, and the frontier labs are very good at this.
Something I like to go back to a lot, I call it the AI therapist problem; maybe there's a better name for it. I believe for a lot of problems, the most interesting AI problems around, there isn't a global maximum. Meaning: all right, we're in New York City. If I need to see a therapist, there are six on every block here. There's no global answer for who the best therapist is. There are different strategies: a therapist that does meditation, or CBT, or maybe one that gives you ayahuasca. These are just different strategies for the same goal, in the same way that if you're building an AI therapist, there isn't a global maximum. This is kind of my anti-AGI take, but it's also the take that says: when you're building these applications, taste comes into it a lot, and design architecture matters a lot.
You can have five different coding agents that are all amazing. Nobody knows which one's best today, to be honest. I don't think Anthropic knows. I don't think OpenAI knows. I don't think Sourcegraph knows. Nobody knows whose is best, but some are better at some things. I personally like Claude Code for, as I said, running my local environment, or using git, these kinds of human actions that require back and forth. But I go to Codex for the hard problems, or I go to Composer from Cursor because it's faster. All this to say: there's value in having different philosophies here, and I don't think there's going to be one winner. There are going to be different winners for different use cases. And this is not just coding agents, by the way; this is all AI products. This is kind of why our whole company focuses on domain experts and bringing the PM and the subject-matter expert into it, because that's how you build defensibility.
So here are the perspectives, the way I see them. This is not a complete list of coding agents, but these are the ones I think are the most interesting. Claude Code, to me, wins on user-friendliness and simplicity. Like I said, if I'm doing something that requires a lot of applications, git is the best example; if I want to make a PR, I'm going to Claude Code. Context: it's really good at context management. It feels powerful. Do I have the evidence to show you that it's more powerful? Probably not. But it feels that way to me, and to the market. There's a whole other conversation here about whether the market knows best, and whether what people talk about knows best, but I don't know if they know either. Cursor: the IDE is kind of their perspective; model-agnostic, and it's faster. Factory makes Droid; great team, they were here too. They really specialize their Droid sub-agents; that's kind of their edge, and that's maybe a DAG conversation too, or maybe model training. Cognition, so Devin: kind of end-to-end autonomy and self-reflection. AMP, which I'll talk about more in a second: they have a lot of interesting perspectives, and I actually find them very exciting these days. There's a free tier, it's model-agnostic, and there's a lot of UX sugar for users. I love their design and their talks at this conference; they have very, very unique perspectives. So let's start with Codex, because it's a popular one.
It's pretty similar to Claude Code: the same master while loop, as most of these have, because that's just the winning architecture. Interestingly, it has a Rust core. The cool thing is it's open source, so you can actually use Codex to understand how Codex works, which is kind of what I did. It's a little more event-driven; more work went into concurrent threading here, with submission queues and event outputs, the kind of thing I was talking about with the I/O buffer in Claude Code, though I think they do it a little differently. Sandboxing is very different: you can see here macOS Seatbelt and Linux Landlock, so theirs is more kernel-based. And state: it's all under threading, and permissions are where I'd say it's mostly different. And then the real difference is the model, to be honest. This is actually me using Claude Code to understand how Codex works. You see we have a few Explore calls; I didn't talk about Explore, but it's another sub-agent type, and as I mentioned, these come and go. But yeah, this is researching Codex with Claude Code. It's always a fun thing to do. So let's talk about AMP.
So this is Sourcegraph's coding agent. It has a free tier; that's just a cool perspective, in my opinion. They leverage these excess tokens from providers, and they give ads. We actually have an ad on there. I'm pro-ad; a lot of people are anti-ad, and I think it's one of my hot takes, but I like it. They don't have a model selector. This is very interesting too; it's its own perspective. It actually helps them move faster, because you have less of an exact expectation of what the output is, since they might be switching models here and there. So that changes how they develop. And I think their vision is pretty interesting: how do we build not just the best agent, but the agent that works with the most agent-friendly environments? Factory gave a talk similar to this as well. How do you build a hermetically sealed coding repo that the agent can run tests on? How do you build the feedback loop? Because that's kind of the holy grail; that's how we build an autonomous agent. I'd love to see the front-end version of this: let it look at its own design and make it better, back and forth. This is their guiding philosophy, and you could boil it down to the agent perspective, as I've been calling it.
I think they do interesting stuff with context. We're all familiar with compact. It's the worst; you have to wait, and I don't know why it takes so long. If you're not familiar: it summarizes your chat window when the context gets too high and gives you the summary. They have something called handoff, which makes me think of, if anyone was a Call of Duty player back in the day: switching weapons is faster than reloading. That's what handoff is. You're just starting a new thread and giving it the information it needs. That feels like the winning strategy to me. I could be wrong, and maybe you need both, but that's where they're pushing it, and I kind of like that. They give a very fresh perspective. The second thing is model choice. This is the reasoning knobs, and their view on it: they have fast, smart, and Oracle. So they lean even more heavily into "we have different models." They tell you what Oracle is, but they're willing to switch what Oracle is, and they're going to use Oracle when there's a very hard problem.
So, that's AMP. Let's go to Cursor's agent. I think Cursor's agent has a very interesting perspective. First, obviously, it's UI-first, not CLI. I think they might have a CLI, I'm not entirely sure, but the UI is the interesting part. It's just so fast. Their new model, Composer, is distilled; they have the data. In my opinion, they made people interested in fine-tuning again. We'd almost never recommend fine-tuning to our customers, but Composer shows you that you can actually build defensibility based on your data again, which is surprising. I've been almost completely switching to Composer, because it's just so fast. It's almost too fast; I accidentally pushed to master on one of my personal projects, and you don't always want that. But Cursor was just the crowd favorite, and I want to give a lot of props to their team. They built iteratively. The first version of Cursor was so bad, and we all used it; I used it because it's a VS Code fork and I had nothing to lose. And it's gotten so good. It's such a good piece of software, and it's a great team. But the same can be said about OpenAI's Codex models. They're not quite as fast, but they are optimized for these coding agents, and they are distilled. I could see OpenAI coming out with a really fast model here, because they also have the data.
So here's a picture they put on their blog, and you can see their perspective on coding agents just from the fact that they show you the three models they're running. They're offering Composer, but they're letting you use the state of the art, because they know that maybe GPT-5.1 is better at planning. Here it's 5, but now we have 5.1.
So here's the big question: which one should we all use? Which architecture is best? What should we do? My opinion here is that benchmarks are pretty useless. Benchmarks have become marketing for a lot of these model providers; every model beats the benchmarks, and I don't know how that happens. But I think there's a world where evals matter here. The question is what you can eval. This simple while-loop architecture I've been pushing, based on my understanding of it, actually makes things harder to eval, because if we're relying more on model flexibility, how do you test it? You could run an integration test, this end-to-end test, and just ask: does it fix the problem? That's one way to do it. You could break it up; you could do point-in-time snapshots and say, hey, I'm going to give my chatbot context from a half-finished conversation where I know it should be running a specific tool call, and run those. Or I could just run a back test and ask: how often does it change the tools?
There's also another concept starting to be developed here called agent smell, or at least that's what I'm calling it. Run an agent and see: how many times does it call a tool? How many times does it retry? How long does it take? These are all surface-level metrics, but they're really good for sanity checking. These things are hard to eval; there's a lot that goes into it, and I'll show you an example of what I did. But on that subject, maybe I'll say one more thing. My mental model is: you can do an end-to-end test, you can do a point-in-time test, or, what I most often recommend, just do a back test. Start with a back test, start capturing historical data, and then just rerun it. So let me give you this example.
So basically, what I have here is a screenshot of PromptLayer. Our eval product is also just a batch runner, so you can run a bunch of columns through a prompt. But in this case, I'm running them through not a prompt but Claude Code. I have a headless Claude Code, I'm taking all these providers, and my headless Claude Code says (I think I have it on the next slide): search the web for the model provider given to you in a file variable, find the most recent and largest model released, and return the name. I don't know what it's doing internally; it's doing web search, and I'm not even caring about that. This is an end-to-end test. This is how we try out Claude Code, and I actually think there's a lot to putting Claude Code into your workflows with these headless SDKs; I'll talk about that on the next slide. The main takeaway is that you can start to do end-to-end tests: look at it from a high level, do an agent smell, and then look into the statistics on each row and see how many times it called a tool.
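Agent smell is cheap to compute if you log a trace of events. A sketch, assuming a made-up trace format where each event has a type, a name, and a timestamp:

```python
from collections import Counter

def agent_smell(trace: list) -> dict:
    """Surface-level sanity metrics over a list of logged agent events."""
    tool_calls = [e for e in trace if e["type"] == "tool_call"]
    return {
        "num_tool_calls": len(tool_calls),
        "calls_per_tool": dict(Counter(e["name"] for e in tool_calls)),
        "num_retries": sum(1 for e in trace if e["type"] == "retry"),
        "wall_clock_s": trace[-1]["ts"] - trace[0]["ts"] if trace else 0,
    }
```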
And going back to what we've talked about a lot in this talk: rigorous tools. The tools can be rigorously tested. This is how you offload the determinism to different parts of your system: you test the tools, and you test them hard. Look at them like functions: an input and an output. If your tool is a sub-agent that runs, then we're in a kind of recursion, because then you have to go back and test the end-to-end thing. But for your tools, I'll give you this example. In my agents in general, my autonomous agents, if there's something very specific I want to output, say a very specific email format, or a type of blog post where I really want it to get my voice right, I don't want to rely on model exploration. I want to build a tool that I can rigorously test. In this case, this is also just a PromptLayer screenshot, but it's a workflow I've built. It has an LLM assertion that checks if the email is up to my standards. If it's good, it revises it; if it's not good, it adds the parts it missed, like the header, and revises it in the same step. This is obviously a very simple example, but we have another version for some of our SEO blog posts that has about 20 different nodes, writes an outline from deep research, fixes the conclusion, and adds links.
For the stuff where you have a very specific vision, testing just gets so much easier, because, as you can see, this sort of workflow has fewer steps and less flexibility. So this is an eval I made. I start with a bunch of sample emails, I run the prompt, actually the agentic workflow, and I add a bunch of heuristics. This is a very simple LLM-as-judge: does it include the three parts? That's what I was testing for: the "hi Jared", the email body, and the signature. You can get a lot more complicated; you could do code execution; LLM-as-judge is usually the easiest. Now, obviously, I can keep running this until it's correct on all of them and watch my eval improve over time. In this example I got it to 100, so that was fun.
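The judge itself can be one prompt that returns pass/fail per criterion. A minimal sketch, assuming a generic `llm` callable that takes a prompt and returns text; the three criteria are the ones from this example:

```python
JUDGE_PROMPT = """You are grading an email draft.
Answer with exactly three lines, each just 'yes' or 'no':
1. Does it open with a greeting (e.g. 'Hi Jared')?
2. Does it have a body paragraph?
3. Does it end with a signature?

Email:
{email}"""

def judge_email(llm, email: str) -> bool:
    answers = llm(JUDGE_PROMPT.format(email=email)).strip().splitlines()
    # Pass only if all three criteria come back 'yes'. Parsing is loose on
    # purpose; judge outputs drift, so keep the contract simple.
    return sum("yes" in a.lower() for a in answers[:3]) == 3
```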
Then I want to add another future-looking thing: keep an eye on the headless Claude Code SDK. I know there was a talk about it this morning, so I won't spend too much time on it, but it's amazing. You just give a simple prompt and it's just another part of your pipeline. I use it for, I think I have it on the next slide, a GitHub Action that updates my docs every day and reads all the commits we've pushed to our other repos. We have a lot of commits going, and it just runs Claude Code. The Claude Code pulls down all the repos, checks what's updated, reads our CLAUDE.md to see if it should even update the docs, then creates a PR. I think this unlocks a lot of things, and there's a possibility we're going to start building agents at a higher order of abstraction, relying on Claude Code and these other agents to do a lot of the harness and orchestration work.
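In its simplest form, headless use is just the CLI's print mode (`claude -p`), which you can drop into a script or a CI step like any other command. The prompt and timeout here are illustrative:

```python
import subprocess

# `claude -p` runs Claude Code non-interactively: one prompt in, final text out.
result = subprocess.run(
    ["claude", "-p",
     "Read the commits pushed since yesterday and update docs/ if anything "
     "changed. Open a PR; do not merge it."],
    capture_output=True, text=True, timeout=3600,
)
print(result.stdout)
```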
>> Are you reviewing those?
>> Yeah. [laughter] It creates a PR. It doesn't merge the PR.
So, here are my takeaways. Number one, trust in the model: when in doubt, rely on the model when you're building agents. Number two, simple design wins; number one and number two kind of go together. Number three, bash is all you need: go simple with your tools. Don't have 40 tools; have ten, or five. Number four, context management matters: this is the boogeyman we're running from all the time in agents at this point. Maybe there'll be new models in the future that are just so much better at context, but there's always going to be a limit, because, hey, you're talking to a human: I forget people's names if I meet too many in one day. That's context management, or my stupidity, I don't know. And number five, different perspectives matter in agents. The engineering brain doesn't always comprehend this as much as it should, and I'm an engineer, so I'm also talking about myself. There are different ways to solve a problem where no one way is better than the others, and you probably want a mixture-of-experts agent. I would love to have mine run Claude Code and Codex and so on, give me the output, and treat it as a team, maybe have them talk to each other in a Slack-based message channel. I'm waiting for someone to build that; that would be great. But these are my takeaways.
My bonus thing: here's how I built this slide deck using Claude Code. I built a Slidev skill; I basically told Claude Code to research how Slidev works (that's the library I made this in). I built a deep-research skill to research all these agents and how they work. And I built a design skill, because I can tell when a thing looks terrible or looks good, but I'm not a good designer myself. So for these boxes, I was just like, oh, make the box a little nicer, give it an accent color. So yeah, that's how I built it. But again, thank you for listening. Happy to answer any questions. I'm Jared, founder of PromptLayer. Find me there.
[applause]
>> Yes.
>> Thank you. Great talk. So you mentioned, regarding DAGs, basically: let's get rid of them, right? But DAGs enforce a sequential execution path. Say a customer service agent asks for the name, then the email, in some sort of sequence. Are you saying this should now just be written out as a plan for an agent to execute, and we trust that the model is going to call those tools in that sequence? How do we enforce the order?
>> Right. So the question was: I keep talking about getting rid of DAGs, so how else are you supposed to enforce a specific order for solving a problem? I think there are different types of problems. For the problem of building a general-purpose coding agent that we can all use to do our work, and that even non-technical people can use, there's no specific sequence of steps for solving it, which is why it's better to rely on the model. If your problem is, let's say, building a travel itinerary, it's more of a specific sequence, because you have a deliverable that's always the same. So there's a little more of a DAG that could matter. But in the research step of traveling, you probably don't want a DAG, because every city is going to be different. So it really depends on the problem you're solving. If I wanted to make an agent for a travel itinerary, I'd probably have one of my tool calls be a DAG that creates the output file, because I want the output to look the same, or creates the plan. And then in the system prompt, I could say, always end with the output, for example. But you need to mix and match. Every use case is different. If you want to make something general-purpose, my take is to rely more on the model and simple loops, and less on a DAG.
>> Cool. Any other questions?
Yes.
>> Yeah, building on that point: do you think we're heading towards a world where you're not actually going to call the API through code, and most LLM calls happen by triggering Claude Code and just writing files instead?
>> So the question is: are we going to move away from calling models directly and just call something like a headless Claude Code, right?
>> Yeah. Like, I have a pipeline that does one LLM call per document and summarizes them at the end. You could make a while-loop Claude Code that saves a file every time. You never call the API, besides using Claude Code in a while loop.
>> Potentially. I'll give you the pro and the con there. The pro is it's easier to develop, and we can kind of rely on the frontier. I mean, if you think about it, a reasoning model is just that. Reasoning models didn't always exist; we just had normal LLMs, and then, oh, now we have o1 and reasoning models. All that is, and it's a little more complicated than this, but it's basically just a while loop on OpenAI's servers that keeps running the context and eventually gives you the output. In the same way, the Claude Code SDK is a while loop with a bunch more things. So I could totally see a lot of builders only touching these agentic endpoints, maybe even a model provider releasing a model as an agentic endpoint. But for a lot of tasks, you're going to want a little more control, and you'd probably still want to go as close to the metal as possible. Having said that, there were a lot of people who still wanted completions models, that never happened, and nobody really talks about it anymore. So it's very likely that everything just becomes this SDK. I don't have a crystal ball, but that's how I would think about it.
>> Yes,
>> Thanks for the talk. I know you said the simpler the better, but what are your thoughts on test-driven development and spec-driven development with AI? Have you tried it? What do you think of it?
>> For building agents, or for getting work done?
>> For coding.
>> Okay. So the question is on spec-driven development and test-driven development for coding with agents.
[cough and laughter]
When in doubt, go back to good engineering practices is what I would say. There are whole engineering debates about whether test-driven development is the right way; some people swear by it and some people don't, so I don't think there's one answer. With coding agents, test-driven development clearly makes things easier. As I was showing you, that's AMP's, Sourcegraph's, whole philosophy, and I think Factory thinks this as well: if you can build good tests, your coding agent can work much better. So it makes sense to me. When I'm working personally, I rely pretty heavily on the planning phase and the spec-driven development phase. The simpler tasks are pretty easy for the model, so if I'm doing a very simple edit, I'll skip that step. So, no one-size-fits-all, but when in doubt, return to the engineering principles you believe in. I'd say yes.
>> So earlier you talked about system prompt leaks. Is it possible to just look at the downloaded bundle, or do they have a special endpoint that keeps the prompts behind the API?
>> Yeah. I think they hide it. There was actually an interesting article: because Codex is open source, before OpenAI released the Codex model it was using, people were able to hack together the open-source Codex to give a custom prompt to the model and use the model without it. So you can dive into it, but generally they try to hide it. And also, out of laziness: someone posted it, so there you go, that's the work, but someone had to have found it, right?
>> Like, is this prompt somewhere on your machine?
>> I actually don't know that answer. [laughter] Do you know that answer?
>> Yeah.
>> Yes.
>> It's on your machine.
>> Nico says it's on your machine. So there we go. Maybe the prompt I was looking at is a little old and I have to update it. But the question was: is the prompt hidden on their servers, or can you find it if you're determined? And the answer seems to be yes. Any other questions?
>> Yes.
>> Is this the last one?
>> Is this the last question?
>> It can be.
>> Can you talk about PromptLayer and how people can help you?
>> Yes, that's a good one. I forgot about that. Thank you. So yeah, number one, we're hiring. If you're looking for coding jobs at a very fun and fast-moving team in New York, you can reach out to me on X or email jared@promptlayer.com.
We're based in New York. We're a platform for building and testing AI products: prompt management, auditability, governance, all that fun stuff, but also logging and evals. The screenshots I showed you came from PromptLayer. If you're building an AI application and you're building it with a team, you should probably try PromptLayer. It'll make your life easier, especially the bigger your team is: the more you want to collaborate with PMs and non-technical users. And if you're just technical users, it's a great tool too. It'll make your life better. Highly recommend it: promptlayer.com, and it's easy to get started. And that was my show. Thank you for listening. [applause]
[music]