
How Claude Code Works - Jared Zoneraich, PromptLayer


Transcript


0:13

[music]

0:20

So, welcome to the last workshop. Um,

0:23

you made it. Congrats. Yeah,

0:28

out of like 800 people, you're

0:30

the last standing, sort of very

0:32

dedicated engineers. Uh yeah, so this

0:35

one's a weird one. I got in trouble with

0:37

Anthropic on this one. Uh obviously

0:39

because of the title. I actually also

0:41

gave him the title and I was like, do

0:42

you want to change it? He was like, no,

0:44

I'll just roll with it. It's kind of funny.

0:46

Uh uh so so yeah, this is not officially

0:48

endorsed by Anthropic, but we're hackers,

0:52

right? And Jared is like super

0:54

dedicated. He's um and the other thing I

0:57

also like really enjoy is featuring like

0:59

notable New York AI people, right? Like

1:02

so don't take this as if this is the

1:04

only thing that Jared does. He has a

1:06

whole startup that you should definitely

1:07

ask him about. Um but like you know I'm

1:09

just really excited to feature more

1:11

content for local people. So yeah,

1:14

Jared, take it away.

1:15

>> Thank you very much. Thank you very

1:17

much. And what an amazing conference.

1:19

very sad we're ending it, but hopefully

1:21

it'll be a good ending here. Um, and

1:24

yeah, uh, my name is Jared. Uh, this

1:27

will be a talk on how Claude Code works.

1:30

Again, not affiliated with Anthropic.

1:32

Uh, they don't pay me. I would take

1:35

money, but they don't. Um, but we're

1:37

going to talk about a few other coding

1:39

agents as well. And kind of the high-level

1:42

goal that I'll go into is me personally,

1:45

I I'm a big user of all the coding

1:47

agents, as is everyone here. and they

1:50

kind of exploded recently and as a

1:53

developer I was curious what changed

1:55

what finally made coding

1:58

agents good. So let's get

2:01

started. I'll start about me. I'm Jared.

2:03

You can find me I'm Jared Z on X on

2:06

Twitter whatever. Um I'm building the

2:08

workbench for AI engineering. So uh my

2:12

company is called PromptLayer. We're

2:13

based in New York. You can kind of see

2:15

our office here. It's like a little

2:17

building. So it's blocked by a few of

2:18

the other buildings. So we're we're a

2:20

small team. We launched the product 3

2:22

years ago. So uh long for AI but small

2:24

for everything else. And uh yeah, what

2:27

kind of our core thesis is that we

2:30

believe in rigorous prompt engineering,

2:32

rigorous agent development,

2:35

and we believe that the product team

2:37

should be involved, the engineering team

2:39

should be involved. We believe if you're

2:40

building AI lawyers, you should have

2:42

lawyers involved as well as engineers.

2:44

Um so that's kind of what we do. uh

2:46

processing millions of LLM requests a

2:48

day. And a lot of the insights in this

2:49

talk come from just conversations we

2:52

have with our customers on how to build

2:55

coding agents and stuff like that. And

2:57

also feel free throughout the talk we

2:59

can make this casual. So if there's

3:00

anything I say if you have a question

3:02

feel free to just throw it in. Uh and I

3:05

spend a lot of my time kind of

3:08

dogfooding the product. It's kind of weird

3:08

the job of of a founder these days

3:10

because it's half like kicking off

3:12

agents and then half just using my own

3:14

product to build agents and feels weird

3:16

but it's kind of fun. And uh yeah, the

3:19

last thing I'll add here is I'm a big

3:21

enthusiast. We literally rebuilt our

3:24

engineering org around Claude Code. I

3:26

think the hard part about building a

3:28

platform is that you have to deal with

3:30

all these edge cases and oh uh we're

3:33

uploading data sets here it doesn't work

3:35

and you could die a death by a thousand

3:38

cuts. So we made a rule for our

3:39

engineering organization if you

3:42

can complete something in less than an

3:44

hour using Claude Code. Just do it. Don't

3:47

prioritize it. And we're a small team on

3:49

purpose but uh it's helped us a lot and

3:52

I think it's really taken us to the next

3:53

level. So I'm a big fan and let's dive

3:56

into how these things work. So this is

3:58

what, as I was saying, is the goal of this

3:59

talk. First, why have these things

4:02

exploded? What is the

4:05

what was the innovation? What was the

4:07

invention that made coding agents

4:08

finally work? If you've been around this

4:11

field for a little bit, you know that uh

4:14

a lot of these autonomous coding agents

4:16

sucked at the beginning and we all tried

4:18

to use them. Uh but it's it's night and

4:21

day. uh we'll dive into the internals

4:24

and lastly, everything in

4:27

this talk is oriented around how do you

4:29

build your own agents and how do you use

4:31

this to do AI engineering for yourself.

4:36

So let's just go uh talk about history

4:38

for a second here. How did we get here?

4:41

uh, everybody knows, it started with

4:44

remember the workflow of you just copy

4:47

and paste your code back from chat GPT

4:49

back and forth and that was great and

4:50

that was kind of revolutionary when it

4:52

happened. Uh, step two, when Cursor came

4:55

out, if we all remember, it was not

4:59

great software at the beginning. It was

5:01

just the VS Code fork with the command K

5:04

and we all loved it. But, uh, now

5:08

we're not going to be doing command K

5:09

anymore. Then we got the Cursor

5:11

assistant. So, that little agent back

5:13

and forth, and then Claude Code. And

5:15

honestly, in the last few days since I

5:17

made this slide, maybe there's a new

5:19

version we could talk about here. And uh

5:21

at the end I'll talk about like kind of

5:22

what's next. But this is how we got

5:24

here. And this is really I think the

5:26

Claude Code is kind of this headless,

5:29

not even, this new workflow of not

5:32

even touching code. And it has to be

5:35

really good. So why is it so good? What

5:38

what was uh what was the big

5:40

breakthrough here? Let's try to figure

5:41

that out. And again throw this in one

5:44

more time. These are all my opinions uh

5:46

and what I think is the breakthrough.

5:47

Maybe there's other things but simple

5:50

architecture. I think a lot of things

5:52

were simplified with how the agent was

5:54

designed and then better models, better

5:57

models and better models. Uh I think

6:01

a lot of the breakthrough is kind of

6:03

boring in that it's just Anthropic

6:06

releasing a better model that works

6:07

better for these types of tool calls

6:09

and these type of things. But the simple

6:12

architecture relates to that. So we can

6:14

dive into that. The architecture. And

6:18

you'll see, the prompt

6:20

wrangler is our little mascot for our

6:22

company. So we made a lot of graphics

6:24

for these slides but uh

6:26

basically give it tools and then get out

6:28

of the way is the one-liner of the

6:32

architecture is today. I think if you've

6:35

been building on top of LLMs for a little

6:38

bit this has not always been true.

6:39

Obviously tool calls haven't always

6:41

existed and tool calls is kind of this

6:43

new abstraction for JSON formatting and

6:46

if you remember the GitHub libraries

6:48

like Jsonformer and stuff like that in

6:50

the olden days. But: give it tools, get out

6:53

of the way. Uh the models are built for

6:56

these things and being trained to get

6:58

better at tool calling and better at

7:00

this. So the more you want to

7:02

overoptimize and every engineer uh

7:04

including, and especially, myself, loves to

7:06

overoptimize and when you first have an

7:08

idea of how to build the agent you're

7:10

going to sit down and say oh and then

7:12

I'm going to prevent this hallucination

7:13

by doing this prompt and then this

7:15

prompt and then this prompt. Don't do

7:17

that. Just a simple loop, and get out of

7:21

the way, and just delete scaffolding;

7:25

less scaffolding, more model, is kind

7:27

of the tagline here. And you know, this is

7:30

uh the leaderboard from this week.

7:34

Obviously, these models are getting

7:36

better and better. Uh we could have a

7:38

whole conversation and I'm sure there's

7:39

been many conversations about is it

7:41

slowing down? Is it plateauing? It

7:43

doesn't really matter for this talk. We

7:46

know it's getting better and they're

7:48

getting better at tool calling and

7:49

they're getting better optimized for

7:51

running autonomously. And this is what

7:55

I think Anthropic calls the

7:56

AGI-pilled way to think about it:

7:59

don't try to overengineer around model

8:01

flaws today because a lot of the things

8:04

will just get better and you'll be

8:07

wasting your time. So here's the

8:10

philosophy, the way I see it, of Claude

8:12

code,

8:14

ignoring embeddings, ignoring

8:16

classifiers, ignoring pattern matching.

8:19

We had this whole RAG thing; actually

8:21

Cursor's bringing back a little bit of RAG

8:23

and how they're doing it and they're

8:24

mixing and matching. But I think the

8:25

genius with Claude Code is that they

8:28

scratched all this and they said we

8:29

don't need all these fancy uh paradigms

8:33

to get around how the model's bad. let's

8:36

just make a better model and then let it

8:38

let it cook, and just leaning on

8:43

these tool calls and

8:46

simplifying the tool calls which is a

8:48

very important part. Instead of

8:49

having a workflow where the master

8:53

prompt can break into three different

8:55

branches and then go into four different

8:56

branches, there's really just a few

9:00

simple tool calls, including grep

9:03

instead of RAG. And yeah, that's

9:07

kind of what it's trained on. So uh

9:09

these are very optimized tool calling

9:11

models.

9:13

So this is the Zen of Python, if

9:17

you guys are familiar if you do import

9:18

this in Python. I love this

9:20

philosophy when it comes to building

9:22

systems and I think it's really

9:26

apt for how Claude Code was built. So

9:29

really just simple is better than

9:31

complex, complex is better than

9:32

complicated, flat is better than nested.

9:34

This is

9:36

the whole talk. This is all you need to

9:38

know about how Claude Code works and why

9:40

it works. Specifically, we're

9:44

going back to engineering principles

9:46

such that simple design is better

9:48

design. Uh I think this is true whether

9:52

you're building a

9:56

database schema uh but this is also true

9:58

when you're building these autonomous

10:01

coding agents. So I'm going to now

10:05

kind of break down all the specific

10:07

parts of this coding agent and uh why I

10:11

think they're interesting. So the first

10:12

is the constitution. Now a lot of the

10:14

stuff we kind of take for granted even

10:16

though they started doing it a month or

10:18

two ago or maybe three or four months

10:20

ago. So this is the CLAUDE.md; Codex and

10:23

others use AGENTS.md. The interesting

10:26

thing: I assume most of you know

10:28

what it is. Again, it's where you

10:30

put the instructions for your library.

10:32

But the interesting thing about this is

10:36

it's basically the team saying we don't

10:39

need to overengineer a system where the

10:42

model first researches the repo and

10:45

Cursor, like Cursor 1.0, as you know,

10:49

makes a vector DB locally to understand

10:52

the repo and kind of does all this

10:54

research. They're just saying, "Ah, just

10:56

put a markdown file. Let the user change

10:58

stuff when they need. Let the agent

10:59

change stuff when they need. Very simple,

11:02

and kind of goes back to prompt

11:05

engineering, which I'm a little biased

11:06

towards because PromptLayer is a prompt

11:09

engineering platform, but uh

11:10

everything's prompt engineering at the

11:12

end of the day or context engineering.

11:14

Everything is how do you uh how do you

11:17

adapt these general purpose models for

11:18

your usage?" And the simplest answer is

11:21

the best one here, I think.
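
For illustration, here is a minimal sketch of what such a constitution file might contain; the project layout, commands, and rules are invented, not from the talk:

```markdown
# CLAUDE.md (hypothetical example)

## Project layout
- `api/`: Flask backend
- `web/`: React frontend

## Commands
- Run tests: `make test` (skip the slow e2e suite locally)
- Lint before committing: `make lint`

## Rules
- Use type hints in all new Python code.
- Never edit generated files under `web/src/gen/`.
```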

11:24

So this is the core of the system.

11:29

It's just a simple master loop. Uh and

11:34

this is actually kind of

11:35

revolutionary considering how we used to

11:37

build agents. Everything in Claude Code

11:39

and all the coding agents today,

11:41

Codex and the new Cursor and

11:44

AMP and all that, it's just one while

11:46

loop with tool calls just running the

11:48

master while loop calling the tools and

11:50

going back to the master while loop.

11:52

This is basically four lines.

11:55

I think they call it N0

11:57

internally. Uh at least based on my

12:00

research. But: while there are tool

12:02

calls, run the tool, give the tool

12:05

results to the model, and do it again

12:07

until there's no tool calls and then ask

12:09

the user what to do. The first time I

12:11

did this, uh, the first time I used tool

12:15

calls, it was very shocking to me that

12:17

the models are so good at just knowing

12:19

when to keep calling the tool and

12:21

knowing when to fix their mistake. And I

12:23

think that's one of the most interesting

12:24

things about LLMs: they're just really good

12:27

at fixing mistakes and being flexible.

12:30

And the more just going back, the more

12:32

you lean on the model to explore and uh

12:36

figure it out, the better and more

12:39

robust your system is going to be when

12:41

it comes to better models.
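
For illustration, a minimal Python sketch of that master loop; the client API and tool dispatcher are hypothetical stand-ins, not Claude Code's actual internals:

```python
# Hypothetical sketch of the agent master loop, not Claude Code's real source.
# Assumes a client.generate(messages, tools) that returns text and/or tool calls,
# and a run_tool(name, args) dispatcher over read/grep/edit/bash.
def agent_loop(client, tools, messages):
    while True:
        response = client.generate(messages=messages, tools=tools)
        if not response.tool_calls:
            # No more tool calls: return the answer and ask the user what to do next.
            return response.text
        for call in response.tool_calls:
            result = run_tool(call.name, call.args)
            # Feed each tool result back so the model decides the next step itself.
            messages.append({"role": "tool", "name": call.name, "content": result})
```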

12:45

So,

12:48

so these are the core tools uh we have

12:50

in Claude Code today. And to be honest,

12:53

these change every day. you know,

12:55

they're doing new releases every few

12:56

days, but these are the core ones that I

12:59

found most interesting to talk about. Uh

13:02

there could be 15 tomorrow, there could

13:04

be down to five tomorrow, but this is

13:07

what I find interesting. So, first of

13:08

all, read. Uh yeah, they could just do a

13:12

cat. But what's interesting about read

13:15

is we have token limits. So, if you've

13:17

used Claude Code a lot, you've seen that

13:19

sometimes it'll say this file's too big

13:21

or something like that. That's why it's

13:23

worth building this read tool. Grep

13:26

glob. Uh,

13:28

this one's very interesting too because

13:30

it goes against a lot of the wisdom at

13:32

the time of using RAG and using vectors.

13:34

And I'm not saying RAG has no place, by

13:36

the way either. But in these general

13:38

purpose agents, grep is good, and

13:41

grep is how users would do it. And I

13:44

think that's actually a high-level point

13:46

here. As I'm talking about

13:48

these tools, remember these are all

13:51

human tasks. We're not

13:53

making up a brand new tool for the model

13:55

to use. We're kind of just mimicking the

13:57

human actions and what you and I would

13:59

do if we were at a terminal trying to

14:01

fix a problem. Edit. Edit makes sense. I

14:05

think the interesting thing to note in

14:06

edit is it's using diffs and it's not

14:08

rewriting files most of the time. Way

14:12

faster, way less context used,

14:15

but also way less

14:18

issues. If I

14:21

gave you these slides and asked you

14:23

to review the slides and you read it and

14:25

had to write down all the slides for me

14:28

in your new revisions versus if you

14:30

could just cross out things in the

14:31

paper, the crossing out is way easier.

14:33

Diff is kind of a natural thing to

14:35

prevent mistakes.
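
For illustration, here is what an edit looks like as a standard unified-diff hunk instead of a full-file rewrite; the file and the change are invented:

```diff
--- a/server/config.py
+++ b/server/config.py
@@ -12,4 +12,4 @@ def load_config(path):
     with open(path) as f:
         raw = f.read()
-    timeout = 30
+    timeout = int(os.environ.get("TIMEOUT_S", "30"))
     return parse(raw)
```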

14:37

Bash. Bash is the core thing

14:40

here. I think you could probably get rid

14:43

of all these tools and only have bash.

14:45

And the first time I saw this, when

14:48

you run something in Claude Code and

14:51

Claude Code creates a Python file and

14:53

then runs the Python file then deletes

14:55

the Python file, that's the

14:58

beauty of why this thing works. So bash

15:00

is the most important. I'd say web

15:02

search, web fetch. Uh the interesting

15:05

thing about these is they move it

15:06

to a cheaper and faster model. So for

15:08

example, if you're building some sort

15:11

of agent maybe on your platform and

15:13

you're building an agent and it needs to

15:14

connect to some endpoints, some list of

15:16

endpoints, it might be worth bringing that

15:18

into a kind of sub tier as opposed to

15:23

that master while loop. That's why this

15:25

is its own tool. To-dos: we've all

15:28

seen to-dos. I'll talk about them a little bit

15:30

more later, but keeping the model on

15:32

track, steerability, and then tasks.

15:34

Tasks is very interesting. It's context

15:36

management. It's: how do we run

15:40

this long process, read this whole file

15:41

without cluttering the context? Because

15:43

the biggest enemy here is when your

15:46

context is full, the model gets stupid

15:49

for lack of better words. So basically,

15:51

bash is all you need. Uh I think this is

15:53

the one thing I want to drill down on.

15:56

There are two amazing

15:57

things about bash for coding agents. The

15:59

first is that it's simple uh and it does

16:04

everything. It's very robust. But

16:06

the second thing that's equally

16:07

important is there's so much training

16:09

data on it because that's what we use.

16:11

It's the reason that models are

16:13

not as good at Rust or less common

16:16

programming languages just because

16:18

there's less people doing it.

16:22

So it's really the universal adapter.

16:24

Um, thousands of tools; you could do

16:26

anything. Uh, this is that Python

16:29

example I gave. I always find it so

16:31

cool when it does the Python script

16:32

thing or creates tests and I always have

16:34

to tell it not to. But all these

16:37

shell tools are in it. And this is I

16:40

mean, I find myself using Claude Code to

16:42

spin up local environments where

16:44

normally I'd have like five commands

16:46

written down on some file somewhere and

16:48

then they get out of date. It's really

16:50

good at figuring this stuff out and

16:51

running the stuff you'd want to do.

16:53

uh and it specifically lets the model

16:56

try things.
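
For illustration, a bash tool definition might look like this in the JSON-schema style these APIs use; this is a sketch, not the literal Claude Code definition:

```json
{
  "name": "bash",
  "description": "Run a shell command in the sandboxed workspace and return stdout/stderr.",
  "input_schema": {
    "type": "object",
    "properties": {
      "command": {"type": "string", "description": "Shell command to execute."},
      "timeout_ms": {"type": "integer", "description": "Optional timeout in milliseconds."}
    },
    "required": ["command"]
  }
}
```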

16:58

So uh yeah, the other suggestions here

17:02

and the tool usage uh I think there's a

17:06

little bit of a system prompt uh that

17:08

tells it which to use and when to use

17:11

which tool over which and this changes a

17:13

lot, but these are kind of like the

17:14

edge cases and the corners you find the

17:16

model getting stuck in. So reading

17:18

before editing: they actually make

17:20

you do that. Using the grep tool

17:23

instead of bash: so if you look at

17:27

the tool list here, there's a special grep

17:29

tool. Uh there could be a lot of reasons

17:32

for that. I think security is a big one

17:34

uh and sandboxing but then also just

17:37

that token limit thing running

17:39

independent operations in parallel. Uh,

17:41

so kind of pushing the model to do that

17:43

more. And then also like these trivial

17:45

things like quoting paths with spaces.

17:47

It's just the common common things. I'm

17:49

sure they're just dogfooding a lot at

17:51

Anthropic, and they find it and they're

17:52

like, "All right, we'll throw it in the

17:53

system prompt."

17:55

Okay, so let's talk about to-do lists.

17:57

Uh, now again, a very common thing, but

18:01

was not a common thing before.

18:04

So this is actually I think a to-do list

18:06

from some of my research for this

18:08

slide deck. Um, but the really

18:11

interesting thing about to-do lists is

18:14

that they're structured but not

18:17

structurally enforced. So, here are the

18:21

rules. One task at a time. Uh, mark them

18:24

completed. This is kind of stuff you

18:26

would expect. Uh, keep working on the in

18:29

progress one if there are blockers or

18:31

errors and kind of break up the tasks

18:34

into different instructions. But the

18:37

most interesting thing to me is it's not

18:39

enforced deterministically. It's purely

18:42

prompt based. It's purely in the system

18:44

prompt. It's purely because our models

18:47

are just good at instruction following

18:49

now. And this would not have worked a

18:51

year ago. This would not have worked two

18:52

years ago. Um there's tool descriptions

18:55

at the top of the system prompt. We're

18:57

kind of uh injecting the todos into the

19:01

system prompt. uh there's they're not

19:04

but it but it's not enforced in actual

19:06

code and again uh maybe there's other

19:09

agents that take an opposite path. Uh I

19:11

just found this pretty interesting that

19:13

this at least as a user makes a big

19:16

difference. And it

19:18

seems like it was very

19:21

simple to implement, almost a weekend

19:23

project someone did and seemed to work.

19:25

I could be wrong about that as well,

19:27

but, so yeah, it's literally a

19:30

function call. Uh

19:32

The first time you ask something,

19:35

the reasoning exports this to-do block,

19:37

and I'll show you what the structure is

19:38

on the next slide. Uh there's ids there.

19:42

There's some kind of structured schema

19:44

and determinism, but

19:47

it's just injected there. So here's an

19:51

example of what it could look like. You

19:53

get a version, you get your ID, uh a

19:56

title of the to-do, and then it could

19:58

actually inject evidence. So, this is uh

20:00

seemingly arbitrary blobs of data it

20:03

could use. And the ids are hashes that

20:06

it could then refer to

20:08

title, something human readable, but

20:11

this is just another way to structure

20:13

the data. And in the same way that

20:15

you're going to organize your desk when

20:17

you work, this is how we're trying to

20:19

organize the model.
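
For illustration, a to-do block with the fields described on the slide (version, hash-style ids, titles, optional evidence), with invented values:

```json
{
  "version": 3,
  "todos": [
    {
      "id": "a41f9c",
      "title": "Add retry logic to the upload endpoint",
      "status": "in_progress",
      "evidence": "intermittent failures observed in server/upload.py"
    },
    {
      "id": "7d02be",
      "title": "Update tests for the new retry path",
      "status": "pending"
    }
  ]
}
```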

20:21

So I think these are kind of

20:24

the four benefits we're getting. We're

20:26

forcing it to plan. Uh we get to resume

20:29

after crashes, if Claude Code fails. I

20:33

think UX is a big part of this. As a

20:35

user, you know how it's going. It's not

20:38

just running off in a loop for 40

20:40

minutes without any uh signal to you. So

20:42

UX is non-negligible. Even though UX

20:45

might not make it a better coding agent,

20:46

it might make it better for us all to

20:48

use. And the steerability one. So

20:52

here's two other parts that were under

20:55

the hood. Async buffer, so they called

20:57

it H2A. Uh it's kind of uh the IO

21:02

process and how to decouple it from

21:04

reasoning and and how to manage context

21:06

in a way that you're not just stuffing

21:08

everything you're seeing in the terminal

21:09

and everything back into the model,

21:11

which again context is our biggest enemy

21:13

here. It's going to make the model

21:14

stupider. So we need to uh be a little

21:17

bit smart about that and and how we do

21:20

compact and how we do summarization. So

21:23

here you see when it reaches capacity it

21:24

kind of drops the middle, summarizes the

21:26

head and tail. Um then we have the

21:30

that's the context compressor there. So

21:32

what is the limit? 92%, it seems like,

21:36

something like that. And how does

21:39

it save long-term storage?

21:42

That's actually another kind of

21:44

advantage of bash in my opinion and

21:46

having a sandbox. I would even make a

21:48

prediction here that all

21:50

ChatGPT windows, all Claude windows, are

21:52

going to come with a sandbox in the near

21:54

future. It's just so much better because

21:56

you can store that long-term memory. And

21:59

I do this all the time. I have

22:01

Claude Code skills for deep research and

22:03

stuff like that. And I'm always

22:04

instructing it to save markdown files

22:06

because the shorter the context, the

22:08

quicker it is and the smarter it is.
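
For illustration, one plausible reading of that compactor as a Python sketch; the threshold, helpers, and the keep-the-ends, summarize-the-middle split are assumptions, and the 92% figure is the speaker's estimate:

```python
# Hypothetical sketch of context compaction: near the window limit, keep the
# ends of the conversation verbatim and replace the bulky middle with a summary.
COMPACT_THRESHOLD = 0.92  # speaker's estimate of when Claude Code compacts

def maybe_compact(messages, count_tokens, summarize, window, keep_head=5, keep_tail=10):
    if count_tokens(messages) < COMPACT_THRESHOLD * window:
        return messages  # enough room left; do nothing
    head, middle, tail = messages[:keep_head], messages[keep_head:-keep_tail], messages[-keep_tail:]
    note = {"role": "user", "content": "Summary of earlier work: " + summarize(middle)}
    return head + [note] + tail
```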

22:12

So this is what I'm most excited about.

22:15

We don't need DAGs like this.

22:18

I'll give you a real

22:19

example. Uh so some users at prompt

22:23

layer uh different agents like customer

22:26

support agent basically everybody was

22:28

building DAGs like this for the last two

22:30

two and a half years. Uh and it was

22:34

crazy. Hundreds of nodes of okay this if

22:38

this user wants a refund route them to

22:40

this prompt if they want this and a lot

22:43

of uh classifying prompts. The advantage

22:46

of this is you can kind of guarantee

22:47

there's not going to be hallucinations

22:49

or guarantee there's not going to be

22:52

refunds to people who shouldn't be

22:53

having refunds. And it kind of

22:56

solves the prompt injection problem

22:57

because if you're in a prompt that

22:59

purely classifies it as X or Y injecting

23:02

doesn't really matter especially if you

23:03

throw out the context. Now we kind of

23:06

brought back bring back that attack

23:08

vector but the but the major benefit is

23:10

we don't have to deal with this web of

23:13

engineering uh madness and uh it just

23:16

it's 10x easier to develop these things

23:18

10x more maintainable and it actually

23:20

works way better because our models are

23:22

just good now.

23:24

So this is kind of the takeaway:

23:27

rely on the model. When in doubt,

23:31

don't try to think through every

23:34

edge case and think through every if

23:35

statement. Just rely on the model to

23:38

explore and figure it out. And I was

23:40

actually two days ago, I think, or

23:42

yesterday, sometime this week, I was

23:44

doing an experiment on our dashboard to

23:48

add like trying these browser agents.

23:51

And I wanted to see if I could add

23:52

little titles to all our buttons and it

23:55

would help the agent navigate our

23:56

website automatically. And it actually

23:59

made it worse, surprisingly. Uh, and

24:01

maybe I could run it again and maybe I

24:03

did something wrong with this test, but

24:04

it made the agent navigate PromptLayer

24:06

worse because it was getting distracted

24:09

because I was telling it you have to

24:10

click this button, then you have to

24:11

click this button and then

24:14

it didn't know what to do. So, it's

24:16

better to rely on exploration. You have

24:18

a question?

24:19

>> Yeah, I'll push back a little bit,

24:22

>> Please. >> I'll admit, any

24:25

scaffolding we create today to resolve

24:29

the idiosyncrasies of limitations will

24:33

be obsolete in 3 to 6 months.

24:36

Even if that's the case, they help a

24:38

little bit today. How do you balance

24:41

that like wasted engineering to solve a

24:44

problem we only have for three months

24:46

>> it's a great question so just to repeat

24:48

uh the question is basically

24:52

what is the trade-off between solving

24:54

the actual problems we have today and if

24:55

you're relying on the model that can't

24:57

do it yet but it'll be able to do it in

24:59

three months, right? Um it's case by

25:02

case. It depends what you're building.

25:03

If you're building a chatbot for a bank,

25:05

you probably do want to be a little bit

25:07

more careful. To me, the happy

25:10

middle ground is to use this agent

25:14

paradigm of a master while loop and tool

25:17

calls, but make your tool calls very

25:19

rigorous. So I think it's okay to have a

25:22

tool call that looks like this or looks

25:24

like half of this uh in the same way

25:26

that claude code uses read as a tool

25:29

call or grep as a tool call. So for the

25:32

edge cases,

25:34

throw it in a structured tool that you

25:36

can then eval and version and stuff like

25:38

that. I'm going to talk

25:39

a little bit more about that later, but

25:41

throw it in that structured tool. But

25:43

for everything else, uh for the

25:45

exploration phase, leave it to the model

25:48

or throw some system prompt. Uh so

25:54

it's a trade-off and it's very use case

25:55

dependent, but I think it's a good

25:57

question. Thank you. So yeah, uh just

26:01

back to Claude Code. We're

26:04

getting rid of all this stuff. We're

26:05

saying we don't want ML-based intent

26:07

detection. We don't want regex. We don't

26:08

want, I mean, it uses regex a little

26:10

bit, but we don't want regex baked into

26:12

it. We don't want classifiers. And

26:14

there was a long time we actually built

26:16

a product for PromptLayer. We never

26:18

released it because there's only a

26:19

prototype of using an ML-based, like a

26:22

non-LLM-based, classifier in your prompt

26:25

pipeline instead of LLMs. A lot of people

26:27

have a lot of success with it, but

26:30

it feels more and more like it's not

26:33

going to be that helpful unless cost is

26:34

a huge concern for you. And even then

26:36

the cost of smaller models is mattering less

26:39

and less, as kind of financial

26:43

engineering between all these companies

26:44

pays for our tokens. Um so Claude does

26:50

also this smart thing I think with the

26:51

trigger phrases. You know, you have

26:53

think, think hard, think harder, and

26:55

ultrathink is my favorite. And this

26:58

lets us use the reasoning budget, the

27:01

reasoning token budget as another

27:03

parameter that the model can adjust. And

27:05

this is actually the model can adjust

27:07

this, but this is how we force it to

27:08

adjust. And as opposed to you could make

27:11

a tool call for

27:13

hard planning. And actually, there's

27:15

some coding agents that do this. or you

27:17

can uh let the user specify it and then

27:20

just on the fly change it.

27:23

So this is one of the biggest

27:26

topics here. Sandboxing and permissions.

27:28

I'm going to be completely honest, it's

27:30

the most boring part of this to me

27:32

because I just run it on YOLO mode half

27:34

the time. Um it's uh

27:39

some people on our team actually dropped

27:40

all their local databases. So you do

27:42

have to be careful. So, you know, we

27:46

don't run YOLO mode with our enterprise

27:47

customers, obviously. But I

27:52

think this stuff feels like it's

27:54

going to be solved, but we do need to

27:55

know how it works a little bit. So

27:58

there's a big issue of prompt

28:01

injection from the internet. If you're

28:03

connecting this agent that has shell

28:07

access and you're doing web fetch that's

28:10

a pretty big attack vector. Uh, so

28:12

there's some containerization of that.

28:14

There's blocking URLs. You could see

28:17

Claude Code's pretty annoying about: can I

28:19

fetch from this URL? Can I do this? And

28:21

it kind of puts it into a sub agent. And

28:24

yeah, most of the

28:26

complex code here is in this sandboxing

28:29

and permission set.

28:31

I think there's this whole pipeline to

28:33

gate bash commands. So, depending on

28:37

the prefix is how it goes through the

28:41

sandboxing environment and a lot of the

28:43

other models work differently here. Uh

28:45

but this is how Claude Code does it. I'll

28:46

explain the other ones later at the end.

28:51

The next topic uh of relevance here is

28:53

sub-agents. So this is going back to

28:55

context management and this problem

28:57

we keep going back to: the longer the

28:59

context, the stupider our agent is.

29:02

This is an answer to it. So,

29:04

using sub-agents for specific tasks. And

29:07

the key with the sub-agent is it has its

29:09

own context and it feeds back only the

29:12

results and this is how you don't

29:13

clutter it. So we've got the researcher;

29:15

these are just four examples: researcher,

29:17

docs reader, test runner, code reviewer.

29:20

in that example I was talking about

29:21

earlier when I added all the tags to our

29:24

website to let the agent do it better:

29:27

obviously, I used a coding agent to do

29:29

that and I said read our docs first and

29:31

then do it and it's going to do this in

29:33

a sub agent. It's going to feed back the

29:35

information. And the key thing here

29:38

is the forks of the agent and how we

29:40

aggregate it back into our main context.

29:44

So here's an example. I think this is

29:46

actually very interesting. I want to

29:47

call out a thing or two here. So task is

29:50

what a sub-agent is. We're giving Task

29:53

two things: a description and a prompt.

29:56

The description is what the user is

29:58

going to see. So you're going to say

30:00

task

30:02

uh find default chat context

30:04

instantiation or something. And then the

30:06

prompt you're going to give a long

30:08

string which is really interesting

30:09

because now we have the coding agent

30:11

prompting its own agents. And I've

30:14

actually used this paradigm in agents

30:16

I've built for our product.

30:20

You can just have the agent stuff as

30:23

much information as it wants in this

30:24

string. And if we're going back to

30:26

relying on the model if this task

30:28

returns an error, now stuff even more

30:31

information and let it solve the

30:33

problems. It's better to be flexible

30:35

rather than rigid.
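
For illustration, the shape of that Task tool call (a description plus a model-written prompt); the prompt text here is invented:

```json
{
  "name": "Task",
  "input": {
    "description": "Find default chat context instantiation",
    "prompt": "Search this repo for where the default chat context is instantiated. Start by grepping for 'ChatContext('. Report the file, the line, and any config flags that affect it. Read-only: do not edit anything."
  }
}
```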

30:37

If I was building this I would consider

30:39

switching a string to maybe an object

30:41

here uh depending on what you're

30:43

building and maybe let it give actually

30:44

more structured data. Yes. So I can see

30:48

this prompt has quite a couple

30:49

sentences. Is that in the main agent? Is

30:52

that taking the context of the main

30:54

agent or is there some sort of

30:56

intermediate step where the sub agent

30:58

double reads over you know like what the

31:01

main agent is doing and then generates

31:06

>> right? So the question is does the task

31:09

just get the prompt here or does it also

31:12

get your chat history? Is that the

31:13

question?

31:15

The question is: I have

31:18

my main agent. Is all of this in the

31:20

system prompt of the main agent to

31:22

inform how that prompts the sub agent?

31:24

>> No. No. Like it's not in the system.

31:27

It's in the whole context. Is all of

31:29

this context of the main agent

31:33

>> the task it calls or or you're saying

31:36

the structure for the task

31:37

>> this whole JSON right or

31:39

>> yes. So this is a tool call. So the tool

31:42

call structure of what a task is, is in

31:45

the main agent. And then these are

31:48

generated on the fly. Uh so as you want

31:50

to run a task, it's generating the

31:52

description and the prompt. Task is a

31:55

tool call. They could be run in parallel

31:56

and then they're returning the results

31:58

of it. Hopefully that helps.

32:03

Um so we could go back to the system

32:07

prompt. So there's some leaks of the

32:09

Claude Code system prompt. So that's what

32:10

I'm basing this on. Uh you can find it

32:13

online. Here are some things I

32:16

noted from it. Uh concise outputs. Uh

32:20

obviously don't give anything too long.

32:23

No "here is" or "I will": just do the

32:26

task the user wants. Uh kind of pushing

32:29

it to use tools more, instead of

32:31

text explanations. Obviously, I think

32:34

when we we've all built coding agents

32:36

and when we do it, it usually says,

32:38

"Hey, I want to run this SQL." No, push

32:40

it to use the tool. Um,

32:44

matching the existing code, not adding

32:46

comments. This one does not work for me,

32:48

but uh running commands in parallel

32:51

extensively and then the to-dos and

32:53

stuff like that. There's a lot that you

32:55

can nudge it to do with the system

32:57

prompts. But as you see, I think there's

32:59

a really interesting point to the

33:01

earlier question you had about where

33:04

what's the trade-off between DAGs and

33:06

loops.

33:08

A lot of these things you could see are

33:11

feel like they came from someone using

33:13

Claude Code and saying, "Oh, if only

33:16

it did this a little less or if it did

33:17

this a little bit more." That's where

33:20

prompting comes in because it's so easy

33:21

to iterate and it's not you're not it's

33:24

not a hard requirement but if only it

33:27

said here is a little bit more. It's

33:28

okay to say it sometimes but

33:32

All right, skills. Skills are great. It's

33:34

slightly newer. I honestly got

33:37

convinced of it only recently. So good.

33:39

I built these slides with skills. uh

33:42

it's basically I think in the context of

33:44

this talk about architecture, let's

33:46

think of it as an extendable system

33:49

prompt. So in the same way that we don't

33:52

want to clutter the context, there's a

33:54

lot of different type of tasks you're

33:55

going to need to do where you want a lot

33:58

more context. So this is how we give

34:00

cloud code a few options of how it could

34:02

tap into more information. Here are some

34:05

examples. Uh, I use this for I have a

34:09

skill for docs updates to tell it my

34:11

writing style and my product. So, if

34:14

I want to do a docs update, I say use

34:16

that skill. Load in that skill. Uh,

34:18

Editing Microsoft Office,

34:22

Microsoft Word and Excel. I don't

34:25

use this, but I've seen a lot of people

34:27

using it. It kind of like decompiles the

34:29

file; it's really cool. But it lets

34:31

Claude Code do this. Design style guide:

34:34

this is a common one. Deep research: I

34:36

the other day I threw in an article

34:40

or GitHub repo on how deep research

34:42

works, and I said: rebuild this as a Claude

34:44

Code skill. It works so well, it's amazing.
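
For illustration, a skill is roughly a folder with a SKILL.md whose short description gets surfaced to the model; here is a hypothetical docs-update skill, with invented contents:

```markdown
---
name: docs-update
description: Use when editing product documentation; loads the writing style and docs conventions.
---

# Docs update skill

When changing anything under `docs/`:
- Match the existing voice: second person, present tense, no marketing language.
- Start every new page with a one-line summary.
- Run the docs build before finishing and fix any warnings.
```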

34:49

So, unified diffing. I think this is worth

34:50

its own slide. Uh it's very obvious

34:53

probably not too much we need to talk

34:55

about here but

34:57

it makes this so much better and it

35:00

makes the token usage smaller. It makes

35:03

it faster and makes it less prone to

35:05

mistakes, like that example I gave

35:07

when you rewrite an essay versus marking

35:10

it with a red line. It's just better. I

35:12

highly recommend using diffing in any

35:15

agents you're doing. Unified diff is a

35:17

standard. When I looked into a lot of

35:18

these coding agents, some actually built

35:20

their own kind of standard, with

35:24

slight variations on unified diff,

35:26

because you don't always need the line

35:27

numbers. But unified diff works. You

35:31

had a question

35:32

>> to go back to skills.

35:35

I don't know if anyone's seen

35:38

this: Claude Code warns you

35:40

in yellow text if your CLAUDE.md is

35:42

like greater than 40k characters and so

35:44

I was like, okay. Let me

35:46

break this down into skills. So I

35:48

spent some time and then Claude ignored

35:51

all of my skills and so I put them in

35:53

some. So what am I missing? I don't know. Skills

35:56

[clears throat] feel globally

35:59

misunderstood, or, I don't know,

36:01

I'm missing something. Help me

36:03

understand. [laughter]

36:06

>> Yeah. So the question was: okay, so

36:10

Claude Code's CLAUDE.md tells you

36:12

when it's too long. So uh you move it

36:16

into skills and then it's not

36:17

recognizing the skills and not picking

36:18

it up when it's needed.

36:21

>> Yeah.

36:22

Take that up with the Anthropic team, I'd

36:24

say. Uh but that's also a good example

36:27

of maybe the system prompt

36:29

>> that was the intention like skills you

36:31

need to invoke them and like the agent

36:34

itself shouldn't like just call them all

36:38

the time,

36:39

>> Right? It does give a description of

36:42

each skill to the model or it should uh

36:46

tell it, okay, here's like a one-liner

36:48

about each skill. So theoretically in a

36:50

perfect world it would pick up all the

36:51

skills all the time. But you're right, I

36:53

generally have to call the skill myself

36:54

manually. But I think this is a good

36:58

tieback into when is prompting the right

37:01

solution or when is the DAG the right

37:04

solution or maybe this is a model

37:06

training problem. Maybe they need to do

37:08

a little bit more in post-training of

37:11

getting the model to call the skills; it's

37:15

almost like calling a tool call. You

37:17

have to know when to call it. So maybe

37:19

this is just uh a functionality that's

37:21

not that good yet, but I think the

37:22

paradigm is very interesting, but it's

37:24

not perfect as we're learning.

37:28

So, diffing we just talked about. What's

37:31

next? So this is more opinion-based, but

37:34

where I see these things going and where

37:35

the next kind of innovations might

37:37

likely be. So

37:41

I think there are two schools of

37:43

thought here. A lot of people think

37:45

we're going to have one master loop with

37:46

hundreds of tool calls and just tool

37:48

calling is going to get much better.

37:50

That's highly likely. Uh I take the

37:53

alternate view which I think we need to

37:56

reduce the tool calls as much as

37:58

possible and just go back to just bash

38:00

and maybe even put scripts in the local

38:03

directory. I think I am a proponent

38:06

of one mega tool call instead of a lot

38:08

of tool calls. Maybe not actually one. I

38:10

actually think that slide I showed you

38:12

before is probably a good list, but a

38:15

lot of people think we need hundreds of

38:17

tool calls. I just don't think it's

38:18

going there. Adaptive budgets, uh,

38:21

adjusting reasoning, we do this a little

38:23

bit, uh, the thinking and ultra think

38:25

and stuff like that, but I I think

38:28

reasoning models as a tool makes a lot

38:30

of sense as a paradigm. I

38:33

think a lot of us would make a trade-off

38:35

of a 20 times quicker model with

38:38

slightly stupider results and being able

38:40

to make a tool call to a very good

38:43

model. I think that's a trade-off we

38:44

we'd make in a lot of cases. Maybe not

38:46

our planner. Maybe we go to the planner

38:48

first with GPT-5.1 Codex or Opus or

38:51

whatever, when the new Opus comes out.

38:53

Uh but

38:56

I think there's a lot of mix

38:58

and matching we can do and that's I

38:59

think the next frontier and I think the

39:01

last frontier

39:03

I think there's a lot we can learn from

39:04

to-do lists, and new first-class

39:07

paradigms we can build. Skills is another

39:09

example of a first class paradigm we can

39:11

kind of try to build into it. Maybe it

39:13

doesn't work perfectly uh but I think

39:15

there's, I think, a lot of new

39:16

discoveries to be made there in my

39:18

opinion. Do I have them? I don't know.

39:21

So now, for the

39:24

latter part of this talk I want to talk

39:26

about the other frontier agents and the

39:28

other philosophies they've designed,

39:30

the philosophies they've chosen. And

39:34

we all have the benefit we can mix and

39:36

match: when we're building our agent, we

39:38

could do whatever we want and learn from

39:39

the best and the frontier labs are very

39:41

good at this so

39:45

uh something I like to go back to a lot

39:47

I call it the AI therapist problem;

39:49

maybe there's a better name to give it

39:51

uh but I believe there's a lot of

39:54

problems, the most interesting AI

39:55

problems around, where there isn't a global

39:58

maximum. Meaning,

40:00

all right, we're in New York City. If I

40:02

need to see a therapist, there's six on

40:04

every block here. There's no global

40:06

answer for what the best therapist is.

40:09

There's different strategies. There's a

40:11

therapist that does meditation or CBT or

40:14

maybe one that gives you ayahuasca. And

40:16

these are just kind of like

40:17

different strategies for the same goal

40:20

in the same way that if you're building

40:22

an AI therapist, there isn't a global

40:24

maxima. This is kind of my anti-AGI

40:27

take, but this is also the take to say

40:29

that when you're building these

40:31

applications, taste comes into it a lot

40:33

and design architecture matters a lot.

40:35

You can have five different coding

40:37

agents that are all amazing. Nobody

40:39

knows which today. Nobody knows which

40:41

one's best to be honest. I don't think

40:43

Anthropic knows. I don't think OpenAI

40:44

knows. I don't think Sourcegraph knows.

40:46

Nobody knows who has the best, but

40:48

some are better at some things. I

40:50

personally like Claude Code for, like I said,

40:53

running my local environment or

40:55

using git or using these kind of like

40:57

human actions that require back and

40:59

forth. But I go to Codex for the hard

41:01

problems, or I go to Composer from Cursor

41:04

because it's faster. And there's a lot

41:07

basically all this to say there's value

41:10

in having different philosophies here.

41:11

And I don't think there's going to be

41:13

one winner to this. I think there's

41:14

going to be different winners for

41:15

different use cases. And and this is not

41:18

just coding agents, by the way. This is

41:19

all AI products. This is this is kind of

41:21

why our whole company focuses on domain

41:24

experts and bringing in the PM and the

41:26

subject matter expert into it

41:29

because that's how you build

41:30

defensibility.

41:31

So here are the perspectives. The way I

41:33

see it, this is not a complete list of

41:35

coding agents, but these are the ones

41:37

that I think are the most interesting.

41:39

Claude Code: I think, to me, it wins

41:42

in user friendliness and simplicity. Uh

41:45

like I said if I'm doing something that

41:46

requires a lot of applications,

41:49

git's just the best example. If I want

41:51

to make a PR, I'm going to Claude Code. Codex:

41:54

context; it's really good at context

41:56

management. Uh it feels powerful. Do I

42:00

have the evidence to show you that it's

42:01

more powerful? Probably not. But uh it

42:03

feels that way to me, and the market feels it.

42:06

There's a whole other conversation

42:08

here to say the market knows best and

42:10

what people talk about knows best but I

42:12

don't know if they know either. Cursor

42:15

IDE is kind of that perspective model

42:18

agnostic. It's faster. Factory uh makes

42:21

Droid uh great team. They were here too.

42:23

They really

42:26

specialize these Droid sub-agents they

42:28

have. So that's kind of their edge and

42:31

that's maybe a DAG conversation too or

42:33

maybe a model training one. Cognition:

42:36

so Devin, kind of this end-to-end

42:38

autonomy, self-reflection. AMP, which I'll

42:41

talk about more in a second. They have a

42:43

lot of interesting perspectives and

42:44

actually I find them very exciting these

42:46

days. Free, model-agnostic, and

42:50

there's a lot of UX sugar for users and

42:52

I love their design; their

42:54

talks at this conference,

42:56

they have very unique

42:58

perspectives. So let's start with Codex,

43:00

because it's a popular one

43:03

So it's pretty similar to Claude Code:

43:06

same master while loop; most of these do,

43:09

because that's just the winning

43:10

architecture. Interestingly, a Rust

43:13

core. Uh the cool thing is it's open

43:15

source, so you can actually use Codex to

43:18

understand how Codex works, which is

43:19

kind of what I did. Um it's a little

43:22

more event-driven; a little more work

43:25

went into concurrent threading here uh

43:27

kind of submission queues, event outputs:

43:30

kind of the thing I was talking

43:32

about with the IO buffer in Claude Code.

43:36

I think they do it a little bit

43:38

differently. Uh sandboxing is very

43:40

different. So theirs is more, I mean,

43:44

you can see here, macOS Seatbelt and

43:46

Linux Landlock; theirs is more kernel-based,

43:48

and then state; it's all

43:52

under threading. And permissions is

43:55

how I'd say it's mostly different and

43:56

then the real difference is the model to

44:00

be honest. So this is

44:04

actually me using Claude Code to

44:06

understand how Codex works. So you

44:09

see we have a few Explores. I didn't talk

44:11

about Explore, but it's

44:14

another sub-agent type; as I

44:17

mentioned these go in and out. Uh but

44:18

yeah, this is researching Codex with

44:20

Claude Code. It's always a fun thing to

44:22

do. So let's talk about AMP.

44:25

So this is Sourcegraph's coding agent.

44:28

It has a free tier. That's just a cool

44:31

perspective in my opinion. Uh they

44:33

leverage kind of these excess tokens uh

44:36

from providers and they give ads. So, we

44:37

actually have an ad on them. I think

44:39

it's cool. I'm pro-ad. A lot of people

44:41

are anti-ad. I think it's one of my hot

44:43

takes, but I like it. They don't have a

44:45

model selector. This is very

44:47

interesting, too. This is its own

44:49

perspective. Uh, it actually helps them

44:50

move faster because you're you have less

44:54

of an exact expectation of what the

44:56

output is because, you know, they might

44:58

be switching models here and there. So,

45:00

that changes how they develop. And then,

45:03

uh, I think their vision is pretty

45:05

interesting.

45:07

Their vision is: how do we build not

45:12

just the best agent but how do we build

45:14

the agent that works with the most

45:16

agent-friendly environments? And actually,

45:18

factory gave a talk similar to this as

45:20

well. But how do you build a

45:22

hermetically sealed, like, coding

45:26

repo that the agent can run tests on? How

45:28

do you build the feedback loop because

45:29

that's kind of the holy grail. That's how

45:31

we build an autonomous agent and how do

45:33

we uh I'd love to see the front-end

45:35

version of this: how do we let it look at

45:37

its own design and make it better and go

45:39

back and forth and this is kind of their

45:41

guiding philosophy and you could boil it

45:43

down to the agent perspective as I've

45:46

been calling it.

45:47

I think they do interesting stuff with

45:49

context. So, we're all familiar with

45:52

compact. It's the worst. You have to

45:54

wait 10. I don't know why it takes so

45:55

long. Uh and if you're not familiar,

45:58

it's summarizing your chat window when

45:59

the context gets too high and giving the

46:01

summary. So, they have something called

46:03

handoff, which makes me think of,

46:05

if anyone was a Call of Duty

46:07

player back in the day, switching weapons.

46:10

It's faster than reloading. And uh

46:12

that's what handoff is. You're

46:14

just starting a new thread and you're

46:15

giving it the information it needs for a

46:16

new thread. That feels like the winning

46:19

strategy to me. Could be wrong, but

46:21

maybe you need both. That's where

46:23

they're pushing it. And I kind of like

46:25

that. I They get they give a very fresh

46:27

perspective. So, the second thing is

46:29

model choice. This is the reasoning

46:32

knobs uh and their view on it. They have

46:34

fast, smart, and Oracle. So, they lean

46:38

even more heavily into we have different

46:41

models. We're not telling you what

46:42

Oracle is. They tell you, but we're

46:44

willing to switch what Oracle is, but

46:46

we're going to use Oracle when we have a

46:48

very hard problem.

46:50

So, yeah. So, that's AMP. Let's go to

46:54

Cursor's Agent. I think Cursor's agent

46:56

has a very interesting perspective here.

46:58

First, obviously, it's UI: UI first,

47:01

not CLI. I think they might have a CLI,

47:03

not entirely sure, but the UI is the

47:05

interesting part. It's just so fast.

47:07

Their new model, Composer: it's

47:09

distilled. They have the data.

47:11

They actually made, in my opinion,

47:13

people interested in fine-tuning again.

47:15

Fine-tuning: it was almost like we'd never

47:18

recommend it to our customers, but

47:20

composer shows you that you can actually

47:22

build defensibility based on your data

47:24

again, which is surprising. But

47:28

yeah, Cursor's agent, Composer: I've

47:30

been almost switching completely to it

47:33

since because it's just so fast. It's

47:34

almost too fast. I accidentally pushed to

47:36

master on one of my personal projects.

47:38

So you don't want that

47:40

always. But Cursor was just the crowd

47:42

favorite, and I want to give a lot of

47:45

uh props to their team. They built

47:48

iteratively. The first version of cursor

47:50

was so bad, and we all used it; I

47:53

used it because it's a VS Code fork.

47:55

I have nothing to lose and it's gotten

47:57

so good. It's such a good piece of

47:58

software and it's a great team and uh

48:01

But I'll say the same can be said

48:03

about OpenAI's Codex models. They're

48:05

not quite as fast, but they are

48:08

optimized for these coding agents and

48:10

they are distilled. And I could see

48:12

OpenAI coming out with a really fast

48:14

model here because they also have the

48:16

data.

48:18

So here's a picture. Um I think you

48:20

could this is a picture they put on

48:23

their blog and you could see what their

48:25

perspective is on coding agents here

48:27

just based on the fact that they show

48:28

you the three models they're running.

48:30

So, they're offering composer, but

48:32

they're letting you use the

48:32

state-of-the-art because they know that

48:34

maybe GPT-5.1 is better at planning; or

48:38

here it's five, but now we have 5.1.

48:42

So, this begs the big question: which

48:45

one should we all use? Which

48:46

architecture is best? What should we do?

48:49

And uh my opinion here is that

48:53

benchmarks are pretty useless.

48:55

Benchmarks have become marketing for a

48:57

lot of these model providers. every

48:58

model beats the benchmarks. I don't know

49:01

how that happens, but

49:04

I think there's a world where

49:06

evals matter here. And

49:08

the question is what you can eval. The

49:10

question is how this whole

49:14

simple while loop architecture that I've

49:16

been kind of trying to push based on my

49:18

understanding of it actually makes it

49:20

harder to eval because if we're relying

49:22

more on model flexibility, how do you

49:25

test it? You could run an integration

49:27

test, kind of this end-to-end test, and

49:29

just say, "Does it fix the problem?"

49:31

That's one way to do it. You could break

49:33

it up. You could kind of do point in

49:34

time snapshots and say, "Hey, I'm going

49:37

to give a context to my chatbot from

49:39

like a half-finished conversation where I

49:41

know it should be running a specific

49:42

tool call." I could run those. Uh I or I

49:46

could maybe just run a back test and

49:47

say, "How how often does it change the

49:49

tools?" I think there's also another

49:52

concept here that's starting to be

49:53

developed called agent smell or at least

49:56

I'm calling it agent smell. So run an

49:58

agent and see: how many times does it

50:00

call a tool? How many times does it

50:01

retry? How long does it take? And these

50:03

are all surface-level metrics, but they're

50:05

really good for sanity checking.
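
Since "agent smell" is just surface-level metrics, a minimal sketch might look like the following. The message schema here (`type`, `name`, `args`, `ts` fields) is a hypothetical logging format, not anything a specific agent actually emits; adapt it to whatever your logging layer records.

```python
from collections import Counter

def agent_smell(messages):
    """Surface-level 'agent smell' metrics for one logged agent run."""
    tool_calls = [m for m in messages if m.get("type") == "tool_call"]
    by_tool = Counter(m["name"] for m in tool_calls)
    # Crude retry heuristic: the same tool called twice with identical args.
    seen, retries = set(), 0
    for m in tool_calls:
        key = (m["name"], str(m.get("args")))
        retries += key in seen
        seen.add(key)
    duration = messages[-1]["ts"] - messages[0]["ts"] if messages else 0
    return {"tool_calls": len(tool_calls), "by_tool": dict(by_tool),
            "retries": retries, "seconds": duration}
```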

50:07

And these things are hard to eval. There's a

50:09

lot that goes into it. I'll show you an

50:11

example of what I did just to kind of

50:14

dive into it. But on that subject,

50:17

maybe I'll just say one more thing. I

50:20

would break it down. My mental model

50:22

is you could do an end-to-end test, you

50:24

can do a point-in-time test, or, what I

50:27

most often recommend is just do a back

50:30

test. Start with back tests: start

50:32

capturing historical data and then just

50:34

rerun it.
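
A minimal sketch of that back test, assuming you've been capturing production runs as JSON lines. The record format and `run_agent` entry point are both made up; the comparison deliberately answers only the "how often does it change the tools?" question from above.

```python
import json

def back_test(log_path, run_agent):
    """Rerun captured production inputs and report tool-call drift."""
    drift = []
    for line in open(log_path):
        record = json.loads(line)         # one captured historical run
        new = run_agent(record["input"])  # rerun on today's agent/model
        old_tools = [c["name"] for c in record["tool_calls"]]
        new_tools = [c["name"] for c in new["tool_calls"]]
        if old_tools != new_tools:        # did it change the tools?
            drift.append({"input": record["input"],
                          "old": old_tools, "new": new_tools})
    return drift
```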

50:37

So let me give you this example. What I have

50:40

here is a screenshot of PromptLayer.

50:42

Our eval product is

50:44

also just a batch runner. So you could

50:45

kind of just run a bunch of columns

50:47

through a prompt. But in this case, I'm

50:49

running it through not a prompt, but

50:51

Claude Code. So I just have a

50:53

headless Claude Code, and I'm taking all

50:56

these providers, and my headless

50:58

Claude Code prompt says (I think I have it on the

50:59

next slide): search the web for the model

51:02

provider; it's given to you as a

51:04

variable. Find the most recent and

51:06

largest model released and then return

51:08

the name. So I don't know what it's

51:09

doing. It's doing web search. I'm not

51:11

even caring about that. This is an

51:12

endto-end test. This is how we kind of

51:16

try doing cloud code. And I actually

51:17

think there's a lot about putting cloud

51:19

code into your workflows and those type

51:22

of headless SDKs. I'll talk about that I

51:24

think next slide. But

51:27

The main takeaway here is you can

51:30

start to do end-to-end tests. You

51:32

can look at it from a high level, do an

51:34

agent smell check, and then look into

51:36

the statistics on each row and see how

51:38

many times it called a tool.
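
As a sketch, driving that headless end-to-end eval could look like this. The prompt is paraphrased from the slide, the provider list is illustrative, and `claude -p` is the CLI's non-interactive print mode; check `claude --help` for the exact flags your installed version supports.

```python
import subprocess

# Hypothetical paraphrase of the eval prompt from the slide.
PROMPT = ("Search the web for the model provider given below. Find the "
          "most recent and largest model released, and return only its "
          "name.\nProvider: {provider}")

def run_row(provider: str) -> str:
    """Run one eval row through headless Claude Code and return stdout."""
    result = subprocess.run(
        ["claude", "-p", PROMPT.format(provider=provider)],
        capture_output=True, text=True, timeout=300)
    return result.stdout.strip()

for provider in ["OpenAI", "Anthropic", "Google"]:  # example columns
    print(provider, "->", run_row(provider))
```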

51:41

And going back, we've talked about

51:43

this a lot in this talk: rigorous tools.

51:47

The tools can be rigorously tested.

51:49

This is how you offload the determinism

51:53

to different parts of your system.

51:56

You test the tools. You test the heck

51:58

out of your tools. Look at them

51:59

like functions. It's an input and an

52:01

output.
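
Since tools are just functions, they unit-test like functions. Both the tool and the test below are illustrative stand-ins, not PromptLayer or Claude Code APIs.

```python
def search_docs(query: str, limit: int = 5) -> list[str]:
    """Hypothetical agent tool: return up to `limit` matching doc titles."""
    corpus = ["Getting started", "Evals guide", "Agent smell checklist"]
    return [t for t in corpus if query.lower() in t.lower()][:limit]

def test_search_docs():
    assert search_docs("eval") == ["Evals guide"]  # deterministic in/out
    assert search_docs("zzz") == []                # empty result, no crash
    assert len(search_docs("e", limit=2)) <= 2     # respects the limit
```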

52:03

If your tool is a sub-agent that runs, then we're in a kind of recursion

52:06

here because then you have to go back

52:07

and test the end-to-end thing. But for

52:09

your tools, I'll give you this example.

52:12

So in my coding agents, or my

52:18

agents in general, my autonomous agents,

52:20

if there's something very specific that

52:22

I want to output. In this case, if I

52:24

have a very specific type of email

52:26

format or type of blog post that I want

52:29

to write and I really want it to get my

52:31

voice right, I don't want to rely on the

52:33

model's exploration. I want to actually

52:35

build a tool that I can rigorously test.

52:38

So in this case, this is also just a

52:40

PromptLayer screenshot, but this is

52:43

a workflow I've built. It has an LLM

52:45

assertion that checks if the

52:47

email is up to my standards. If it's

52:49

good, it revises it. If it's not good,

52:51

it adds the missing parts, like the header

52:52

it missed, and revises it with

52:54

the same step.
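
A sketch of that flow: an LLM assertion gates the draft, missing parts get added, and both branches share the same revise step. `llm` stands in for any chat-completion call; the prompts and the rubric are assumptions, not the actual PromptLayer workflow.

```python
def email_workflow(draft: str, llm) -> str:
    """Assert, patch missing parts, then revise with one shared step."""
    verdict = llm("Check this email against my standards. Reply PASS, or "
                  "list the missing parts (greeting/body/signature):\n" + draft)
    if not verdict.strip().startswith("PASS"):
        # Feed the assertion's notes back in to add whatever was missing.
        draft = llm(f"Add these missing parts to the email: {verdict}\n"
                    f"---\n{draft}")
    # The same revise step on both branches, matching the diagram.
    return llm("Revise this email for tone and voice:\n" + draft)
```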

52:57

This is obviously a very simple example, but we have another

53:02

version for some of our SEO blog posts

53:04

that has like 20 different nodes and

53:07

writes an outline from deep research

53:09

and then fixes the conclusion and adds

53:12

links.

53:13

For the stuff where you have a very

53:15

specific vision, that's when testing

53:18

gets so much easier, because as you

53:20

can see, testing this sort of

53:23

workflow has fewer steps and less

53:25

flexibility. So this is an eval I made. I

53:28

start with a bunch of sample emails,

53:30

I run the

53:33

agentic workflow here, and I'm just

53:36

adding a bunch of heuristics. So this

53:38

is a very simple LLM-as-judge: does it

53:40

include three parts? This is

53:41

what I was testing for: the "Hi Jared"

53:44

greeting, the email body, and the signature. You can

53:46

get a lot more complicated. You could do

53:48

code execution. But I don't

53:51

know; LLM-as-judge is usually the easiest.
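
The three-part check as an eval column might look like this: run the judge over every output row and report a pass rate you can track run over run. `llm` is again a stand-in for whichever model call acts as the judge; the judge prompt is an assumption.

```python
# Hypothetical judge prompt for the three-part email rubric.
JUDGE = ("Does the email below contain all three parts: a greeting, "
         "a body, and a signature? Reply with exactly YES or NO.\n\n{email}")

def pass_rate(outputs: list[str], llm) -> float:
    """Fraction of eval rows that pass the LLM-as-judge rubric."""
    passes = sum(llm(JUDGE.format(email=o)).strip().upper().startswith("YES")
                 for o in outputs)
    return passes / len(outputs)  # iterate until this hits 1.0, as below
```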

53:53

But now, obviously, I could

53:55

keep running this until it's correct on

53:56

all of them and see my eval

53:59

over time. This is just from this

54:01

example. I got it to 100%. So that was

54:03

fun.

54:04

And then I want to add

54:07

another future-looking thing: keep an

54:09

eye on the headless Claude Code SDK. I

54:13

know there was a talk about it this

54:14

morning, so I won't

54:16

spend too much time on it, but it's

54:19

amazing. You just give a simple prompt

54:21

and it's just another part of your

54:23

pipeline. I use it for something I think I have

54:26

on the next slide: I have a GitHub

54:28

Action that updates my docs every day

54:31

and just reads all the commits we've

54:33

pushed to our other repos. And we have a

54:35

lot of commits going, and it just runs

54:37

Claude Code. The Claude Code run pulls down

54:39

all the repos, checks what's updated,

54:42

reads our CLAUDE.md to see if it should

54:44

even update the docs, then creates a PR.
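
A sketch of that nightly docs job, as a script a scheduled CI runner could execute from inside the docs repo. The repo names and prompt are made up; `claude -p` and `gh pr create` are real CLIs, but verify the exact flags, and Claude Code's file-edit permission settings, against their docs.

```python
import subprocess

def sh(*cmd):
    subprocess.run(cmd, check=True)

# Pull sibling repos so Claude Code can read the recent commits.
for repo in ["acme/api", "acme/sdk"]:  # hypothetical repos
    sh("git", "clone", "--depth", "50", f"https://github.com/{repo}")

sh("git", "checkout", "-b", "docs-bot/nightly")
sh("claude", "-p",
   "Read the recent commits in the cloned repos, check CLAUDE.md to see "
   "whether the docs need updating, and edit the docs accordingly.")
# check=True makes an empty commit fail loudly; fine for a sketch.
sh("git", "commit", "-am", "docs: nightly automated update")
sh("gh", "pr", "create", "--fill")  # opens the PR; a human reviews and merges
```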

54:47

So I think this unlocks a lot of things

54:50

and there's a possibility that we're

54:51

going to start building agents at a

54:53

higher order of abstraction and just

54:54

rely on Claude Code and these other

54:56

agents to handle a lot of the harness and

55:00

orchestration.

55:01

>> Are you reviewing those?

55:03

Yeah, [laughter]

55:06

It creates a PR. It

55:09

doesn't merge the PR.

55:12

So, here are my takeaways. Number one,

55:15

trust in the model. When in doubt,

55:18

rely on the model when you're building

55:19

agents. Number two, simple design wins.

55:23

Number one and number two kind of go

55:25

together here. Number three, bash is all

55:28

you need.

55:29

Go simple with your tools. Don't have 40

55:32

tools; have five or ten. Number four,

55:36

context management matters. This is the

55:38

boogeyman we're running from all the

55:41

time in agents at this point. Maybe

55:43

there'll be new models in the future

55:45

that are just so much better at context.

55:47

But there's always going to be a limit

55:49

because you're talking to a human. I

55:52

forget people's names if I meet too many

55:53

in one day. That's context management or

55:56

my stupidity. I don't know. And number

55:58

five, different perspectives matter in

56:01

agents. I think the engineering

56:04

brain doesn't always comprehend this as

56:07

much as it should, and I'm

56:10

an engineer so I'm also talking about

56:11

myself. But different perspectives

56:14

matter: there are different

56:18

ways to solve a problem where

56:20

no one way is better than the other, and you

56:22

probably want a mixture-of-experts

56:24

agent. I would love to have

56:26

mine run Claude Code and Codex and so on,

56:28

give me the output, and be considered a

56:30

team and maybe have them talk to each

56:32

other in a Slack-style message channel.

56:34

I'm waiting for someone to build that.

56:36

That would be great. But these are my

56:38

takeaways. Uh my bonus thing that I'll

56:40

show you is how I built this slide deck

56:42

using Claude Code. So I built a

56:46

Slidev skill. I basically told Claude

56:48

Code to research how Slidev works.

56:50

That's just the

56:52

library I made this in. I built a

56:54

deep research skill to research all

56:57

these agents and how they work. I built

56:58

a design skill, because I know when a

57:01

thing looks terrible or looks good, but

57:03

I'm not a good enough designer to figure it

57:05

out. So even for these boxes, I was just

57:07

like, "Oh, m make the box a little

57:09

nicer. Give it an accent color." Uh, so

57:12

yeah, this is how I built it. But again,

57:14

thank you for listening. Happy to

57:16

answer any questions. I'm Jared, founder

57:18

of Prompt Layer. Find me there.

57:20

[applause]

57:24

>> Yes.

57:25

>> Thank you. Great talk. So you

57:27

mentioned, regarding DAGs, basically

57:30

let's get rid of them, right? But

57:32

DAGs kind of enforce a sequential

57:36

execution path, right? I don't know,

57:39

a customer service agent asks for the

57:41

name, the email, in some sort of

57:46

sequence. So are you saying

57:50

just write this out,

57:54

that it should just be written out as a

57:57

plan for an agent to execute and just

58:00

trust that the model is going to be

58:02

calling those tools in that sequence?

58:04

How do we enforce the order?

58:07

>> Right. So the question was:

58:11

why do I keep talking about getting rid

58:13

of DAGs? How else are you supposed to

58:16

enforce a specific order for solving a

58:18

problem? So I think there are different

58:21

types of problems. So the problem of

58:25

building a general purpose coding agent

58:27

that we can all use to do our work and

58:30

even non-technical people can use

58:31

there's no specific sequence of steps for solving that

58:33

problem, which is why it's better to rely

58:36

on the model. If your problem was

58:40

to build

58:42

let's say, a travel itinerary,

58:46

it's more of a specific sequence, because you

58:48

have a deliverable that's always the

58:50

same. So there's a little bit more of a

58:52

DAG that could matter, but in the

58:54

research step of traveling, you probably

58:56

don't want a DAG because every city is

58:58

going to be different. So it really

58:59

depends on the problem you're solving.

59:01

If I wanted to make an agent for a

59:03

travel itinerary, I'd probably have

59:05

one of my tool calls be

59:08

a DAG that creates the output file,

59:10

because I want the output to look the

59:11

same, or that creates the plan. And then in

59:14

the system prompt, I could say always

59:15

end with the output, for example. But

59:19

you need to mix and match.

59:21

Every use case is different, but if you

59:22

want to make something general purpose,

59:24

my take is to rely more on the model and

59:28

simple loops and less on a DAG.
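
A sketch of that mix-and-match, with hypothetical names throughout: the loop stays model-driven, but the deliverable goes through one deterministic, testable tool so the output always looks the same.

```python
def render_itinerary(city: str, days: list[dict]) -> str:
    """Deterministic 'DAG as a tool': fixed output format, easy to test."""
    lines = [f"# {city} itinerary"]
    for i, day in enumerate(days, 1):
        lines.append(f"Day {i}: {day['morning']} / {day['evening']}")
    return "\n".join(lines)

# The agent explores freely, but the system prompt pins the ending.
TOOLS = {"render_itinerary": render_itinerary}
SYSTEM = ("Research the trip however you like, but always end by calling "
          "render_itinerary to produce the final plan.")
```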

59:31

>> Cool. Any other questions?

59:34

Yes.

59:34

>> Yeah. Building on that point, like do

59:37

you think we're heading towards a world

59:38

where you're not actually going

59:39

to call the API through code and that

59:41

most LLM calls happen by triggering Claude

59:44

Code and just writing

59:46

files instead?

59:50

So the question is are we going to move

59:52

away from calling models directly and

59:54

just call something like a headless Claude

59:57

Code, right?

59:58

>> Yeah. Like, I have a

60:01

pipeline that does one LLM call per

60:03

document and summarizes it at the end. You

60:06

could make a while-loop Claude Code that

60:08

saves a file every time. You never call

60:11

the API besides

60:13

using Claude Code in a while loop.

60:16

>> Potentially. I'll give you the pro

60:19

and the con there.

60:20

>> Yeah,

60:21

>> The pro is it's easier to develop, and we

60:24

can kind of rely on the frontier. I

60:26

mean, if you think about it, a reasoning

60:28

model is just that. The reasoning models

60:31

didn't always exist. We just had normal

60:32

LLM models, and then, oh, now we have o1 and

60:34

reasoning models. All that is, I

60:37

mean, it's a little more complicated

60:38

than this, but it's basically just a

60:39

while loop on OpenAI servers that keeps

60:42

running the context and then eventually

60:43

gives you the output. In the same way,

60:45

the Claude Code SDK is a while loop with

60:48

a bunch more things.
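
That "basically just a while loop" point, as a minimal sketch. `llm` is a stand-in chat call that returns either a tool request or a final answer; real SDKs layer context management, permissions, and retries on top of this.

```python
def agent(task: str, llm, tools: dict) -> str:
    """The simplest possible agent loop: call model, run tool, repeat."""
    messages = [{"role": "user", "content": task}]
    while True:
        reply = llm(messages)                           # one model step
        if reply.get("tool") is None:
            return reply["content"]                     # no tool requested: done
        result = tools[reply["tool"]](**reply["args"])  # execute the tool
        messages.append({"role": "assistant", "content": str(reply)})
        messages.append({"role": "tool", "content": str(result)})
```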

60:51

So I could totally see a lot of builders only

60:54

touching these agentic endpoints. Maybe

60:57

even seeing a model provider release a

60:59

model as an agentic endpoint. But for a

61:02

lot of tasks, you're going to want a

61:04

little bit more control, and

61:07

probably you'd still want to go as

61:10

close to the metal as possible. Having said

61:12

that, there were a lot of people

61:14

who still wanted completion models, and

61:17

that never happened, and nobody really

61:19

talks about that anymore. So, it's very

61:21

likely that everything just becomes this

61:23

SDK. I don't have a crystal ball,

61:24

but that's how I would

61:26

think about it.

61:29

>> Yes,

61:30

>> Thanks for the talk. I know you said

61:32

the simpler the better, but what are

61:35

your thoughts about test-driven

61:37

development and spec-driven development in

61:39

AI? Have you tried it? What do you think?

61:42

>> For building agents, or for getting work

61:44

done?

61:45

>> For coding.

61:46

>> Okay. So the question is on spec-driven

61:49

development, test-driven development for

61:51

coding with agents.

61:53

[cough and laughter]

61:54

When in doubt, go back to

61:58

good engineering practices is what I

62:00

would say. There are

62:04

whole engineering

62:06

debates on whether test-driven development is

62:08

the right way and some people swear by

62:10

it and some people don't. So I don't

62:12

think there's one answer. For coding

62:14

agents, clearly, test-driven development

62:16

makes it easier. As I was

62:18

showing you, that's AMP's (Sourcegraph's)

62:20

whole philosophy: if you can build

62:22

good tests, and I think Factory thinks

62:24

this as well. If you could build good

62:26

tests, your coding agent can work much

62:29

better. So it makes sense to me. When I'm

62:31

working personally, I rely pretty heavily

62:34

on the planning phase and the spec-driven

62:36

development phase, and I think the

62:38

simpler tasks are pretty easy for the

62:40

model, but if I'm doing a very simple

62:42

edit, I'll skip that step. So no

62:44

one-size-fits-all, but return to the

62:47

engineering principles that you believe

62:49

when in doubt, I'd say yes.

62:53

>> So earlier you talked about system

62:57

prompt leaks. Is it possible to just look

63:00

at the

63:02

downloaded bundle, or do they have a special

63:05

endpoint that keeps the prompts hidden

63:07

behind an endpoint?

63:08

>> Yeah. I think they hide it. I

63:11

think they hide it. There

63:13

was actually an interesting article:

63:14

someone,

63:16

because Codex is open source,

63:19

before OpenAI released the Codex model

63:22

it was using, was able to hack

63:24

together the open-source Codex to give

63:27

a custom prompt to the model and be able

63:29

to use the model without it. So yeah, you

63:31

can dive into it, but generally it's

63:34

meant to be hidden. Also, out of laziness:

63:36

someone posted it. So there you go,

63:38

that saves you the work, but someone had to have

63:40

found it, right? Like, is this prompt

63:43

somewhere on your machine?

63:46

>> I actually don't know that answer.

63:47

[laughter]

63:49

>> Do you know that answer?

63:50

>> Yeah.

63:50

>> Yes.

63:52

>> It's on your machine. Nico says it's on

63:54

your machine. So there we go. So maybe

63:56

the prompt I was looking at is a little

63:57

bit old and I have to update it. But

64:00

the question was: is the

64:04

prompt hidden on their servers or can

64:07

you find it if you are so determined?

64:09

And the answer seems to be yes. Any

64:13

other questions?

64:15

>> Yes.

64:16

>> Is this the last one?

64:18

>> Is this the last question?

64:19

>> It can be.

64:21

>> Can you talk about prompt layer and how

64:23

can people help you?

64:24

>> Yes, that's a good one. I forgot about

64:26

that. Thank you.

64:29

So yeah, number one: we're hiring.

64:34

so if you're looking for coding jobs at

64:38

a very fun and fast-moving team in New

64:41

York, you can reach out to me on X or

64:43

email jared@promptlayer.com.

64:45

We're based in New York. We are,

64:48

yeah, a platform for

64:51

building and testing AI products for

64:52

prompt management, auditability,

64:54

governance, all that fun stuff, but also

64:56

logging and evals. And those screenshots

64:58

I showed you came from PromptLayer. If

65:00

you're building an AI application and

65:02

you're building it with a team, you

65:03

should probably try PromptLayer. It'll

65:05

make your life easier. Especially the

65:07

bigger your team is, the more you want

65:08

to collaborate

65:10

with PMs and non-technical

65:12

users. Or if you're just technical

65:14

users, it's a great tool. It'll make

65:15

your life better. Highly recommend it.

65:17

promptlayer.com, and it's easy to do.

65:20

And that was my show.

65:22

Thank you for listening. [applause]

65:26

[music]

