Hard Won Lessons from Building Effective AI Coding Agents

Hard Won Lessons from Building Effective AI Coding Agents – Nik Pash, Cline

Transcript

0:13

[music]

0:21

Wow, it's wild to be on the same stage

0:24

as so many people I've drawn inspiration

0:25

from. Let's dive into it. My name is

0:28

Nik. I'm the head of AI at Cline and

0:30

today I'm going to share some lessons we

0:32

learned along the way.

0:34

So let's start with the bitter truth.

0:37

For years we compensated for weak models

0:40

by building clever scaffolds around

0:42

them. All kinds of clever ideas like RAG

0:45

indexing systems, search trees, tool

0:48

calling scaffolds, all this was invented

0:50

to cope with weaker models. And frontier

0:54

models simply bulldoze those

0:55

abstractions. Now, you don't really need

0:58

your scaffolding anymore. Your

0:59

scaffolding just gets in the way of

1:01

these models. And the question really

1:03

isn't how fancy is your agent stack.

1:07

Increasingly, it's how strong is the

1:08

model driving it.

1:11

And the lesson here is relentless. Um, a

1:14

perfect example of what I'm talking

1:15

about is Gemini 3.0 released this week

1:19

and it immediately dominated

1:19

Terminal-Bench leaderboards with no agentic

1:24

harness supporting it at all. In this

1:26

chart, you can see Gemini 3.0 on

1:28

Terminus scored better than the vast

1:30

majority of model agent combinations in

1:32

the world all out of the box. And what's

1:35

remarkable is that Terminus is designed

1:37

to be an unopinionated, generic,

1:40

stripped-down harness. And it has no graph

1:42

search, no RAG, no indexing, just: here's

1:45

a terminal, go figure it out. And it

1:48

crushes. The whole point of Terminus is

1:50

that it has no clever tool calling, no

1:53

context engineering features. So the

1:55

takeaway here is that capability beats

1:58

scaffolding. If you get out of the

1:59

model's way, it will perform just fine.

2:03

So really what I'm driving at and the

2:06

key takeaway from this whole talk is if

2:08

you're building agents, just relax. Cool

2:12

it with all your clever engineering

2:14

tricks. Stop overthinking it. That's it.

2:17

That's the lesson. And another point on

2:21

this, kind of like an aside, is I don't

2:23

know about you guys, but we're all on

2:26

Twitter. I'm on Twitter, and at this

2:29

point, I just think talking about these

2:31

like clever little context tricks and

2:34

hacks is a little played out. Like,

2:37

at this point, I'm straight up tired of

2:39

seeing some of this stuff. And like, I

2:42

get it. It's free engagement and we all,

2:44

you know, indulge in it a little bit.

2:45

But personally, I think there's not

2:48

really much signal there.

2:50

So, if you want the full playbook for

2:53

building an effective coding agent, like

2:56

the playbook's right here. It's up on

2:58

the screen. Um, there was really some

3:00

novelty talking about it like months

3:02

ago, but at this point, in my opinion,

3:04

it's been done to death. And we've been

3:06

in this, you know, we're model agnostic

3:07

at Cline. We support all the models.

3:09

Every two weeks there's a new big model

3:12

release going out and we've basically

3:15

come down to the same playbook of

3:16

supporting each model as it comes out.

3:19

So I'm sure everyone here knows how to

3:21

tune an agent from Sonnet 4 to Sonnet

3:24

4.5, from Gemini 2.5 to Gemini 3, and

3:30

GPT-5 to GPT-5.1. I feel like this entire

3:34

conversation is a little played out. So,

3:36

I'm not really even going to cover this

3:37

in depth because the tweaks here are

3:40

trivial and the gains are marginal.

3:43

So, what I really want to talk about is

3:46

something that's not actually given a

3:48

lot of attention and it's the real

3:50

bottleneck. And the real bottleneck is

3:52

that you can build the cleanest agent in

3:54

the world, but that doesn't improve

3:56

model capability by even 1%. Models only

4:00

get better when labs train on something

4:03

hard. And benchmarks, not agent

4:07

cleverness, not all your clever

4:08

engineering techniques, not your clever

4:10

RAG pipelines. It's benchmarks that

4:12

determine what frontier models learn to

4:15

do next. And models didn't magically get

4:18

better at tool use.

4:21

They got better because people built RL

4:23

environments that forced them to

4:25

practice certain actions, for example

4:28

handling failure modes and

4:30

retrying. Agents

4:33

improve only when the model learns

4:35

inside the right environment. Every jump

4:37

in reasoning we've seen came from a

4:39

benchmark. Every jump in agent

4:41

reliability came from an RL environment.

4:44

So the real questions become: what is a

4:47

benchmark? How do you turn real-world

4:51

agent coding data into an RL environment?

4:54

And what makes a good verifier? How do

4:56

you detect [clears throat] real

4:57

difficulty? And how do you train

4:58

these models to work on the problems

5:00

that we actually care about as

5:01

engineers? These are the questions that

5:04

matter for the next frontier.

5:07

So what is a benchmark?

5:09

A benchmark, put simply, is an

5:11

environment. In our case, it's

5:14

like a docker container where you let

5:15

the agent run wild. It's a starting

5:18

state which is the snapshot of the code

5:20

when you started working on a real world

5:23

coding task as well as a starting

5:25

prompt. And the last thing is a verifier

5:28

at the end that checks whether an end

5:30

state is correct or acceptable.
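
The four pieces named above (environment, starting state, starting prompt, verifier) can be sketched as a small data structure. This is only an illustration, not Cline's actual schema: the class name, field names, image name, commit hash, and the shape of the end state are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical representation: the end state is a set of observations
# collected from the container after the agent finishes, keyed by name.
EndState = Dict[str, str]


@dataclass(frozen=True)
class BenchmarkTask:
    container_image: str                   # environment the agent runs in
    start_commit: str                      # starting state: snapshot of the code
    start_prompt: str                      # what the user originally asked for
    verifier: Callable[[EndState], bool]   # is this end state acceptable?


def score(task: BenchmarkTask, end_state: EndState) -> float:
    """Return 1.0 if the verifier accepts the end state, else 0.0."""
    return 1.0 if task.verifier(end_state) else 0.0


# Illustrative task; the image name, commit, and observation key are made up.
example_task = BenchmarkTask(
    container_image="example/task-env:latest",
    start_commit="abc1234",
    start_prompt="Pagination shows one page too many for 30 items at 10 per page.",
    verifier=lambda state: state.get("page_count_for_30_items") == "3",
)
```

Calling `score(example_task, {"page_count_for_30_items": "3"})` accepts the end state, while any other observed page count is rejected.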

5:33

So how are RL environments different?

5:36

[clears throat]

5:37

Well, here's the thing. They're not

5:38

really different at all. And you might

5:40

notice this chart is basically the same

5:42

thing as the previous slide. The only

5:44

real difference, the only distinction

5:46

here is how the reward is used.

5:49

Benchmarks measure models. RL

5:52

environments improve models. The score

5:55

doesn't just stop in a leaderboard where

5:57

you publish the results. The score is

5:59

actually used to update the weights of

6:01

the policy model.
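
A toy sketch of that distinction: the same rollout and the same reward, used in two ways. The talk does not say which training algorithm is used, so the REINFORCE-style update below is a stand-in, and the two-action "policy" and its reward rule are purely hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rollout(logits, rng):
    """Sample one action and score it (1.0 = verifier passes, 0.0 = fails)."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = 1.0 if action == 1 else 0.0  # hypothetical: only action 1 passes
    return action, reward, probs


def evaluate(logits, episodes, rng):
    """Benchmark use: the reward is only averaged and reported."""
    return sum(rollout(logits, rng)[1] for _ in range(episodes)) / episodes


def train(logits, episodes, rng, lr=0.5):
    """RL use: the same reward drives a REINFORCE-style weight update."""
    logits = list(logits)  # copy so the caller's weights are left untouched
    for _ in range(episodes):
        action, reward, probs = rollout(logits, rng)
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob
    return logits
```

`evaluate` leaves the weights alone and just returns a pass rate; `train` feeds each reward back into the logits, so the probability of the passing action rises over time.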

6:03

So, how do you transform real world

6:06

coding data into useful RL environments

6:10

for training?

6:12

At Cline, we created a system called

6:15

an RL environments factory. Looking for

6:18

a better name there, but that's what we

6:20

got so far. And the first phase in this

6:23

pipeline is you get sub agents and you

6:26

have them qualify tasks. And these sub

6:29

agents, they work in parallel to decide

6:31

whether or not given tasks are suitable

6:33

to be turned into RL environments for

6:35

the purpose of training.

6:37

And the qualification process goes as

6:40

follows. So you start with

6:41

origins. So you have to validate: does

6:43

the repository actually exist. Is the

6:46

starting commit accessible? Is it open

6:48

source? Then the journey, where you look at

6:52

the starting prompt, the other follow-on

6:54

prompts that the user might have

6:57

sent to the agent afterward. You

6:59

have to try to understand what was the

7:01

user actually trying to accomplish, what

7:02

was the spirit of their task. And

7:05

lastly, it's the outcome. So, can we

7:08

find the actual commits or PRs that fix

7:11

the problem in real life? Like, did they

7:13

actually commit the solution to their

7:15

problem later on in the timeline? And

7:18

we're actively looking for easy

7:20

disqualifiers as part of this. So,

7:22

things like vibe-coded slop. We don't

7:24

need another benchmark that tests for,

7:26

you know, "build the Next.js app, uh, from

7:28

scratch." We're looking to

7:31

disqualify trivial tasks that are too

7:33

easy and tasks that have no reliable

7:36

start or end states.
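
The qualification checks described above (origins, journey, outcome, plus the easy disqualifiers) might be organized roughly like this. In the talk these judgments are made by subagents; here they are reduced to precomputed booleans so the control flow is visible. Every field and function name is hypothetical, and the thread pool only stands in for "working in parallel".

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CandidateTask:
    task_id: str
    repo_exists: bool                 # origins
    start_commit_accessible: bool     # origins
    is_open_source: bool              # origins
    intent_is_clear: bool             # journey: what was the user trying to do?
    fix_commit_found: bool            # outcome: was a real fix committed later?
    is_trivial: bool                  # easy disqualifier
    has_reliable_start_and_end: bool  # easy disqualifier


def disqualifiers(task: CandidateTask) -> List[str]:
    """Return the reasons a task is unsuitable; an empty list means it qualifies."""
    checks = [
        (task.repo_exists, "repository does not exist"),
        (task.start_commit_accessible, "start commit not accessible"),
        (task.is_open_source, "not open source"),
        (task.intent_is_clear, "user intent unclear"),
        (task.fix_commit_found, "no fix commit or PR found"),
        (not task.is_trivial, "task is trivial"),
        (task.has_reliable_start_and_end, "no reliable start or end state"),
    ]
    return [reason for ok, reason in checks if not ok]


def qualify_all(tasks: List[CandidateTask]) -> Dict[str, List[str]]:
    """Check every candidate in parallel and map task_id -> disqualifiers."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(disqualifiers, tasks))
    return {task.task_id: reasons for task, reasons in zip(tasks, results)}
```

Returning the list of reasons, rather than a bare yes/no, keeps a record of why each candidate was dropped.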

7:38

And lastly, what makes an RL

7:40

environment good? How do we actually

7:42

make an RL environment and what makes a

7:44

good test or verifier?

7:47

So phase two of this pipeline is

7:49

building the actual RL environment. So

7:52

you start out with archaeology where you

7:54

actually reconstruct both states

7:56

locally. You pull down the code, you see

7:58

if you can implement it yourself,

8:01

reconstruct it, build it, and verify

8:03

that the bug that the user was

8:05

referencing and the solution actually

8:07

exists. You document every obstacle and

8:10

dependency. You containerize it with

8:12

Docker, removing Git obviously, so

8:15

agents can't reward hack. And lastly,

8:17

you define the verifier at the end. And

8:19

this is where it gets into like a little

8:21

bit of the art of building a good

8:24

verifier. And I want to talk about this

8:26

because the analogy that I typically

8:28

give is a tea kettle. So let's say the

8:33

user's goal is I want to boil water.

8:36

A really good example of a verifier to

8:38

test whether or not the water is boiling

8:41

is a little whistle attachment that goes

8:43

inside your tea kettle. And the whistle

8:46

is a pure outcome verification. And it's

8:48

an example of a pure outcome driven

8:51

verifier where the water either reached

8:54

the boiling point or it didn't. Either

8:56

it's whistling or it's not. The kettle

8:58

doesn't care how you achieved it,

9:00

whether you used a gas stove, an

9:02

electric induction stove, or a campfire.

9:04

It just signals the result. And in the

9:07

process of doing this, all these weird

9:09

bad tests can emerge. So the

9:12

subagent might

9:14

have noticed, like, oh, in the ground-truth

9:16

solution, like in a previous run, the

9:18

burner was set to high, so maybe we

9:20

should be checking for that. But we all

9:22

know that water can boil at a low

9:24

setting on the burner. Or: was it on the

9:27

front left burner? Have 5 minutes elapsed?

9:29

Like, all kinds of weird bad tests. And

9:30

the key point here is: don't

9:34

overprescribe based on the ground truth.

9:36

Test for the spirit of the task. Test for

9:39

the outcome of the task.
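
The kettle analogy translates directly into two contrasting checks. This is a toy sketch: the state fields and the specific values copied from the "ground truth" run are invented for illustration, and the boiling point assumes sea-level pressure.

```python
from dataclasses import dataclass

BOILING_POINT_C = 100.0  # assumption: sea-level pressure


@dataclass(frozen=True)
class KettleState:
    water_temp_c: float
    burner_setting: str
    burner_position: str
    minutes_elapsed: float


def outcome_verifier(state: KettleState) -> bool:
    """The whistle: only asks whether the water reached boiling."""
    return state.water_temp_c >= BOILING_POINT_C


def overprescribed_verifier(state: KettleState) -> bool:
    """A bad test: checks details copied from one ground-truth run."""
    return (
        state.burner_setting == "high"
        and state.burner_position == "front-left"
        and state.minutes_elapsed >= 5
    )


# Boiled on a low setting, on a different burner, over a longer time.
slow_but_boiled = KettleState(100.0, "low", "back-right", 12.0)

# Matches every incidental detail of the reference run, but never boiled.
looks_right_but_cold = KettleState(60.0, "high", "front-left", 5.0)
```

The outcome verifier accepts `slow_but_boiled` and rejects `looks_right_but_cold`; the overprescribed one gets both cases wrong, which is the failure mode the talk warns about.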

9:41

And the outcome at the end of all this

9:43

is a containerized benchmark or

9:46

environment for that task. Agent work is

9:49

recorded so you can see the traces the

9:51

trajectory that the agent took to

9:52

complete the task and you can reliably

9:55

score it and verify it and it's fully

9:57

portable. You can run it on any device.

10:01

So the path to automation that we've

10:04

been undertaking as part of this is can

10:07

we fully automate the process of

10:09

converting real world coding data into

10:13

RL environments for the purpose of

10:14

training models.

10:16

And this work largely started out manual.

10:19

Building the first RL

10:21

environment took about 16 hours of

10:23

my time. And what used to take 16 hours

10:25

now takes less than 20 minutes per task.

10:29

And we're building towards a fully

10:31

automated RL environment factory where

10:32

the bottleneck shifts from engineering

10:35

to collecting high quality tasks. And an

10:38

interesting kind of point here, the

10:40

natural endpoint of all this is what if

10:44

we actually built RL environments and

10:46

this is like a question for everyone in

10:47

the audience is what if we built RL

10:49

environments to test how well agents can

10:51

actually make RL environments kind of

10:53

like a meta benchmark. What would hill

10:55

climbing on that look like? And you can

10:58

kind of start imagining that as models

11:00

get really really good at making their

11:02

own RL environments to train on based on

11:04

real world user data, you kind of

11:06

complete that loop. Something to think

11:08

about. So, okay. Um, this next part is

11:13

the truth nuke. Um, also known as TRO.

11:17

Um,

11:19

an unspoken fact is that we're not alone

11:23

at Cline building this kind of system.

11:26

Every major agent lab captures this

11:29

data. They all do some version of this

11:31

behind the scenes, but no one really

11:32

talks about it. And I don't even need to

11:36

name them. If you know, you know. And

11:37

realistically, you all know. These same

11:40

companies cite internal benchmarks to

11:43

justify legacy systems that they spent

11:46

months maintaining. But curiously,

11:48

you'll never be able to study or inspect

11:49

them because they don't publish them

11:51

openly. And this data is so valuable yet

11:55

no one shares it. It's the only thing

11:57

that actually moves the needle.

12:00

And here's the heart of my argument:

12:03

by standing between real world engineers

12:06

working on real world tasks and the

12:08

models, agent labs have a unique role in

12:10

history. We can build better prompts. We

12:12

can build better tools. But none of that

12:14

improves the underlying models. We

12:16

possess the single richest data set of

12:19

real engineering work anywhere in the

12:22

world. Models don't improve without this

12:25

data and keeping them closed is slowing

12:27

down frontier research.

12:29

So today we're announcing cline-bench.

12:32

This is our attempt to finally create a

12:33

benchmark that isn't cosplay

12:35

engineering. It's not write me a server

12:38

that generates Fibonacci sequences. This

12:40

is real software development captured

12:43

and packaged into standardized RL and

12:44

eval environments. And this is

12:47

the benchmark that we always wanted

12:49

someone else to build. No one did. So

12:51

we're doing it and anyone can

12:54

participate. So here's how it works. The

12:58

whole thing is open source. There's no

12:59

secret sauce, no locked away data sets.

13:03

You can openly run it yourself and

13:05

inspect it to see how it works. Anyone

13:07

can use these environments for SFT, RL,

13:10

eval, whatever. The point is to just

13:12

give the entire ecosystem a real

13:14

substrate to measure and improve models

13:16

on, not just LeetCode puzzles. And this

13:20

only works if the community contributes.

13:22

And the good news is you don't actually

13:24

need to do anything special. Just work

13:26

on your open source project with the

13:28

Cline provider turned on and opt into

13:30

the cline-bench initiative. If a

13:32

frontier model gets stuck and you step

13:34

in to fix it, that's actually an ideal

13:37

task to be a candidate for a

13:40

benchmark and that's it. Just use the

13:43

Cline provider, see where the model

13:45

struggles and we'll pick it up and

13:47

introduce it into this open-source

13:49

benchmark. So, cline-bench will always

13:52

remain free, fully open source and

13:55

freely accessible.

13:57

Thank you all. If you want to

13:58

contribute,

14:01

[music]

14:13

>> [music]

Interactive Summary

In this talk, Nik Pash from Cline discusses why building complex 'scaffolding' or agentic stacks to compensate for weak AI models is no longer the best approach. He argues that recent frontier models have become so powerful that they perform best when not hindered by over-engineered hacks. Instead of focusing on agent wrappers, the real progress in AI capability comes from training models on challenging, real-world tasks using reinforcement learning (RL) environments. Nik highlights the need for open, standardized benchmarks built from authentic engineering data to drive future model improvements. To this end, he announces 'cline-bench,' an open-source project that turns real-world coding tasks into verifiable RL environments for the entire community to use.