Building pi in a World of Slop

Building pi in a World of Slop — Mario Zechner

Watch on YouTube

Now Playing

Building pi in a World of Slop — Mario Zechner

Transcript

490 segments

0:14

Hey there, I'm Mario. I built pie in a

0:17

world of slop and this is a strategy, a

0:19

tragedy in three acts. Just to talk

0:22

about this real quick, bunch of people

0:23

on the internet gave me money for ad

0:25

space on my torso and all of that goes

0:26

to a charity. So yeah, thanks guys.

0:29

So act one building pi in the beginning

0:32

there was cloud code and was good right

0:34

we all got basically catnipped by that

0:37

thing and stopped sleeping um bunch of

0:41

stuff before that but code cloud code

0:42

was the one thing that kind of clicked

0:44

with me the most and to preface all of

0:46

this I love the cloud cloud team they're

0:48

are brilliant people talented super high

0:50

velocity so uh they also created the

0:53

entire game major props to them so this

0:56

is not a roast this is just me an old

0:58

man telling you why I stopped using

0:59

cloud code and built my own thing. Um in

1:03

2025 I started using cloud code in about

1:05

April I think thanks to Peter uh because

1:08

he told us the agents are working now

1:12

and back then it was simple and

1:13

predictable and fit my workflow but

1:15

eventually

1:17

the token madness got hold of them I

1:19

think and the team got bigger and they

1:20

started uh dog fooding that stuff and

1:23

build a lot of features a lot of

1:24

features I don't need which is fine I

1:26

can just ignore them but with velocity

1:28

and more features come more bucks and

1:30

that's mad because I used to work at

1:33

construction sites and if my hammer

1:35

breaks every day I'm getting really mad

1:36

and if my development tools break every

1:38

day I'm also getting mad. So there was

1:41

this it's just a running gag and here's

1:43

tar telling us that cloud code is now a

1:44

game engine and here's Mitchell from

1:46

Ghosty telling us no it's not and

1:48

eventually they fixed the flicker but

1:50

then other stuff broke and I think

1:51

they're now in the third iteration of a

1:54

2y renderer. Yeah but that's just a

1:56

symptom. The real problem is that my

1:58

context wasn't my context. Cloud code is

2:01

the thing that controls my context. And

2:03

behind my back, cloud code does things

2:05

uh to the context. So you have the

2:08

system prompt which changes on every

2:09

release, including the tool definitions.

2:11

They would remove tools, modify tools.

2:14

It's not good. They would insert system

2:17

reminders in the most oppoune place in

2:20

your context, telling the model, here's

2:21

some information. It may or may not be

2:24

relevant to what you're doing. That it

2:26

actually says it may or may not be

2:27

relevant what you're doing. And that

2:29

kind of confused the model and that kind

2:30

of broke my workflows.

2:34

On top of all that, there's zero

2:35

observability because that's how the

2:36

tool is constructed and I like knowing

2:39

what my agents are doing. There's zero

2:41

model choice which is obvious. It's the

2:42

native entropic uh harness. So it makes

2:44

sense for them to want you to use cloud,

2:46

right? And there's almost zero

2:48

extensibility and some of you might have

2:50

written some hooks for cloud code, but

2:51

I'm telling you the number of hooks and

2:54

the depth of those hooks is very

2:55

shallow. Um, and every time a hook

2:58

triggers, what actually happens is a new

3:00

process gets spawned. Basically, the

3:01

command you specified for the hook to be

3:03

executed. And I don't find that

3:05

specifically efficient. So, I uh took a

3:08

step back and looked around for

3:09

alternatives. And I'd like to especially

3:11

call out AMP and factory droid, the

3:14

Porsche and Lamborghini of coding agent

3:16

harnesses. So, if you can afford them,

3:17

please use them. They're at the

3:18

frontier. They're really good, and the

3:20

teams are fantastic. And there's a bunch

3:22

of other options. And I have history in

3:23

OSS. So naturally I kind of gravitated

3:26

towards open code and again brilliant

3:28

team super high execution velocity and

3:31

they don't sell you hype they sell you

3:33

tools that work for the most part. I

3:36

started looking under the hood of open

3:38

code uh with respect to context handling

3:40

as well because that's the most

3:41

important part for me and I found a

3:43

bunch of things like given some

3:45

conditions open code would just uh prune

3:49

tool output after a specific minimum

3:52

amount of tokens and that basically

3:54

lobomizes the model. Uh there's also LSP

3:57

server support which means every time

3:59

your model is calling the edit tool open

4:02

code goes to the LSP server that's

4:03

connected asks are there any errors and

4:06

if so injects that as part of the edit

4:08

tool uh result which is bad because

4:11

think about how you add editing code

4:13

you're not writing a line of code

4:15

checking the errors writing the next

4:16

line checking the errors you don't do

4:18

that you finish your work and then you

4:20

check the errors this confuses the model

4:23

there's a bunch of other things like

4:24

storing individual messages of a session

4:26

in a JSON file. Each me message is a

4:29

JSON file on disk. Uh there was this and

4:31

this happens to all of us. No, no claim

4:33

there. But it's not great if by default

4:36

a server spins up, course headers are

4:38

set in such a way that any website you

4:39

open in your browser can now access your

4:41

open code server. That's yeah, and

4:44

entirely unrelated to all of this, I

4:46

started looking into benchmarks for

4:47

coding agent harnesses and found

4:49

terminal bench um which is a pretty good

4:52

benchmark all things considered. And the

4:54

funny part about it is that it's the

4:56

most minimal kind of thing you can think

4:58

of. All it gives the model is a tool to

5:01

send keystrokes to to a T-Max session

5:03

and read the output of that T-Max

5:05

session. There's no file tools, no sub

5:07

agents, none of that stuff. And it's one

5:11

of the best performing harnesses in the

5:12

leaderboard. Here's the leaderboard from

5:14

December 2025. irrespective of model

5:17

family terminal scores higher mostly

5:20

high even higher than the native harness

5:22

of that model. So what does that tell

5:25

us? A form two thesis is we are in the

5:28

around and find out phase of coding

5:30

agents and their current form is not

5:31

their final form right. So second thesis

5:35

is we need better ways to around

5:37

and for me that means self modifying

5:40

malleable agents things that the agent

5:42

itself can modify and I can modify

5:45

depending on my workflow. So I stripped

5:47

away all the things built a minimal core

5:49

but made it super extensible and made it

5:52

so that the agent can modify itself

5:55

with some creature comforts. It's not

5:56

entirely bare bones. Uh so that's PI.

5:59

It's an agent that adapts to your

6:00

workflow instead of the other way

6:01

around. It comes with four packages. Uh

6:04

an AI package that's basically just an

6:06

abstraction across providers and context

6:08

handoff between providers. An agent core

6:11

uh which is just a while loop and the

6:12

tool calling. A bespoke toy framework. I

6:15

come out of game development. So I built

6:17

a thing that actually doesn't flicker

6:18

too much. And the coding agent itself.

6:21

Here's Pi's system prompt.

6:23

That's it. Eventually the industry

6:26

created a new standard called skills

6:28

which is basically just markdown files.

6:30

So we added that as well. and that needs

6:31

to go in a system prompt. So, be

6:33

crouchingly, we had to add a couple more

6:35

lines. And finally, here's the magic

6:37

that makes Pi able to modify itself. We

6:40

ship the documentation which was

6:42

handcrafted by me and an agent. Um, and

6:45

code examples of extensions,

6:48

and all we need to do for the agent to

6:50

modify itself is tell it, here's the

6:52

documentation. Here's some code that

6:54

shows you how to modify yourself by

6:55

writing extensions.

6:57

It comes with four tools. That's all it

6:59

has. Retrate, edit, mesh. Here's the

7:01

tool definitions. Don't read the the

7:02

text. Just look at the size.

7:05

That's it. Here's what happens when you

7:08

start a new session in one of these

7:09

tools.

7:11

So the thing is the models are actually

7:13

reinforcement trained up to a wazoo. So

7:15

they know what a coding agent is because

7:17

a coding agent harness is basically what

7:19

they're being trained when they are

7:20

post-trained. You don't need 10,000

7:22

tokens to tell them you're a coding

7:24

agent. They know because they are coding

7:26

agents. No, PI is also YOL by default

7:29

because my security needs are different

7:30

than yours. And I don't think a little

7:32

dialogue that pops up every now every

7:35

time you call bash asking you to approve

7:38

is a smart security uh uh mechanism. So

7:41

instead I give you so much rope that you

7:44

can build anything that's fit for your

7:46

specific security needs. There's also

7:49

stuff that's not built in. I'm a he

7:53

because this is how I do it. But if you

7:56

don't like that then you just ask Pi to

7:57

build you sub agent support or plan mode

8:00

or MCP support whatever you need.

8:02

Extensibility comes with a bunch of

8:04

table stakes and then with the

8:06

extensions itself and extensions imply

8:08

are just TypeScript modules. In the

8:10

simplest case a TypeScript file on disk.

8:12

You point PI at that. Here's an

8:14

extension loaded as part of the harness.

8:16

And with that you get a basically an

8:19

extension API that lets you hook into

8:21

everything and define stuff for the

8:23

harness to expose to the to the model.

8:25

And that includes tools uh slashcomand

8:28

shortcuts. You can listen in on any kind

8:29

of event and react and then save state

8:32

in the session that's optionally

8:36

provided to the agent as well or stored

8:38

there for tools that analyze sessions as

8:41

part of your organizational workflows.

8:43

You can do custom compaction, custom

8:45

providers and you have full control over

8:46

the tool. So you can modify everything

8:48

in PI and you can then bundle all of

8:50

that up and put it on mpm or on GitHub

8:53

because I think we don't need to

8:55

reinvent another bunch of silos called

8:58

marketplaces. We already have package

9:00

manage managers and all of that hot

9:03

reloads. So if you develop an extension

9:06

for pi, you do so in the session and you

9:09

hot reload the changes and see the the

9:12

effects of that immediately which is

9:14

very great and it's also game

9:15

development thing is in game development

9:17

you want high very low iteration uh

9:20

speeds and that's great. So a couple of

9:23

examples cloud or anthropic ships the

9:25

slash by the way which lets you talk to

9:26

the agent why goes on its main quest. I

9:29

posted this little prompt on Twitter

9:31

jokingly and somebody build it in five

9:33

minutes with more features and they

9:35

didn't have to fork a clone pie. They

9:37

just let the agent write the extension

9:40

based on the prompt. Here's Nico. He's

9:42

one of the most prolific uh extension

9:44

writers. I don't know what the is

9:46

going on here. It's a chat room for all

9:48

of his Pi agents and they talk with each

9:49

other. I would never use this, but all

9:51

of this is custom including the UI. or

9:53

you can play NES games or you can play

9:56

Doom.

9:58

And there's a bunch of other examples

9:59

I'm not going to talk about. So, how do

10:01

you build a PI extension? You don't. You

10:03

tell Pi to build it for you based on

10:05

your specifications. And then you just

10:06

iterate with it on that and hot reload

10:08

during the session. I'm going to skip

10:10

that example as well. And if you don't

10:12

like building things yourself, and I

10:14

hope you do like building things

10:15

yourself, but if you don't, you can look

10:17

on MPM or our little search uh interface

10:20

on top of MPM to find packages for sub

10:23

agents, MCP, and so on. So, does it

10:25

actually work? Well, here's the terminal

10:27

bench leaderboard from October before Pi

10:29

had compaction. I added that for Peter's

10:31

claw thingy. It scored sixth place.

10:35

Uh, but none of this is actually about

10:36

Pi. If you want to retake, I I basically

10:39

want you to retake control of your tools

10:40

and workflows. So build your own. Um and

10:43

if you want to know more about pi and

10:44

openclaw, go to this talk please. Yeah.

10:46

And then eventually Peter happened. He

10:48

put pi inside of open claw as its aentic

10:51

core which meant my open source project

10:53

became the target of a lot of openclaw

10:55

instances unbeknownst to their users. So

10:57

this is act 2 oss in the age of

10:59

clankers. Clankers are destroying oss.

11:01

Here's tal draw. They closed down the

11:03

issue on pull request tracker. Here's

11:05

open clause uh trackers. Here's mine.

11:08

Half of that is open source instances

11:10

who post garbage. So I started to rage

11:12

against the clankers.

11:14

Um if you send a pull request, it gets

11:16

autoclosed with a comment that asks you

11:18

to please write a nice issue in your

11:21

human voice, no longer than a screen

11:22

worth of text. And if I see that I write

11:25

looks good to me and your account name

11:26

gets put in a file in the repository and

11:28

the next time you send a pull request,

11:30

it's let through. Clankers don't read

11:33

that comment. They don't go back once

11:34

they posted a pull request. So that's a

11:36

perfect filter. Uh Mitchell eventually

11:38

turned it into vouch. Here's a clanker.

11:40

Uh I also labeled them. If you had

11:42

interactions with openclaw, your issues

11:44

get dep prioritized. I also built tools

11:47

where I embed uh issues and pull request

11:49

texts into 3D space. So I see clusters

11:52

of issues. Uh I also invented OSS

11:54

vacation. I just close the tracker

11:56

whenever I want. So I have my life back.

11:58

So does this work? Yes, sort of.

12:02

Which leads me to act three. Slow the

12:04

down. Everything's broken.

12:09

And then there's people that say, "Our

12:10

product's been 100% built by agents."

12:12

Yes, we know it sucks now.

12:14

Congratulations.

12:22

And I'm hearing this from my peers and

12:24

this is entirely unhealthy.

12:27

Um, so here's how we should not work

12:28

with agents and why, at least in my

12:30

opinion. I wrote this on my blog a while

12:32

ago, but the basic is this. We're having

12:34

armory of agents and you're using beats

12:36

on been and you don't know that it's

12:38

basically uninstallable malware and

12:40

entropic build a C compiler that kind of

12:41

works but actually doesn't and we're

12:43

hoping the next generation of models

12:44

will fix it and here is Perso building a

12:46

browser and that's also super

12:48

broken but the next generation will fix

12:50

it and SAS is dead software solved in

12:52

six months and my grandma just built

12:54

herself a Spotify with her open claw

12:56

come on people so agents are actually

13:00

combounding boooos which is my word for

13:01

errors with serial learning and No

13:03

bottlenecks and uh delayed pain. The

13:06

delayed pain is for you. Here's your

13:08

code base on a human on one agent and 10

13:11

agents. How much of the agent code can

13:13

you review? Here's the same codebase but

13:16

expressed in number of boooos per day.

13:19

How much of those boooos do you think

13:21

you'll find? Then you say, "Oh, I have a

13:23

review agent. Let me introduce you to

13:26

the wonderful world of the Oro." Doesn't

13:28

work. It catches some issues. Um the

13:31

problem is that agents and merchants

13:32

have learned complexity. Where did they

13:34

learn that complexity from? From the

13:36

internet. What's on the internet? All

13:37

our old garbage code. There are some

13:39

pearls on the internet, really

13:41

well-designed systems, but 90% of code

13:43

on the internet is our old garbage. And

13:45

that's what the models learn from. And

13:47

every decision of an agent is local,

13:49

especially if the codebase is so big

13:51

that it doesn't fit into its context.

13:52

And if you let it go wild and add

13:55

abstractions everywhere that are

13:57

intertwined. Um, so that leads to lots

14:00

of abstractions and duplication and

14:02

backwards compatibility. Who has seen

14:04

that in the output of their agent? It's

14:06

annoying or defense in depth. So

14:09

yeah, you get enterprise grade

14:11

complexity within two weeks with just

14:13

two humans and 10 agents.

14:15

Congratulations.

14:16

And then you say, but my detailed spec.

14:19

Yes, sure. You know what we call a

14:21

sufficiently detailed spec? It's a

14:23

program.

14:25

So if you leave blanks in your spec,

14:28

what do you think happens? How does the

14:29

model fill in the blanks? And with what

14:31

does it fill that in? It fills it in

14:34

with the garbage that it learned on the

14:35

internet from our old code, which is

14:37

garbage to mediocre. And then you say,

14:39

but humans also, yes, humans are

14:41

horrible, fail failable beings, but they

14:44

can learn and they are bottlenecks.

14:46

There's only so many boooos they can add

14:48

to your code base on a daily basis. And

14:51

humans feel pain, which is a very

14:54

interesting property because humans hate

14:55

pain. And once there's too much pain,

14:57

the human has a bunch of options. It can

15:00

quit their job. It can uh blame somebody

15:04

else and make them fix it or everybody

15:06

bands together and starts refactoring

15:07

the out of the garbage codebase,

15:10

right? Agents will happily keep

15:13

into your codebase.

15:16

And now your agents MD and super complex

15:19

memory systems will not save you. agents

15:21

don't learn the way we learn.

15:24

Those are my most most beloved people. I

15:26

don't even read the code anymore.

15:28

Congratulations. Something is broken and

15:31

your users are screaming. So, who you

15:32

going to call? Not yourself because you

15:35

haven't read the code. So, you're

15:36

relying on your agents, but they are now

15:38

also overwhelmed because the codebase is

15:40

so humongous that there's absolutely

15:42

zero chance they can get all the context

15:44

they need to fix the issues. And long

15:46

context windows are a heck, as most of

15:49

you will find out this year. as

15:50

everybody's switching to 1 million

15:52

tokens context windows and agentic

15:54

search is also failing.

15:57

So the agent patches locally and

15:59

up globally. If you see this in

16:01

your codebase, you're

16:06

So you cannot trust your codebase

16:08

anymore and also not your test because

16:09

your agent wrote your test. So good

16:11

game. So here's how I think we should

16:13

work. Um there's a bunch of properties

16:15

for good agent tasks. That means scope.

16:18

If you can scope it in such a way that

16:20

the agent is guaranteed to find all the

16:22

things it needs to find to do a good

16:23

job, you're done. That means modularize

16:26

your codebase. If you can give it a

16:28

function to evaluate how well it did the

16:30

job, even better. Hill climbing, auto

16:32

research. Uh, anything non-m mission

16:34

critical, let it wipe. Boring stuff, let

16:36

it wipe. Reproduction cases for user

16:39

issues, which are usually only partial

16:40

in information, perfect. I don't spend

16:43

any mornings anymore doing that. Or if

16:44

you don't have a human near you, rubber

16:46

duck. So, lots of tasks you can use them

16:48

for and save time. At the end of that,

16:51

you evaluate. You take what's

16:53

reasonable. Most of it isn't. And then

16:55

finalize. My final slide, more or less,

16:58

slow the down. Think about what

17:00

you're building and why. And don't just

17:02

build because your agent can do it. Now,

17:03

that's stupid. Uh, learn to say no. This

17:07

is your most valuable uh capability at

17:10

the moment. Fewer features, but the ones

17:12

that matter. And then use your agents to

17:14

polish the out of that. Enlighten

17:16

your users, not your uh token maxing

17:20

desires. Get the amount of generated

17:22

code uh that you need to review.

17:26

And non-critical code, sure, wipe slop

17:28

ahead. Critical code, read every

17:30

line. See the keynote after me for more

17:33

info on that. So, how do you know what's

17:35

critical? Any guesses?

17:38

Well, you read the code. Uh, if

17:42

you do anything important, write it by

17:43

hand. You can use a clanker to help you

17:45

with that, but don't let it make the

17:47

decisions for you because we've learned

17:49

all the decisions it makes are learned

17:51

from the internet. And that friction is

17:53

the thing that builds the understanding

17:55

of the system in your head, which is

17:57

important. And it's also where you learn

18:01

new things. And all of this requires

18:03

discipline and agency. And all of this

18:06

still requires humans. Thank you.

Interactive Summary

Ask follow-up questions or revisit key timestamps.

In this talk, Mario discusses his experiences with building and using coding agents, leading to the creation of his own project, Pi. He highlights the pitfalls of relying heavily on automated agents, such as loss of control over context, lack of observability, and the accumulation of 'slop' or technical debt in codebases. Mario advocates for a more disciplined approach to using AI in development—using it for specific, scoped tasks while maintaining human oversight for critical code—and emphasizes that developers should reclaim control over their own tools and workflows.