wtf is Harness Engineer & why is it important

Transcript

0:01

Thanks to HubSpot for sponsoring this

0:02

video.

0:04

So, something really big actually

0:05

happened in December 2025 and most of

0:07

the people didn't even realize that.

0:09

Andrej Karpathy tweeted about this last

0:10

week. It's very hard to communicate how

0:12

much programming has changed due to AI

0:14

in the last 2 months, specifically since

0:16

last December. And Greg from OpenAI also

0:19

talked about this. Since December,

0:21

there have been step-function improvements in

0:22

what the model and tools are capable of.

0:24

And a few engineers have told him that

0:26

their job has fundamentally changed

0:27

since December 2025. So, what actually

0:30

happened in December 2025? In short,

0:33

the latest model introduced then

0:35

is finally ready for fully autonomous

0:37

long-running tasks. So, with AI, the

0:39

ultimate dream is always that while we

0:41

are sleeping, AI can just work on tasks

0:43

fully autonomously 24/7. Even back in 2023,

0:47

the most popular project, if you

0:48

remember, was called AutoGPT. It was

0:50

the first time this kind of fully autonomous agent

0:52

was introduced. It had a

0:54

very basic and simple architecture that

0:56

used GPT-4 as the model to autonomously

0:59

break down a list of tasks based on

1:00

the user's goal, plus simple memory storage to

1:03

store the result. And people were doing

1:04

some pretty crazy stuff like just give

1:06

it a goal, "make $100,000," and let it

1:08

loop through tasks infinitely until

1:10

complete. Back then, the system would just

1:12

break and fail miserably because the

1:14

model was simply not ready. But since

1:16

December last year, this has really changed.

1:18

The models have significantly higher

1:19

quality, long-term coherence, and they

1:22

can power through much larger and longer

1:24

tasks. And we saw all sorts of different

1:26

experiments come out of the industry.

1:28

Firstly, from January, we got this super

1:30

hot concept called the Ralph loop: the most

1:32

basic and simple agent iteration loop to

1:34

force the model to work longer so that it can

1:36

take on more complex tasks. You just loop

1:39

the model with some simple condition

1:40

checks. But already, we started seeing the

1:42

difference. And 1 week later, Cursor

1:44

also released their experiment

1:46

where they used GPT-5.2 to autonomously

1:49

build a browser from scratch with 3

1:51

million lines of code. And Anthropic

1:53

also released an experiment they

1:54

ran, where they got a team of Claude Code instances

1:57

to autonomously work on a C compiler

2:00

from scratch for 2 weeks. In the end, it

2:02

delivered a functional version with zero

2:04

manual coding. You can even run Doom

2:06

with this compiler, as well. At the same

2:09

time, OpenClaw started gaining attention

2:11

and had this explosive growth that we

2:13

had never seen before. And it was very

2:14

difficult to understand what was going

2:16

on with OpenClaw, cuz from the outside,

2:18

it's very easy to categorize OpenClaw

2:20

as just another Manus, but living

2:23

inside your own computer and also

2:25

accessible from Telegram. Like, why is it so

2:28

popular? Only later, after I used it

2:30

deeply, I realized that the real

2:32

difference is that OpenClaw represents

2:35

this type of always-on, long-running,

2:37

fully autonomous agent. That is very

2:39

different from all the other agentic

2:41

systems we used before, where a human is

2:43

the main driver, prompting for the next

2:45

action. OpenClaw is always-on and it is

2:47

proactive. And this autonomous feeling

2:50

is created by a very simple

2:51

architecture, where it has a memory and

2:53

context layer with a trigger and a cron

2:55

job to automatically take actions, and

2:57

has full computer access, which is a

2:59

powerful environment it can operate in.
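
A minimal sketch of that always-on shape, assuming nothing about OpenClaw's actual code: a scheduler (for example a cron entry) calls one tick, the tick loads persisted memory, lets the agent act, and saves what happened. The names `run_tick` and `agent_step`, and the JSON memory file, are hypothetical and exist only for illustration.

```python
import json
from pathlib import Path


def run_tick(memory_path, agent_step, now):
    """Run one scheduled tick: load memory, let the agent act, persist the result.

    `agent_step` is a hypothetical callable standing in for a model call.
    It receives the memory dict and the current time, and returns a note
    string, or None when there is nothing to do on this tick.
    """
    path = Path(memory_path)
    # Memory lives on disk so every tick starts from what earlier ticks left behind.
    memory = json.loads(path.read_text()) if path.exists() else {"notes": []}
    note = agent_step(memory, now)
    if note is not None:
        memory["notes"].append({"at": now, "note": note})
        path.write_text(json.dumps(memory, indent=2))
    return memory
```

In this sketch the trigger stays outside the function: a cron job or a message webhook would call `run_tick` on its own schedule, which is what makes the agent proactive rather than prompt-driven.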

3:01

And I believe OpenClaw is the first

3:03

project that really opened up the biggest

3:05

paradigm shift of 2026: we are

3:08

moving from a co-pilot, simple

3:10

task-based agent system to those

3:12

long-running, fully autonomous agents.

3:14

Something that's always-on, always

3:16

ready, autonomously delivering super

3:18

complex, coordinated work. This is a

3:20

critical shift you have to understand.

3:22

The model today is actually much more

3:24

powerful than you think, as long as you

3:26

design the right system to unlock it. And

3:28

this is the crux of what I want to talk

3:29

about today: harness engineering to

3:31

enable long-running autonomous

3:33

systems. If this is the first time you've heard

3:35

about harness engineering, it's like an

3:36

evolution of what we've

3:38

previously talked about, which is

3:39

context engineering or prompt engineering. So,

3:41

previously, we really focused on how to

3:43

optimize the prompts within the

3:44

effective context window to get the model

3:46

to perform best for a single

3:48

agent loop session. But harness engineering

3:50

is really focused on those long-running

3:51

tasks, which means how do you design a

3:54

system that can work across different

3:56

sessions and multiple different agents?

3:57

And how do you design the right workflow

3:59

to make sure the relevant context will

4:01

be retrieved for each session, and the right

4:03

set of tooling to extract the most out of

4:04

models? This is a fairly new concept, but

4:06

the good thing is that the industry has already

4:08

converged on some best practices that you

4:10

can use from Anthropic, Vercel,

4:12

LangChain, and many others. We'll go

4:14

through each one of them one by one so

4:15

you can see the patterns. But before we

4:17

dive into this: with this paradigm shift

4:19

to fully autonomous agents, one of the

4:20

biggest opportunities for the next 6-12

4:22

months is to build an OpenClaw for

4:24

certain verticals, which means you deeply

4:26

investigate and understand the

4:27

end-to-end workflow of a certain

4:29

vertical and build an autonomous agent

4:31

with the correct environment and tooling to

4:33

enable the end-to-end process. That's

4:34

why I want to introduce you to this

4:36

awesome research HubSpot did: the AI

4:38

Adoption in Email Marketing report. It

4:40

is a fascinating report for

4:42

understanding, for a vertical like email

4:43

marketing, where people actually use AI

4:45

today and what the gaps are. Cuz this

4:47

report showcases clear workflows and

4:49

opportunities in email marketing that you can

4:51

potentially automate. They surveyed

4:52

hundreds of email marketers from top

4:54

companies to understand exactly how AI

4:56

is reshaping their workflows. They talk

4:58

about why marketers are still doing a

5:00

lot of heavy editing, what the costs

5:03

are, as well as the biggest challenges

5:04

they are facing today when implementing

5:06

AI in email marketing. And each of

5:08

them is a big opportunity for you to

5:09

build a fully autonomous agent. They

5:11

even dive into the specific KPIs that

5:13

they care most about and where AI has shown

5:15

proven results, as well as exactly what

5:17

email marketers really want

5:19

from AI. So, if you're a builder who is

5:22

thinking about the next big agent

5:23

product to build, I highly recommend you

5:25

go check out this awesome resource. I

5:27

put the link in the description below

5:28

for you to download for free. And thanks

5:30

HubSpot for sponsoring this video. Now,

5:32

let's get back to harness engineering for

5:34

long-running agent systems. At a high

5:36

level, there are three learnings I took

5:38

away from those. One is that for

5:40

long-running task agents, the critical

5:42

part of system design is creating this

5:44

legible environment where each sub-agent

5:47

or session can actually understand

5:49

where things are at. And most likely

5:50

there are some workflows that can be used

5:52

to enforce legibility of the

5:53

environment. And I'll expand a bit more

5:55

on that. The second is that verification is

5:57

critical. You can improve a system's

5:59

output significantly by allowing it to

6:01

verify its work effectively with a faster

6:03

feedback loop. And third is that we need

6:05

to trust the model more instead of building

6:07

specialized tooling that wraps a lot of

6:09

reasoning and logic prematurely. We

6:11

should give the model max context with

6:13

the generic tooling that it needs to be

6:15

able to understand and explore like

6:17

a human. And I'll unpack those three

6:18

things one by one as we go through each

6:19

blog here. First is Anthropic's

6:21

"Effective harnesses for long-running

6:23

agents" blog. So they experimented with

6:25

using the Claude Code SDK to build a

6:27

specialized agent for super long-running

6:29

tasks like building a clone of the claude.ai

6:32

website. The very first failure they

6:34

observed is that the agent tends to

6:36

do too much at once. Essentially it will

6:38

always try to one-shot the whole app.

6:40

And this led to the model running out of

6:42

context in the middle of its

6:43

implementation and leaving the next

6:45

session to start with the feature half

6:47

implemented or undocumented. Then the agent

6:49

would have to guess what actually

6:50

happened and spend substantial time

6:52

trying to get the basic app working

6:54

again. The second failure they observed

6:56

is that the agent tends to declare the job

6:58

complete prematurely. You probably

7:00

experienced this a few times yourself as

7:01

well: Claude Code or Cursor would

7:03

just claim the project or feature is

7:04

completed. But once you test it, it

7:06

actually doesn't work. So their approach

7:08

to solving those default model failure

7:10

behaviors is that first they set up an

7:11

initial environment that lays the

7:13

foundation for all the features that

7:15

a given prompt requires, which sets up the

7:17

agent to work step by step and feature

7:18

by feature. So this is kind of similar to

7:20

the plan or PRD approach that we

7:23

normally take. The second is that they

7:24

prompt each agent to make

7:25

incremental progress towards its goal

7:28

while also leaving the environment in

7:30

a clean state at the end of each session. What

7:32

they did is design this

7:34

two-part solution. First they would have

7:36

this initializer agent that uses a

7:38

specialized prompt to ask the model to set

7:40

up the initial environment with an init.sh

7:43

script, which will set up the dev server,

7:44

for example, so that the next model doesn't

7:46

need to worry about those things. And

7:48

it also creates a progress.txt file that

7:50

keeps a log of what agents have done, as

7:52

well as an initial Git commit that shows

7:54

what files have been added. Then a

7:56

coding agent in each subsequent

7:58

session asks the model to make

7:59

incremental progress, then leave

8:01

structured updates. And all those

8:03

efforts really try to serve one

8:05

purpose: how can they define an

8:07

environment where agents can quickly

8:09

understand the state of work when starting

8:11

with a fresh context window. So the workflow

8:13

is that the initializer agent would first

8:15

try to set up an environment, or you can

8:17

call it a documentation system, to track

8:20

and maintain the overall plan. And the

8:21

environment they define here is, first,

8:23

a feature list document

8:25

to prevent the agent from one-shotting the whole

8:27

app or prematurely considering the

8:29

project complete. Instead, they would

8:30

get the initializer agent to break down the

8:32

project into over 200 features and log

8:35

them in a local JSON file that looks something

8:37

like this, where each task has a detailed

8:38

spec, as well as a pass or fail state. By

8:41

default, all tasks will be marked as

8:43

fail. So it forces the model to always look

8:45

at the overall project goal and the

8:47

progress, pick up the highest-priority task,

8:49

and do the next thing. But to make this

8:51

workflow work, they also need a way to

8:52

force the model to leave the environment in

8:55

a clean state after making the code

8:57

change. In their experiment, they found

8:58

the best way is to ask the model to

9:00

commit its progress to Git with

9:02

descriptive commit message and write a

9:05

summary of its progress in the progress

9:06

file. But just the documentation and

9:08

context environment by itself is not

9:10

enough, because models by default have

9:11

this tendency to mark something as

9:13

completed without proper testing. At the

9:15

beginning, they were just prompting

9:17

Claude Code to always test after

9:19

each code change by running unit tests or

9:21

API tests against the dev server. But

9:23

those often failed to

9:24

recognize that a feature was not working

9:26

end-to-end. Things really started

9:27

changing when they gave the model proper

9:29

tooling to do end-to-end tests by

9:31

itself, like Puppeteer MCP or Chrome

9:33

DevTools, where the agent was able to

9:35

identify and fix bugs that were not

9:37

directly obvious from the code itself.
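
The feature list and the "verify before marking complete" rule described above can be sketched roughly as follows. This is a minimal illustration under assumptions: the field names (`id`, `priority`, `spec`, `passes`), the convention that a lower number means higher priority, and the helper names are all hypothetical and not taken from Anthropic's write-up.

```python
import json
from pathlib import Path


def init_feature_list(path, specs):
    """Write the feature list with every feature failing by default."""
    features = [
        {"id": i, "priority": priority, "spec": spec, "passes": False}
        for i, (priority, spec) in enumerate(specs, start=1)
    ]
    Path(path).write_text(json.dumps(features, indent=2))
    return features


def next_feature(path):
    """Return the highest-priority feature that still fails, or None."""
    features = json.loads(Path(path).read_text())
    pending = [f for f in features if not f["passes"]]
    # Assumption: a lower priority number means "do this first".
    return min(pending, key=lambda f: f["priority"]) if pending else None


def mark_passing(path, feature_id, verified):
    """Flip a feature to passing only if an end-to-end check succeeded."""
    if not verified:
        return False  # refuse to record unverified work as done
    features = json.loads(Path(path).read_text())
    for feature in features:
        if feature["id"] == feature_id:
            feature["passes"] = True
    Path(path).write_text(json.dumps(features, indent=2))
    return True
```

The point of the design is that "done" is a state in a file the next session can read, and the only way to change it goes through a verification result rather than the model's own claim.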

9:39

So, basically, they are setting up a

9:40

structure where they have the

9:41

initializer agent break down the

9:43

user's goal into a list of features,

9:45

alongside an init.sh to be able to run

9:47

the dev server, and progress files. So,

9:49

the next coding agent can just read the

9:51

feature list to get an understanding

9:52

of the overall project plan and pick up a

9:54

high-priority task, read the progress file and

9:56

git log to understand where things

9:58

are at, then run init.sh to start the dev

10:00

server immediately and do an end-to-end

10:03

test to verify the environment is clean.
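
A small sketch of the progress file side of this workflow, assuming a plain-text log where each session appends one block and a fresh session reads only the most recent blocks. The block layout and the function names are assumptions for illustration; the source only says that a progress file holds a summary of what each agent did.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_progress(path, session_id, summary, next_steps):
    """Append one structured entry at the end of a session.

    Assumes `summary` and `next_steps` contain no blank lines, because
    blank lines separate entries in this hypothetical layout.
    """
    stamp = datetime.now(timezone.utc).isoformat()
    entry = (
        f"## session {session_id} ({stamp})\n"
        f"done: {summary}\n"
        f"next: {next_steps}\n\n"
    )
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(entry)


def recent_progress(path, last_n=3):
    """Return the last few entries so a fresh context window can catch up."""
    file = Path(path)
    if not file.exists():
        return []
    text = file.read_text(encoding="utf-8")
    blocks = [block.strip() for block in text.split("\n\n") if block.strip()]
    return blocks[-last_n:]
```

Reading only the tail keeps the startup context small; the full history stays available in the file and in the Git log if the agent needs to dig further.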

10:05

So that it gets a full picture and a

10:07

faster feedback loop whenever a new

10:08

session and context window starts. In

10:10

OpenAI's blog, they talk about a very

10:12

similar thing. You have to make sure

10:14

your application environment is legible.

10:15

They make the whole repository knowledge

10:17

the system of record. Initially, they

10:19

put in a gigantic AGENTS.md file, and it failed

10:22

in predictable ways because it's just

10:24

too much context for any agent to manage

10:26

and maintain. So, what they did is

10:27

design a proper document environment

10:29

structure and treat the AGENTS.md file as a

10:31

table of contents. So, they set up this

10:34

documentation system from architectures,

10:36

the design docs, the execution plan, DB

10:38

schema, product specs, and design

10:40

front-end plan, security, and many more.
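
One way to picture the "table of contents" idea is a short index file generated from a docs folder, so the agent reads the index first and opens individual documents only when a task needs them. This is a sketch under assumptions: the folder layout, the wording of the index, and the function name `build_agents_md` are hypothetical and not OpenAI's actual setup.

```python
from pathlib import Path


def build_agents_md(docs_dir, title="AGENTS.md"):
    """Build a short index of the docs folder instead of one giant file."""
    docs = Path(docs_dir)
    lines = [
        f"# {title}",
        "",
        "Read only the documents relevant to the current task.",
        "",
    ]
    for doc in sorted(docs.rglob("*.md")):
        text_lines = doc.read_text(encoding="utf-8").splitlines()
        # Use the first non-blank line as the entry title, else the file name.
        heading = next(
            (line.lstrip("# ").strip() for line in text_lines if line.strip()),
            doc.stem,
        )
        lines.append(f"- [{heading}]({doc.relative_to(docs).as_posix()})")
    return "\n".join(lines) + "\n"
```

The index stays a few dozen lines no matter how much documentation sits behind it, which is the property that makes progressive disclosure possible.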

10:42

And they put this table of contents into the

10:44

AGENTS.md file. So, the agent can

10:46

actually retrieve relevant

10:47

information when needed. And this

10:49

enables progressive disclosure. And

10:51

OpenAI actually took that even further.

10:53

They would try to push not only the code

10:55

knowledge, but also Google Docs, Slack

10:57

messages, all that other fragmented

10:59

information, and feed the data into the

11:01

repository as repository-local, versioned

11:03

artifacts, so the agent can also

11:05

retrieve them. Because from the agent's point of

11:07

view, if anything can't be accessed in

11:09

the environment, then effectively it

11:11

doesn't exist. But again, documentation

11:12

by itself doesn't really keep a fully

11:14

agent-generated code base coherent. They

11:16

also introduced certain programmatic

11:18

workflows to enforce invariants. For

11:20

example, they used a layered domain

11:21

architecture with explicit cross-cutting

11:23

boundaries, which allowed them to

11:25

enforce those rules with custom checks,

11:27

linters, and structural tests, which can

11:29

be automatically triggered and injected

11:31

by every Git pre-commit hook. This type

11:34

of architecture you would usually

11:35

postpone until you have hundreds of

11:37

engineers in a traditional software

11:38

company. But with coding agents, it's an

11:40

early prerequisite. Within those

11:42

boundaries, you allow teams and agents

11:44

significant freedom in how solutions are

11:46

expressed, without micromanaging or

11:48

worrying that the architecture will drift.
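
A structural test of this kind can be quite small. The sketch below checks that a module in one layer does not import from a layer it is not allowed to depend on. The layer names and the `ALLOWED` table are hypothetical; a real project would derive them from its own architecture docs and run the check from a pre-commit hook or CI.

```python
import ast

# Hypothetical layering: each layer may import only from the layers listed.
ALLOWED = {
    "ui": {"service", "domain"},
    "service": {"domain"},
    "domain": set(),
}


def layer_violations(layer, source):
    """Return the imports in `source` that cross a boundary `layer` may not cross."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            top = module.split(".")[0]
            # Only imports that target another known layer are checked.
            if top in ALLOWED and top != layer and top not in ALLOWED[layer]:
                violations.append(module)
    return violations
```

Because the check returns the offending module names, its output can be fed straight back to the agent as a concrete instruction for what to fix.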

11:49

Meanwhile, they also improved the code

11:51

base a lot. For example, they made the app

11:53

bootable per Git worktree, so Codex

11:54

can just launch and drive many different

11:56

instances. And they also wired the Chrome

11:58

DevTools protocol into the agent

12:00

runtime, so that the agent can

12:01

reproduce bugs and validate fixes via DOM

12:03

snapshots, screenshots, and navigation.

12:05

And with that environment and workflow

12:07

setup, the repository finally crossed a

12:09

minimum threshold where Codex can

12:11

end-to-end drive a new feature. So,

12:13

every time Codex receives a single

12:15

prompt, the agent will start by validating

12:17

the current state of the code base,

12:18

reproduce a reported bug, record a video

12:20

to demonstrate the failure, implement a

12:22

fix, validate the fix by driving the

12:24

application, record a second video

12:26

demonstrating the resolution, and

12:27

eventually merge the change. So, those

12:29

two blogs show very good learnings and the

12:31

necessary harness systems you need to put

12:32

in place for a fully autonomous system.

12:34

Meanwhile, there are also certain

12:35

learnings. Quite often when building

12:37

agents, especially vertical specific

12:38

agents, our tendency is to build

12:40

specialized tooling to do domain

12:42

specific tasks. The learning we got is

12:44

that large language models almost always

12:46

work better with generic tools that they

12:48

natively understand. Vercel released an

12:50

awesome article about how they redesigned

12:52

their text-to-SQL agent. So, they spent

12:54

months building a sophisticated internal

12:55

text-to-SQL agent, d0, with specialized

12:58

tools, heavy prompt engineering, and

13:00

careful context management. But as many

13:02

of us have experienced before, this type of

13:04

system kind of works, but is very

13:05

fragile and slow, and requires constant

13:07

maintenance, because every time a new edge

13:09

case happens, you need to engineer a

13:11

new prompt for the agent. But later, they

13:12

tried one thing that totally changed

13:14

the trajectory. They deleted most of the

13:16

specialized tools from the agent, down to

13:18

a single bash command tool. And with

13:20

this much simpler architecture, the

13:22

agent actually performed 3.5 times

13:24

faster with 37% fewer tokens, and

13:27

success rate increased from 80% to 100%.
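
To make the "single bash tool" idea concrete, here is a minimal sketch of what such a tool might look like on the harness side. It is not Vercel's or Anthropic's implementation: the function name `run_bash`, the return shape, and the limits are assumptions, and a real harness would run it inside a sandbox rather than on the host.

```python
import subprocess


def run_bash(command, timeout=30, max_chars=4000):
    """Run one shell command and return what the model needs to see.

    Hypothetical single generic tool: the model writes the command
    (grep, tail, npm run lint, ...) and reads the combined output.
    Note that shell=True uses the system default shell, which may be
    sh rather than bash.
    """
    try:
        proc = subprocess.run(
            command,
            shell=True,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "output": f"timed out after {timeout}s"}
    # Keep only the tail so a noisy command cannot flood the context window.
    output = (proc.stdout + proc.stderr)[-max_chars:]
    return {"exit_code": proc.returncode, "output": output}
```

For example, `run_bash("grep -rn TODO src | tail -5")` would cover what might otherwise be a bespoke search tool, using syntax the model has already seen many times.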

13:30

A similar learning has been shared by the

13:31

Anthropic team as well, where they talk

13:33

about how, instead of having specialized

13:35

search, lint, and execute tools, they just

13:37

have one bash tool that can run

13:38

grep, tail, npm, npm run lint. And

13:41

fundamentally, I think it's because

13:43

these large language models are much more

13:44

familiar with those code-native tools,

13:46

which have billions of training tokens,

13:48

than with the bespoke tool-calling JSON they

13:50

need to generate. And I've talked about

13:52

this in the programmatic tool calling video

13:54

that I released last week. And I believe

13:55

it is a similar fundamental principle

13:57

here. But the foundation of this simple

13:59

architecture is, again, a good context

14:01

and documentation environment where

14:03

the model can use generic tools to retrieve

14:05

context progressively. And it is the same

14:07

case with OpenClaw. One reason OpenClaw

14:09

is so interesting is that they have

14:11

a surprisingly simple but effective

14:13

context environment. They have a list of

14:15

documents to store core

14:16

information. With this foundation, they

14:18

only have the most basic tooling like

14:20

read, write, edit files, run bash

14:22

commands, and send messages. All the rest

14:25

comes from giving the agent an environment

14:26

to retrieve relevant context, plus a big

14:28

skill library to expand capabilities.

14:31

So, those are the three practical learnings

14:33

about how to do harness engineering for

14:34

long-running complex agents. First, have

14:36

a legible context environment to enable

14:39

each session to grab context

14:41

effectively; second, the right workflow and

14:42

tooling so that the model can verify its

14:45

work effectively and drive a faster feedback

14:47

loop; and third, trust the agent with generic tools

14:49

that it natively understands. If you're

14:51

interested, I'm going to share more in

14:52

depth about how I take these learnings

14:54

and transform them into a development life

14:56

cycle process. In AI Product Club, we

14:59

have courses and walkthroughs on live

15:01

coding and building production agents.

15:03

And every week, myself and industry

15:05

experts share the latest practical

15:07

learnings. So, if you're interested in

15:09

learning what I'm learning every day,

15:11

you can click the link below to join

15:12

the community. I hope you enjoyed this

15:14

video. Thank you, and I'll see you next

15:15

time.

Interactive Summary

This video explores the paradigm shift toward fully autonomous, long-running AI agents that gained significant momentum in late 2025. It introduces the concept of 'harness engineering'—a framework for designing robust, persistent environments where models can effectively manage complex tasks, perform self-verification, and utilize generic tools to achieve high-level outcomes. Key takeaways include the importance of making project environments 'legible' through structured documentation, prioritizing fast feedback loops, and trusting the model's native proficiency with standard tools over bespoke, complex abstractions.
