wtf is Harness Engineer & why is it important
Thanks to HubSpot for sponsoring this
video.
So, something really big actually
happened in December 2025, and most
people didn't even realize it.
Andrej Karpathy tweeted about this last
week: it's very hard to communicate how
much programming has changed due to AI
in the last 2 months, specifically since
last December. And Greg from OpenAI also
talked about this: since December,
there's been a step-function improvement
in what the models and tools are capable of.
And a few engineers have told him that
their job has fundamentally changed
since December 2025. So, what actually
happened in December 2025? In short,
the latest models introduced then
are finally ready for fully autonomous,
long-running tasks. So, with AI, the
ultimate dream is always that while we
are sleeping, AI can just work on tasks
fully autonomously 24/7. Even back in 2023,
the most popular project, if you
remember, was called AutoGPT. It was
the first time this idea of a fully
autonomous agent was introduced. It had a
very basic and simple architecture:
it used GPT-4 as the model to autonomously
break down a list of tasks based on the
user's goal, plus simple memory storage to
store the results. And people were doing
some pretty crazy stuff, like just giving
it a goal, "make $100,000", and letting it
loop through tasks infinitely until
complete. Back then, the system just
broke and failed miserably because the
model was simply not ready. But since
December last year, this really changed.
The models have significantly higher
quality, long-term coherence, and they
can power through much larger and longer
tasks. And we saw all sorts of different
experiments come out of the industry.
First, in January, we got this super
hot concept called the Ralph loop: the most
basic and simple agent iteration loop,
forcing the model to work longer so that it
can take on more complex tasks. You just
loop the model with some simple condition
checks. But already, we started seeing the
difference. And 1 week later, Cursor
also released their experiment,
where they used GPT-5.2 to autonomously
build a browser from scratch with 3
million lines of code. And Anthropic
also released an experiment
where they got a team of Claude Code
instances to autonomously work on a C
compiler from scratch for 2 weeks. In the
end, it delivered a functional version
with zero manual coding. You can even
compile and run Doom with this compiler,
as well. At the same
time, OpenClaw started gaining attention
and had this explosive growth that we've
never seen before. And it was very
difficult to understand what was going
on with OpenClaw, cuz from the outside,
it's very easy to categorize OpenClaw
as just another Manus, but living
inside your own computer and also
accessible from Telegram. Like, why is it so
popular? And only later, after I used it
deeply, I realized that the real
difference is that OpenClaw represents
this type of always-on, long-running,
fully autonomous agent. That is very
different from all the other agentic
systems we used before, where the human is
the main driver, prompting for the next
action. OpenClaw is always-on and it is
proactive. And this autonomous feeling
is created by a very simple
architecture: it has a memory and
context layer, with a trigger and a cron
job to automatically take actions, and
it has full computer access, which is
a powerful environment to operate in.
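A minimal sketch of that always-on pattern, assuming nothing about OpenClaw's real code: scheduled tasks fire on an interval, each run sees recent memory as context, and the result is written back to memory. `AlwaysOnAgent`, `ScheduledTask`, `tick`, and the `run_agent` callback are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ScheduledTask:
    # Hypothetical cron-style entry: run `prompt` every `interval_s` seconds.
    name: str
    prompt: str
    interval_s: float
    last_run: float = float("-inf")  # never run yet, so it is due immediately


@dataclass
class AlwaysOnAgent:
    memory: list[str] = field(default_factory=list)  # stand-in for a memory layer
    tasks: list[ScheduledTask] = field(default_factory=list)

    def tick(self, now: float, run_agent: Callable[[str, str], str]) -> list[str]:
        """Run every task that is due at time `now`; return the names that ran."""
        ran = []
        for task in self.tasks:
            if now - task.last_run >= task.interval_s:
                # Each run starts from recent memory, not from a human prompt.
                context = "\n".join(self.memory[-20:])
                result = run_agent(task.prompt, context)
                self.memory.append(f"{task.name}: {result}")
                task.last_run = now
                ran.append(task.name)
        return ran
```

In a real setup, something like a cron job or timer would call `tick`, and `run_agent` would invoke a model with tool access; both are left as assumptions here.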
And I believe OpenClaw is the first
project that really opened up the biggest
paradigm shift in 2026: we are
moving from co-pilot, simple
task-based agent systems to
long-running, fully autonomous agents.
Something that's always-on, always
ready, autonomously delivering super
complex, coordinated work. This is a
critical shift you have to understand.
The models today are actually much more
powerful than you think, as long as you
design the right system to unlock them. And
this is the crux of what I want to talk
about today: harness engineering to
really enable long-running autonomous
systems. If it's the first time you've heard
about harness engineering, it's like an
evolution from what we've
previously talked about, which is
context engineering or prompt engineering. So,
previously, we really focused on how to
optimize the prompts within the
effective context window to get the model
to perform its best in a single
agent loop session. But harness engineering
is really focused on those long-running
tasks, which means: how do you design a
system that can work across different
sessions and multiple different agents?
And how do you design the right workflow
to make sure the relevant context is
retrieved for each session, with the right
set of tooling to extract the most out of
the models? This is a fairly new concept, but
the good thing is that the industry has
already converged on some best practices
that you can use, from Anthropic, Vercel,
LangChain, and many others. We'll go
through each one of them one by one so
you can see the patterns. But before we
dive into this: with this paradigm shift
to fully autonomous agents, one of the
biggest opportunities for the next 6-12
months is building an OpenClaw for a
certain vertical, which means you deeply
investigate and understand the
end-to-end workflow of a certain
vertical and build an autonomous agent
with the correct environment and tooling
to enable the end-to-end process. That's
why I want to introduce you to this
awesome research HubSpot did, the AI
adoption in email marketing report. It
is a fascinating report for
understanding, for a vertical like email
marketing, where people actually use AI
today and what the gaps are. Cuz this
report showcases clear workflows and
opportunities in email marketing that you
can potentially automate. They surveyed
hundreds of email marketers from top
companies to understand exactly how AI
is reshaping their workflows. They talk
about why marketers are still doing a
lot of heavy editing, what the cost of
that is, as well as the biggest challenges
they are facing today when implementing
AI in email marketing. And each of
them is a big opportunity for you to
build a fully autonomous agent. They
even dive into the specific KPIs that
marketers care most about and where AI has
shown proven results, as well as exactly
what email marketers really want
from AI. So, if you're a builder who is
thinking about the next big agent
product to build, I highly recommend you
go check out this awesome resource. I
put the link in the description below
for you to download for free. And thanks
to HubSpot for sponsoring this video. Now,
let's get back to harness engineering for
long-running agent systems. At a high
level, there are three learnings I took
away from those blogs. One is that for
long-running task agents, the critical
part of system design is creating a
legible environment where each sub-agent
or session can actually understand
where things are at. And most likely
there are some workflows that can be used
to enforce the legibility of the
environment. I'll expand a bit more
on that. The second is that verification
is critical. You can improve the system's
output significantly by allowing it to
verify its work effectively with a faster
feedback loop. And the third is that we
need to trust the model more instead of
building specialized tooling that wraps a
lot of reasoning and logic prematurely. We
should give the model maximum context, with
the generic tooling it needs to be
able to understand and explore like a
human. I'll unpack those three
things one by one as we go through each
blog here. First is Anthropic's
"Effective harnesses for long-running
agents" blog. So they've experimented
using the Claude Code SDK to build a
specialized agent for super long-running
tasks, like building a clone of the
claude.ai website. The first failure they
observed is that the agent tends to
do too much at once. Essentially, it will
always try to one-shot the whole app.
This led to the model running out of
context in the middle of its
implementation, leaving the next
session to start with a feature half
implemented and undocumented. Then the agent
would have to guess what actually
happened and spend substantial time
trying to get the basic app working
again. The second failure they observed
is that the agent tends to declare the job
complete prematurely. You've probably
experienced this a few times yourself as
well: Claude Code or Cursor would
just claim the project or feature is
completed, but once you test it, it
actually doesn't work. So their approach
to solving those default model failure
behaviors is that first, they set up an
initial environment that lays the
foundation for all the features that the
given prompt requires, which sets up the
agent to work step by step and feature
by feature. So this is kind of similar to
the plan or PRD approach that we
normally take. The second is that they
prompt each agent to make
incremental progress towards its goal
while also leaving the environment in a
clean state at the end of each session. What
they did is design this
two-part solution. First, they have
an initializer agent that uses a
specialized prompt to ask the model to set
up the initial environment with an init.sh
script, which will set up the dev server,
for example, so that the next model doesn't
need to worry about those things.
It also creates a progress.txt file that
keeps a log of what agents have done, as
well as an initial Git commit that shows
what files have been added. Then a
coding agent is used for each subsequent
session, asking the model to make
incremental progress, then leave
structured updates. All those
efforts really try to serve one
purpose: how can they define an
environment where agents can quickly
understand the state of work when starting
with a fresh context window? So the workflow
is that the initializer agent first
sets up an environment, or you can
call it a documentation system, to track
and maintain the overall plan. In the
environment they define here, they first
have a feature list document
to prevent the agent from one-shotting the
whole app or prematurely considering the
project complete. Instead, they
get the initializer agent to break down the
project into over 200 features and log
them in a local JSON file that looks
something like this, where each task has a
detailed spec, as well as a pass or fail
state. By default, all tasks are marked as
failing. This forces the model to always
look at the overall project goal and the
progress, pick up the highest-priority
task, and do the next thing. But to make this
workflow work, they also need a way to
force the model to leave the environment in
a clean state after making a code
change. In their experiment, they found
the best way is to ask the model to
commit its progress to Git with
descriptive commit messages and write a
summary of its progress in the progress
file. But documentation and the
context environment by themselves are not
enough, because the model by default has
a tendency to mark something as
completed without proper testing. At
the beginning, they were just prompting
Claude Code to always test after
a code change by running unit tests or
API tests against the dev server. But
those often failed to
recognize that a feature was not working
end-to-end. Things really started
changing when they gave the model proper
tooling to do end-to-end tests by
itself, like Puppeteer MCP or Chrome
DevTools, so the agent was able to
identify and fix bugs that were not
directly obvious from the code itself.
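As a rough illustration of the feature list described above, here is a minimal sketch, not Anthropic's actual schema. The field names (`id`, `description`, `priority`, `passes`) and the helper functions are assumptions. The idea it shows: every feature starts as failing, each session picks the highest-priority failing feature, and a feature is only flipped to passing after it has been verified.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Feature:
    id: str
    description: str
    priority: int         # assumption: lower number means higher priority
    passes: bool = False  # every feature starts out marked as failing


def load_features(raw: str) -> list[Feature]:
    """Parse the JSON feature list written by the initializer agent."""
    return [Feature(**item) for item in json.loads(raw)]


def next_feature(features: list[Feature]) -> Optional[Feature]:
    """Pick the highest-priority feature that is still failing, if any."""
    pending = [f for f in features if not f.passes]
    return min(pending, key=lambda f: f.priority) if pending else None


def mark_passed(features: list[Feature], feature_id: str) -> None:
    """Flip a feature to passing; meant to be called only after end-to-end verification."""
    for f in features:
        if f.id == feature_id:
            f.passes = True
            return
    raise KeyError(feature_id)


def dump_features(features: list[Feature]) -> str:
    """Serialize the list back to JSON so the next session can read it."""
    return json.dumps([asdict(f) for f in features], indent=2)
```

A session would call `next_feature`, do the work, run its end-to-end check, then call `mark_passed` and write the file back alongside the Git commit and progress note.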
So, basically, they are setting up a
structure where the
initializer agent breaks down the
user's goal into a list of features,
alongside an init.sh to run
the dev server, and progress files. So,
the next coding agent can just read the
feature list to get an understanding
of the overall project plan and pick up a
high-priority task, then read the progress
file and git log to understand where
things are at. Then it runs init.sh to
start the dev server immediately and does
an end-to-end test to verify the
environment is clean. That way, it gets the
full picture and a faster feedback loop
each time a new session and context
window starts. In
OpenAI's blog, they talk about a very
similar thing: you have to make sure
your application environment is legible.
They make the repository's knowledge
the system of record. Initially, they
used one gigantic AGENTS.md file, and it
failed in predictable ways because it's just
too much context for any agent to manage
and maintain. So, what they did is
design a proper documentation
structure and treat the AGENTS.md file as a
table of contents. They set up this
documentation system covering architecture,
design docs, execution plans, DB
schema, product specs, design,
front-end, plans, security, and many more.
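One way to picture AGENTS.md as a table of contents is the sketch below, which scans a docs folder and emits a short index with one line per document. This is only an illustration: the folder layout, the `build_table_of_contents` name, and the wording of the index are assumptions, not OpenAI's actual setup.

```python
from pathlib import Path


def build_table_of_contents(docs_dir: Path) -> str:
    """Build a short AGENTS.md-style index: one line per doc, no doc bodies."""
    lines = [
        "# AGENTS.md",
        "",
        "Read only the documents relevant to the current task:",
        "",
    ]
    for path in sorted(docs_dir.rglob("*.md")):
        text_lines = path.read_text(encoding="utf-8").splitlines()
        # Use the first line (usually a heading) as the summary, else the file name.
        title = text_lines[0].lstrip("# ").strip() if text_lines else path.stem
        lines.append(f"- {path.relative_to(docs_dir).as_posix()}: {title}")
    return "\n".join(lines) + "\n"
```

The index stays small no matter how much documentation exists, and the agent opens an individual file only when a task needs it.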
And they put this table of contents into
the AGENTS.md file, so the agent can
retrieve the relevant
information when needed. This
enables progressive disclosure.
OpenAI actually takes that even further.
They try to push not only the code
knowledge, but also Google Docs, Slack
messages, and all that other fragmented
information into the
repository as repository-local, versioned
artifacts, so the agent can
retrieve those too. Because from the
agent's point of view, if something can't
be accessed in the environment, then
effectively it doesn't exist. But again,
documentation
by itself doesn't keep a fully
agent-generated code base coherent. They
also introduced programmatic
workflows to enforce invariants. For
example, they use a layered domain
architecture with explicit cross-cutting
boundaries, which allows them to
enforce those rules with custom checks,
linters, and structural tests that can
be automatically triggered
on every Git pre-commit. This type
of architecture is something you would
usually postpone until you have hundreds of
engineers in a traditional software
company. But with coding agents, it's an
early prerequisite. Within those
boundaries, you allow teams and agents
significant freedom in how solutions are
expressed, without micromanaging and
worrying that the architecture is going to drift.
Meanwhile, they also improved the code
base a lot. For example, they made the app
bootable per Git worktree, so Codex
can launch and drive many different
instances. And they also wired the Chrome
DevTools Protocol into the agent
runtime, so that the agent can
reproduce bugs and validate fixes using DOM
snapshots, screenshots, and navigation.
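The layered boundaries and structural tests mentioned a moment ago can be pictured with a small check like the one below. It is a sketch under stated assumptions: the layer names in `LAYERS`, the rule that a layer may only import from itself or lower layers, and both function names are hypothetical, not OpenAI's actual rules. It only parses source text with Python's standard `ast` module and never imports the code it inspects.

```python
import ast

# Hypothetical layer order, lowest first. Assumed rule: a module may import
# only from its own layer or from layers that appear earlier in this list.
LAYERS = ["models", "storage", "services", "ui"]


def imported_layers(source: str) -> set[str]:
    """Return the layer packages that a piece of source code imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in LAYERS:
                found.add(top_level)
    return found


def boundary_violations(layer: str, source: str) -> list[str]:
    """List the higher layers that code living in `layer` illegally imports."""
    allowed = set(LAYERS[: LAYERS.index(layer) + 1])
    return sorted(imported_layers(source) - allowed)
```

A check like this could run from a Git pre-commit hook or a test suite, failing the commit whenever `boundary_violations` returns a non-empty list.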
And with that environment and workflow
set up, the repository finally crossed a
minimum threshold where Codex can
drive a new feature end-to-end. So,
when Codex receives a single
prompt, the agent will validate
the current state of the code base,
reproduce a reported bug, record a video
to demonstrate the failure, implement a
fix, validate the fix by driving the
application, record a second video
demonstrating the resolution, and
eventually merge the change. So, those
two blogs show very good learnings about
the necessary harness system you need to
put in place for a fully autonomous system.
Meanwhile, there are also certain
learnings about tooling. Quite often when
building agents, especially
vertical-specific agents, our tendency is
to build specialized tooling to do
domain-specific tasks. The learning we got
is that large language models almost always
work better with generic tools that they
natively understand. Vercel released an
awesome article about how they redesigned
their text-to-SQL agent. They spent
months building a sophisticated internal
text-to-SQL agent, d0, with specialized
tools, heavy prompt engineering, and
careful context management. But as many
of us have experienced before, that type of
system kind of works, but is very
fragile, slow, and requires constant
maintenance, because every time a new edge
case happens, you need to engineer a
new prompt for the agent. But later, they
tried one thing that totally changed the
trajectory. They deleted most of the
specialized tools from the agent, down to
a single bash command tool. And with
this much simpler architecture, the
agent actually performed 3.5 times
faster with 37% fewer tokens, and the
success rate increased from 80% to 100%.
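To make the "single bash command tool" idea concrete, here is a minimal sketch of what such a generic tool could look like. It is not Vercel's implementation: the `run_bash` name, the timeout, the output cap, and the returned dictionary shape are all assumptions, and it assumes a `bash` binary is available. A real harness would also run this inside a sandbox, which is out of scope here.

```python
import subprocess


def run_bash(command: str, timeout_s: float = 30.0, max_chars: int = 4000) -> dict:
    """Run one shell command and return its exit code plus (truncated) output."""
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "output": f"timed out after {timeout_s}s"}
    # Keep only the tail of the output so one noisy command cannot flood the context.
    output = (proc.stdout + proc.stderr)[-max_chars:]
    return {"exit_code": proc.returncode, "output": output}
```

With one tool like this, the model can reach grep, cat, ls, test runners, and linters through commands it already knows, instead of learning a separate schema for each bespoke tool.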
A similar learning has been shared by the
Anthropic team as well, where they talk
about how, instead of having specialized
search, lint, and execute tools, they just
have one bash tool that can run
grep, tail, npm, npm run lint. And
fundamentally, I think it's because
large language models are much more
familiar with those code-native tools,
which have billions of training tokens,
versus bespoke tool-calling JSON that they
need to generate. I've talked about
this in the programmatic tool calling video
that I released last week, and I believe
similar fundamental principles apply
here. But the foundation of this simple
architecture is, again, a good context
and documentation environment where the
model can use generic tools to retrieve
context progressively. And it is the same
case with OpenClaw. One reason OpenClaw
is so interesting is that it has
a surprisingly simple but effective
context environment. It has a list of
documents to store core
information. With this foundation, it
only has the most basic tooling, like
read, write, and edit files, run bash
commands, and send messages. All the rest
comes from giving the agent an environment
to retrieve relevant context, plus a big
skill library to expand capabilities.
So, those are three practical learnings
about how to do harness engineering for
long-running, complex agents. First, have
a legible context environment to enable
each session to grab context
effectively. Second, have the right
workflow and tooling so that the model can
verify its work effectively and drive a
faster feedback loop. And third, trust the
agent with generic tools
that it natively understands. If you're
interested, I'm going to share more in
depth about how I take these learnings
and turn them into a development life
cycle process. In AI Product Club, we
have courses and walkthroughs on live
coding and building production agents.
And every week, industry
experts and I share the latest practical
learnings. So, if you're interested in
learning what I'm learning every day,
you can click the link below to join the
community. I hope you enjoyed this
video. Thank you, and I'll see you next
time.
This video explores the paradigm shift toward fully autonomous, long-running AI agents that gained significant momentum in late 2025. It introduces the concept of 'harness engineering'—a framework for designing robust, persistent environments where models can effectively manage complex tasks, perform self-verification, and utilize generic tools to achieve high-level outcomes. Key takeaways include the importance of making project environments 'legible' through structured documentation, prioritizing fast feedback loops, and trusting the model's native proficiency with standard tools over bespoke, complex abstractions.