How OpenClaw Works (and why you should build your own)
264 segments
OpenClaw is an AI agent that's taken the
world by storm. So much so that people
are calling it AGI. In this video, I'll
show you exactly how it works under the
hood and the fundamentals that let you
build your own agents that outperform it
5 to 10x on your specific use case. I'm
Roman. I published a top 3% paper at
Nurips, the largest AI conference in the
world. Now I'm on a mission to become
the best AI coder. So what is openclaw
really? At its core, it's a set of
building blocks around an LLM that gives
it the ability to perform a variety of
tasks. Think of it like an exoskeleton
built around the LLM, giving it the
capacity to perform complex tasks on a
computer.
So here's a motivating example of what
makes OpenClaw feel so special. Your
agent wakes up on its own, opens up your
browser. While logged into your account,
it starts scouring X for the latest AI
discourse. It reads through the post,
pulls out the key takeaways, and shoots
you a message. Here's your morning
briefing. All of this happens completely
autonomously without you lifting a
finger. But this isn't magic, and it
actually is relatively simple to
understand how OpenClaw actually works.
First, we start with an LLM, which would
typically be a very simple external API
call or local model. This is the entire
brains of the operation. But if we want
to talk to it from a chat interface,
especially our phone, we can route
Telegram or other channels to it. So, we
connect to a gateway, which is typically
a websocket and HTTP server, and it runs
24/7, and it ties everything together.
Since LLMs forget everything between API
calls, we need session persistence. The
way chatbots work is they paste the
entire conversation back into the next
API call. OpenClaw does this by
appending every message as a line to a
JSON L file on disk. On each API call,
that file is parsed into a messages
array and passed back to the LLM. But
long conversations eventually overflow
the model's context window. When the API
rejects the request is too large,
OpenClaw's compaction system kicks in.
It summarizes each chunk of prior
messages via the LLM, merges the
summaries and retries until the context
is below 50%.
So now we have set up a basic chatbot.
Hence, we need to get the model to
understand that it isn't just a simple
LLM and that it has tools and a
personality. The solution here is
simple. We give the model a system
prompt which is a set of markdown files
which tell it how to work in the
openclaw harness. These include soul
agents and memory. On top of that, we
give the LLM skills metadata and tool
schemas so that it knows which tool it
can actually call without giving it the
entire tool or skill into its context.
Finally, we also inject some safety and
runtime prompts that help the model
operate safely.
And to make the chatbot remember who we
are and previous conversations, we allow
the model to write to a previously
mentioned memory.mmd file for critical
information. And openclot also adds in a
rag style memory which uses a hybrid
retrieval system in order to store tons
of previous conversations and nuggets.
It then allows the model to call a
memory tool in order to search the
memory database for relevant details.
We also have an output function which is
just where the LLM talks to us. It might
output to telegram, discord or somewhere
else. Since we have provided the model
with an identity, we have to actually
tell it which actions it can take and
how. This is the exoskeleton. We call
these tool calls. The model can output
some tokens which calls the tool. The
tool triggers an external action to
occur. For example, writing code in the
computer and the tool returns tokens
back to the model. and a feedback loop
is hence created. We call this an
agentic loop. Your OpenClaw is now an
agent. One of the most critical tools
that makes OpenCloud different from most
agents is that of computer control.
OpenClaw controls your browser via a
Chrome extension relay similar to Claude
browser. And this means OpenClaw can
stay logged in. Also, it's not just
browser access. It's full-on computer
control with access to everything from
the terminal to the camera. Obviously,
this level of access comes with heavy
security trade-offs. So, please use your
at your own risk and with heavy
guardrails. But the big thing that
OpenCloud gets so much praise for is its
autonomous behavior. This is actually
built on two relatively simple
mechanisms. The first is the heartbeat,
a timer which is defaulted to every 30
minutes that fires a standard prompt
telling the agent to recall
heartbeat.mmd and follow its
instructions. The key insight here is
that the agent itself can write to
heartbeat.mmd. So it effectively
programs its own future behavior. On top
of this, there are cron jobs which are
scheduled tasks the agent can create,
modify and delete using the cron tool
with full cron expressions, one-time
triggers or intervals. The other method
is web hooks which are external events
that wake the agent with context about
the trigger. Something happens, the
model wakes up with that context and it
acts on it. And there it is. For the
most part, this is the entire
architecture behind OpenClaw. Most of
these are wired to the model through
various scripts and hooks. Obviously,
there are features and methodologies
that I can't explain in just a simple
video like this, such as multi-agent
methods, hooks, sandboxing, and more.
And you can notice a distinct pattern
here. Agents are comprised of four
categories. What triggers the agent,
what is injected on every turn, what
tools it can call, and what it outputs.
And the final gem is giving the agents
the ability to run in a loop. The LLM
calls a tool, gets feedback on the tool,
and decides the next step. Putting all
of this together is what we call a
harness. And learning to build your own
is one of the highest levered skills
going into the next decade. Think of it
like the newer version of coding. And
all you need to start are these four
categories of model behavior.
Let's break down each zone. First, what
triggers the LLM? Cron, heartbeats,
these are all core methods to wake
OpenClaw or the LLM up. Next, what gets
injected into the LLM's context on every
turn? The system prompt with soul.md in
this case, and personality files, JSON L
conversation history tool schemas.
Basically, you want to give the model
just enough information to operate well
without giving it too much information
and causing context rot. Then we have
the tools the LLM can actually call.
Examples might be memory retrieval via
rag, computer control, skills, plugins,
all executing in a sandboxed environment
with results flowing back to the model
in an agentic loop. This makes your
model an actual agent. And finally, what
the agent outputs or writes. This
answers the question of what the agent
can actually do and how it can actually
communicate with the world and remember.
So, OpenClaw is genuinely impressive,
but the truth is that it's a generalist
model. The OpenClaw architecture is just
a roundabout way of giving one agent the
power to do many, many things. As a
result, the context given to OpenClaw is
consistently overkill for a given task.
and they have to jump through many many
hoops. This is the core reason that
OpenClaw doesn't perform as well as you
would like and tends to be very very
expensive. On top of all of this,
Anthropic banned OpenClaw use on max
plans. OpenClaw has a massive security
vulnerability
and OpenClaw is very different, very
difficult to truly customize and see
what's going on inside. For this reason,
I implore you to build your own version
that serves just one purpose. I call
these sniper agents. Let me give you a
motivating example of open claused
context. So, on day one, you're looking
at about 7,000 tokens of fixed overhead
before you even say a word. This is
honestly impressively low. And
typically, it's comprised of the soul.md
agents, workspace files, skill
descriptions, tool schemas, and more.
But here's where things get bad. After a
month of daily use, memory files grow.
The agent creates skills. Every session
reset, saves a summary file, and more.
The more skills and plugins you install,
the more bloated things get. At this
point, you're looking at around 45,000
tokens of fixed overhead before even
sending a single prompt. based on the
results of the paper measuring context
rot. This results in up to a 40%
performance decrease in your model. But
look at a singlepurpose email reader
agent. It only needs about 1,400 tokens
and it works like a charm. After 6
months, the workspace files cap at
around 37,000 tokens, the skills cap at
7,500 tokens, and tools can get infinite
bloat, resulting in tens of thousands of
tokens. At those token counts, you're
looking at at least 50 to 90%
performance decreases and about 52 cents
of extra usage per message sent. And by
the way, once again, this is before you
even send your first message. That will
result in almost instant compaction.
So on top of all of this, OpenClaw has
hard limits on certain functionalities
such as memory, heartbeat, skills, and
more. This causes absolutely
catastrophic forgetting after months of
daily use. The model will forget what
you told it because open claw is
directly preventing those memories from
getting into context. It does this by
very basic truncation.
So something I want you to take away is
that learning how agents and harnesses
will work will allow insane performance
gains above a one-sizefits-all model.
And at very least, understanding
OpenClaw's internal mechanisms will
allow you to prevent context rot by
maintaining good context hygiene and
limiting plug-in usage. If you want to
learn more about how to build agents
that will streamline your work or help
you build your dream app, join my free
school community. It is the number one
agentic coding community on school.
Thanks for watching and I'll
Ask follow-up questions or revisit key timestamps.
The video explains OpenClaw, an AI agent that has gained significant attention for its capabilities, with some even considering it Artificial General Intelligence (AGI). It details OpenClaw's architecture, which involves a Large Language Model (LLM) enhanced with tools and a "personality" through system prompts. Key components include session persistence via JSONL files, a compaction system to manage context window limitations, and memory management through files and a RAG-style retrieval system. OpenClaw's ability to perform complex tasks is enabled by an "exoskeleton" of tools, including computer control via a Chrome extension, allowing it to interact with browsers, terminals, and other applications. Autonomous behavior is achieved through mechanisms like a "heartbeat" timer and cron jobs, enabling the agent to self-program its future actions. The video highlights four core categories for agent behavior: triggers, context injection, tools, and outputs. It also contrasts the generalist nature of OpenClaw with specialized "sniper agents," arguing that single-purpose agents offer better performance, reduced costs, and easier customization due to significantly lower overhead and less "context rot." The presenter encourages viewers to build their own agents for specific tasks to achieve higher performance and efficiency.
Videos recently processed by our community