How OpenClaw Works (and why you should build your own)

Watch on YouTube

Now Playing

Transcript

264 segments

0:00

OpenClaw is an AI agent that's taken the

0:03

world by storm. So much so that people

0:05

are calling it AGI. In this video, I'll

0:08

show you exactly how it works under the

0:10

hood and the fundamentals that let you

0:12

build your own agents that outperform it

0:14

5 to 10x on your specific use case. I'm

0:18

Roman. I published a top 3% paper at

0:20

Nurips, the largest AI conference in the

0:23

world. Now I'm on a mission to become

0:25

the best AI coder. So what is openclaw

0:29

really? At its core, it's a set of

0:31

building blocks around an LLM that gives

0:33

it the ability to perform a variety of

0:36

tasks. Think of it like an exoskeleton

0:39

built around the LLM, giving it the

0:41

capacity to perform complex tasks on a

0:44

computer.

0:45

So here's a motivating example of what

0:47

makes OpenClaw feel so special. Your

0:50

agent wakes up on its own, opens up your

0:52

browser. While logged into your account,

0:54

it starts scouring X for the latest AI

0:56

discourse. It reads through the post,

0:58

pulls out the key takeaways, and shoots

1:00

you a message. Here's your morning

1:02

briefing. All of this happens completely

1:05

autonomously without you lifting a

1:07

finger. But this isn't magic, and it

1:10

actually is relatively simple to

1:13

understand how OpenClaw actually works.

1:17

First, we start with an LLM, which would

1:19

typically be a very simple external API

1:22

call or local model. This is the entire

1:25

brains of the operation. But if we want

1:28

to talk to it from a chat interface,

1:30

especially our phone, we can route

1:32

Telegram or other channels to it. So, we

1:35

connect to a gateway, which is typically

1:37

a websocket and HTTP server, and it runs

1:40

24/7, and it ties everything together.

1:44

Since LLMs forget everything between API

1:47

calls, we need session persistence. The

1:50

way chatbots work is they paste the

1:52

entire conversation back into the next

1:55

API call. OpenClaw does this by

1:58

appending every message as a line to a

2:00

JSON L file on disk. On each API call,

2:04

that file is parsed into a messages

2:07

array and passed back to the LLM. But

2:10

long conversations eventually overflow

2:13

the model's context window. When the API

2:15

rejects the request is too large,

2:17

OpenClaw's compaction system kicks in.

2:20

It summarizes each chunk of prior

2:22

messages via the LLM, merges the

2:25

summaries and retries until the context

2:27

is below 50%.

2:30

So now we have set up a basic chatbot.

2:32

Hence, we need to get the model to

2:34

understand that it isn't just a simple

2:36

LLM and that it has tools and a

2:38

personality. The solution here is

2:40

simple. We give the model a system

2:42

prompt which is a set of markdown files

2:45

which tell it how to work in the

2:47

openclaw harness. These include soul

2:50

agents and memory. On top of that, we

2:53

give the LLM skills metadata and tool

2:55

schemas so that it knows which tool it

2:57

can actually call without giving it the

2:59

entire tool or skill into its context.

3:02

Finally, we also inject some safety and

3:05

runtime prompts that help the model

3:06

operate safely.

3:09

And to make the chatbot remember who we

3:11

are and previous conversations, we allow

3:14

the model to write to a previously

3:15

mentioned memory.mmd file for critical

3:18

information. And openclot also adds in a

3:21

rag style memory which uses a hybrid

3:24

retrieval system in order to store tons

3:26

of previous conversations and nuggets.

3:29

It then allows the model to call a

3:31

memory tool in order to search the

3:33

memory database for relevant details.

3:36

We also have an output function which is

3:38

just where the LLM talks to us. It might

3:41

output to telegram, discord or somewhere

3:43

else. Since we have provided the model

3:45

with an identity, we have to actually

3:47

tell it which actions it can take and

3:50

how. This is the exoskeleton. We call

3:52

these tool calls. The model can output

3:55

some tokens which calls the tool. The

3:57

tool triggers an external action to

3:59

occur. For example, writing code in the

4:02

computer and the tool returns tokens

4:04

back to the model. and a feedback loop

4:06

is hence created. We call this an

4:09

agentic loop. Your OpenClaw is now an

4:12

agent. One of the most critical tools

4:14

that makes OpenCloud different from most

4:16

agents is that of computer control.

4:18

OpenClaw controls your browser via a

4:21

Chrome extension relay similar to Claude

4:23

browser. And this means OpenClaw can

4:26

stay logged in. Also, it's not just

4:29

browser access. It's full-on computer

4:31

control with access to everything from

4:33

the terminal to the camera. Obviously,

4:36

this level of access comes with heavy

4:38

security trade-offs. So, please use your

4:40

at your own risk and with heavy

4:42

guardrails. But the big thing that

4:44

OpenCloud gets so much praise for is its

4:46

autonomous behavior. This is actually

4:49

built on two relatively simple

4:50

mechanisms. The first is the heartbeat,

4:54

a timer which is defaulted to every 30

4:56

minutes that fires a standard prompt

4:58

telling the agent to recall

5:00

heartbeat.mmd and follow its

5:02

instructions. The key insight here is

5:04

that the agent itself can write to

5:06

heartbeat.mmd. So it effectively

5:09

programs its own future behavior. On top

5:12

of this, there are cron jobs which are

5:14

scheduled tasks the agent can create,

5:16

modify and delete using the cron tool

5:19

with full cron expressions, one-time

5:21

triggers or intervals. The other method

5:24

is web hooks which are external events

5:26

that wake the agent with context about

5:28

the trigger. Something happens, the

5:30

model wakes up with that context and it

5:33

acts on it. And there it is. For the

5:36

most part, this is the entire

5:37

architecture behind OpenClaw. Most of

5:40

these are wired to the model through

5:42

various scripts and hooks. Obviously,

5:44

there are features and methodologies

5:46

that I can't explain in just a simple

5:48

video like this, such as multi-agent

5:50

methods, hooks, sandboxing, and more.

5:53

And you can notice a distinct pattern

5:55

here. Agents are comprised of four

5:59

categories. What triggers the agent,

6:01

what is injected on every turn, what

6:04

tools it can call, and what it outputs.

6:07

And the final gem is giving the agents

6:10

the ability to run in a loop. The LLM

6:13

calls a tool, gets feedback on the tool,

6:15

and decides the next step. Putting all

6:18

of this together is what we call a

6:19

harness. And learning to build your own

6:21

is one of the highest levered skills

6:23

going into the next decade. Think of it

6:25

like the newer version of coding. And

6:28

all you need to start are these four

6:32

categories of model behavior.

6:35

Let's break down each zone. First, what

6:38

triggers the LLM? Cron, heartbeats,

6:40

these are all core methods to wake

6:43

OpenClaw or the LLM up. Next, what gets

6:47

injected into the LLM's context on every

6:49

turn? The system prompt with soul.md in

6:53

this case, and personality files, JSON L

6:55

conversation history tool schemas.

6:58

Basically, you want to give the model

6:59

just enough information to operate well

7:02

without giving it too much information

7:04

and causing context rot. Then we have

7:07

the tools the LLM can actually call.

7:10

Examples might be memory retrieval via

7:12

rag, computer control, skills, plugins,

7:15

all executing in a sandboxed environment

7:18

with results flowing back to the model

7:20

in an agentic loop. This makes your

7:22

model an actual agent. And finally, what

7:26

the agent outputs or writes. This

7:28

answers the question of what the agent

7:30

can actually do and how it can actually

7:33

communicate with the world and remember.

7:36

So, OpenClaw is genuinely impressive,

7:38

but the truth is that it's a generalist

7:41

model. The OpenClaw architecture is just

7:44

a roundabout way of giving one agent the

7:46

power to do many, many things. As a

7:49

result, the context given to OpenClaw is

7:52

consistently overkill for a given task.

7:54

and they have to jump through many many

7:56

hoops. This is the core reason that

7:58

OpenClaw doesn't perform as well as you

8:00

would like and tends to be very very

8:02

expensive. On top of all of this,

8:05

Anthropic banned OpenClaw use on max

8:08

plans. OpenClaw has a massive security

8:10

vulnerability

8:12

and OpenClaw is very different, very

8:15

difficult to truly customize and see

8:17

what's going on inside. For this reason,

8:20

I implore you to build your own version

8:22

that serves just one purpose. I call

8:26

these sniper agents. Let me give you a

8:29

motivating example of open claused

8:32

context. So, on day one, you're looking

8:34

at about 7,000 tokens of fixed overhead

8:37

before you even say a word. This is

8:39

honestly impressively low. And

8:42

typically, it's comprised of the soul.md

8:44

agents, workspace files, skill

8:46

descriptions, tool schemas, and more.

8:49

But here's where things get bad. After a

8:52

month of daily use, memory files grow.

8:54

The agent creates skills. Every session

8:57

reset, saves a summary file, and more.

8:59

The more skills and plugins you install,

9:02

the more bloated things get. At this

9:04

point, you're looking at around 45,000

9:07

tokens of fixed overhead before even

9:09

sending a single prompt. based on the

9:12

results of the paper measuring context

9:14

rot. This results in up to a 40%

9:18

performance decrease in your model. But

9:20

look at a singlepurpose email reader

9:22

agent. It only needs about 1,400 tokens

9:25

and it works like a charm. After 6

9:28

months, the workspace files cap at

9:30

around 37,000 tokens, the skills cap at

9:34

7,500 tokens, and tools can get infinite

9:37

bloat, resulting in tens of thousands of

9:39

tokens. At those token counts, you're

9:42

looking at at least 50 to 90%

9:44

performance decreases and about 52 cents

9:48

of extra usage per message sent. And by

9:52

the way, once again, this is before you

9:54

even send your first message. That will

9:56

result in almost instant compaction.

10:00

So on top of all of this, OpenClaw has

10:02

hard limits on certain functionalities

10:04

such as memory, heartbeat, skills, and

10:06

more. This causes absolutely

10:09

catastrophic forgetting after months of

10:11

daily use. The model will forget what

10:13

you told it because open claw is

10:16

directly preventing those memories from

10:18

getting into context. It does this by

10:21

very basic truncation.

10:24

So something I want you to take away is

10:26

that learning how agents and harnesses

10:29

will work will allow insane performance

10:32

gains above a one-sizefits-all model.

10:35

And at very least, understanding

10:36

OpenClaw's internal mechanisms will

10:39

allow you to prevent context rot by

10:41

maintaining good context hygiene and

10:43

limiting plug-in usage. If you want to

10:46

learn more about how to build agents

10:48

that will streamline your work or help

10:49

you build your dream app, join my free

10:52

school community. It is the number one

10:54

agentic coding community on school.

10:57

Thanks for watching and I'll

Interactive Summary

Ask follow-up questions or revisit key timestamps.

The video explains OpenClaw, an AI agent that has gained significant attention for its capabilities, with some even considering it Artificial General Intelligence (AGI). It details OpenClaw's architecture, which involves a Large Language Model (LLM) enhanced with tools and a "personality" through system prompts. Key components include session persistence via JSONL files, a compaction system to manage context window limitations, and memory management through files and a RAG-style retrieval system. OpenClaw's ability to perform complex tasks is enabled by an "exoskeleton" of tools, including computer control via a Chrome extension, allowing it to interact with browsers, terminals, and other applications. Autonomous behavior is achieved through mechanisms like a "heartbeat" timer and cron jobs, enabling the agent to self-program its future actions. The video highlights four core categories for agent behavior: triggers, context injection, tools, and outputs. It also contrasts the generalist nature of OpenClaw with specialized "sniper agents," arguing that single-purpose agents offer better performance, reduced costs, and easier customization due to significantly lower overhead and less "context rot." The presenter encourages viewers to build their own agents for specific tasks to achieve higher performance and efficiency.