Ralph-loop 2.0? The real autonomous coder is coming...

Transcript

0:00

So OpenAI released this goal feature in

0:02

Codex, which allows an agent to

0:03

work continuously for hours and hours

0:06

on bigger and more complex projects. And we

0:08

already saw people getting pretty

0:09

phenomenal results, from one prompt to

0:11

generate a fully functional game, to

0:13

building and testing an iOS app by

0:15

itself for 6 hours. And key developers

0:18

from the Codex team also mentioned this is

0:20

probably the most consequential thing

0:21

they have shipped in Codex this year.

0:23

And very quickly, Hermes Agent also

0:25

released this persistent goals feature,

0:27

which is a similar type of feature that

0:28

allows you to set a standing goal for

0:30

Hermes Agent to work across turns until

0:33

it is achieved. So how exactly does this

0:35

goal feature work, and what are the best

0:37

practices for actually using it? So what

0:39

is the problem this goal feature actually

0:40

tries to solve? So even though models have

0:42

gotten significantly better, to where they

0:44

can consistently finish well-scoped

0:46

tickets, as we let them take over

0:48

more and more complex projects, you

0:49

have probably experienced yourself that the

0:51

model can sometimes get lazy and declare

0:53

victory too early. For example, you

0:55

might ask the agent to just fix all the

0:57

failing tests in your repo, and most

0:59

likely the agent will do some work, maybe for

1:01

10 or 15 minutes and then come back

1:03

saying it fixed all the issues, which

1:05

most likely is not exhaustive. You need

1:07

to prompt the agent again, saying, "Hey,

1:09

X, Y, and Z are still not done." And you need to

1:11

keep doing this for some complex tasks.

1:13

And there was a very popular project

1:15

earlier this year called Ralph loop,

1:17

which is basically running your coding

1:18

agent in a for loop, where every time the

1:20

agent finishes, it writes the output of

1:22

its work to the file system and

1:24

programmatically triggers the coding

1:26

agent again. And very quickly it was

1:27

implemented in Claude Code through a

1:29

plugin as well. But fundamentally, Ralph

1:31

loop is a pretty simple, or kind of

1:34

dumb, programmatic loop. You just run

1:36

Claude with a specific prompt in a

1:38

while loop and define the maximum

1:39

iterations. And this goal feature in

1:41

Codex and Hermes is an evolution of this

1:43

Ralph loop, where it is no longer doing a

1:45

simple, dumb programmatic loop. Instead,

1:47

it uses a large language model to decide

1:49

and judge whether the task has been

1:51

satisfied. So when the user sends a goal

1:53

command, it will trigger the agent to do

1:55

the work. And once it finishes, there will be

1:57

one large language model call to decide

1:59

whether the goal has been satisfied. If

2:01

yes, then finish the session, but if no,

2:03

it will trigger the agent again with

2:04

some prompts. And this means that instead of the

2:06

agent prematurely claiming the task has

2:09

been done, this large

2:10

language model call will identify and catch

2:12

that scenario and guide the agent to keep

2:15

working until the goal is actually

2:17

satisfied. Meanwhile, it also makes it

2:19

good at handling ambiguous tasks. So,

2:21

instead of a list of tasks that are well

2:23

scoped out, it can handle things like

2:25

just cut a Docker image size by 60%,

2:28

where at the beginning you probably don't

2:29

know exactly how to do this. We can get the

2:31

agent to start exploring, trying out

2:33

multiple different methods and approaches,

2:35

and adding improvements step by

2:37

step. This is somewhat similar

2:39

to the other very popular project

2:41

earlier this year from Andrej Karpathy,

2:43

autoresearch, where fundamentally it's

2:45

also a for loop that keeps the agent

2:47

working continuously and saves the resulting

2:49

state. So, this goal feature is really

2:52

good for this type of complex,

2:53

long-running coding work, like code

2:55

migrations and large refactors, as well as

2:57

ambiguous goals like experiments, where

3:00

the agent can keep making scoped progress.

3:02

So, compared with the original Ralph loop,

3:04

the stop condition, instead of being a

3:06

manually set programmatic limit,

3:08

uses a large language model to judge

3:09

and decide whether the goal is finished.
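
The difference between the two stop conditions described above can be sketched in a few lines of Python. This is only an illustration, not the actual Codex or Hermes implementation: `run_agent`, `judge`, `Verdict`, the fake agent, and the `max_turns` safety cap are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Hypothetical judge output: a status plus the reasoning behind it."""
    done: bool
    reasoning: str


def ralph_loop(run_agent: Callable[[str], str], prompt: str,
               max_iterations: int) -> list[str]:
    """Plain programmatic loop: the same prompt every time, stops only at the cap."""
    return [run_agent(prompt) for _ in range(max_iterations)]


def goal_loop(run_agent: Callable[[str], str],
              judge: Callable[[str, str], Verdict],
              goal: str, max_turns: int = 50) -> tuple[bool, list[str]]:
    """Judge-gated loop: a separate call decides after each turn whether to stop."""
    outputs: list[str] = []
    prompt = goal
    for _ in range(max_turns):  # safety cap, an assumption of this sketch
        response = run_agent(prompt)
        outputs.append(response)
        verdict = judge(goal, response)
        if verdict.done:
            return True, outputs
        # Not done: re-trigger the agent with the goal and the judge's feedback.
        prompt = (f"Continue working toward the goal: {goal}\n"
                  f"Judge feedback: {verdict.reasoning}")
    return False, outputs


def make_fake_agent(failing_tests: int) -> Callable[[str], str]:
    """Stand-in for a coding agent: fixes one failing test per turn."""
    state = {"remaining": failing_tests}

    def run_agent(prompt: str) -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
        return f"remaining failures: {state['remaining']}"

    return run_agent


def fake_judge(goal: str, response: str) -> Verdict:
    """Stand-in for the LLM judge: done only when no failures remain."""
    done = response.endswith(": 0")
    return Verdict(done, "all tests pass" if done else response)
```

With the fake agent starting at three failing tests, `goal_loop(make_fake_agent(3), fake_judge, "fix all failing tests")` stops after three turns because the judge reports done, while `ralph_loop(make_fake_agent(3), "fix all failing tests", 2)` stops after two turns with one failure still open.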

3:11

And when a new loop starts, instead of

3:13

feeding the identical prompt.md into the

3:16

agent loop, this goal feature has a

3:17

continuation prompt with the goal context as

3:20

well as the state. And the way it works,

3:22

as we mentioned before, is this loop

3:23

that is running. And once finished, it

3:25

will send the result to this large language model

3:27

call with a prompt that defines the

3:29

definition of done, the output format,

3:31

as well as the goal and the response. And

3:33

this large language model call can

3:34

output a status as well as its reasoning. If

3:36

it judges the goal is not finished, then

3:38

a message will be sent back to the

3:40

agent. You will see a status like this,

3:42

but in reality, the agent receives a

3:44

message like "Continuing toward your

3:45

standing goal," which lists the goal

3:47

file and a special prompt to

3:49

keep working toward this goal:

3:51

"Take the next concrete step. If you

3:52

believe the goal is complete, state

3:54

so explicitly and stop." And the Codex

3:56

continuation prompt is a little bit more

3:58

sophisticated, where it has things like

4:00

"do not accept proxy signals as completion

4:02

by themselves." It only marks the goal

4:04

achieved when the audit shows the

4:06

objective has actually been achieved and no

4:09

required work remains. Then it uses this

4:11

update goal call with the status complete. So

4:13

the Codex goal feature actually asks the agent to

4:15

mark this goal as complete by itself,

4:17

whereas Hermes Agent has a separate

4:19

large language model call judge the result. And to

4:21

activate this goal feature in Codex, you

4:24

can run codex features list. This will

4:27

list out all the experimental features

4:29

that Codex has, and you will see the goals

4:30

feature here. Each one is labeled as stable

4:33

or under development and whether it has

4:35

been toggled on or not. And then you can

4:37

just run codex features enable goals. You

4:40

should see a success message here. Now,

4:42

if you run Codex and type /goal,

4:45

you should see this goal command. And

4:46

then you just type out the goal. Here:

4:48

help me migrate my codebase from

4:50

JavaScript to TypeScript, making sure

4:52

all screens stay exactly the same visually,

4:54

using Playwright interactive to verify

4:56

the output. And once you send it, you will

4:58

see this goal active message. And

5:00

then the agent will start working. And while

5:02

it is working, you can always just run

5:04

this goal command again to check the

5:07

status of the goal in terms of how long

5:09

it has been running and how many tokens

5:11

have been used. You can also run goal

5:13

pause or goal clear to stop the work

5:15

anytime. Meanwhile, if you want to,

5:17

let's say, ask it some questions or

5:19

branch out a conversation while it is

5:21

working, you can also run the side

5:23

command, which will basically fork the

5:25

conversation from this point. With this,

5:27

I've been using Codex to run some

5:29

migration work for 9 hours overnight and

5:31

it is still going. But there are some

5:33

rules you should put in place to

5:35

make sure the goal actually works. So

5:36

the command prompt will look something

5:38

like this, like complete a certain

5:40

objective without stopping until

5:42

a verifiable end state. So a good goal

5:45

prompt should be bigger than one prompt

5:47

but smaller than an open-ended backlog. You

5:49

should define what Codex should achieve,

5:51

what it should not change, how it should

5:53

validate progress, and when it should

5:55

stop. And the most important part is

5:56

that Codex should know what done means

5:59

before it starts. For example, if you

6:00

ask it to do a migration task, the goal

6:03

should look like: migrate this project

6:04

from the legacy stack to a new stack,

6:07

making sure all screens stay exactly

6:09

the same visually, and using Playwright

6:10

interactive to verify the output. So,

6:13

it can make scoped progress

6:14

step by step. And if it's the first time,

6:16

then you should probably point it to a

6:18

plan.md file or PRD file, creating tasks

6:21

for each milestone and verifying the output

6:23

with Playwright interactive. Even

6:25

include reference screens as needed, so

6:26

it can verify whether the game UI looks

6:29

exactly the same as the design. And if you have

6:30

an evaluation set, you can even run it like

6:33

an autoresearch loop. Give it a goal:

6:35

optimize the prompts in the prompt file

6:37

until the eval suite reaches your target

6:39

score. And after each change, run the

6:41

eval command, inspect the failing cases,

6:44

and keep the prompt minimal and

6:46

targeted. Stop when the target is met.
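
The four parts of a goal prompt described above (what to achieve, what not to change, how to validate progress, and when to stop) can be captured in a small structure that refuses to render without a stop condition. This is a minimal sketch under assumptions: `GoalSpec`, its field names, and the "do not change" entry in the example are hypothetical and not part of Codex or any tool mentioned in the video.

```python
from dataclasses import dataclass


@dataclass
class GoalSpec:
    """Hypothetical container for the four parts of a goal prompt."""
    objective: str
    do_not_change: list[str]
    validation: list[str]
    stop_condition: str

    def render(self) -> str:
        """Build the prompt text; a missing stop condition is treated as an error."""
        if not self.stop_condition.strip():
            raise ValueError("a goal needs an explicit stop condition")
        lines = [f"Objective: {self.objective}", "Do not change:"]
        lines += [f"- {item}" for item in self.do_not_change]
        lines.append("Validate progress by:")
        lines += [f"- {item}" for item in self.validation]
        lines.append(f"Stop when: {self.stop_condition}")
        return "\n".join(lines)


# Example based on the eval-driven goal described in the transcript.
# The "do not change" entry is an illustrative assumption.
EVAL_GOAL = GoalSpec(
    objective="Optimize the prompts in the prompt file",
    do_not_change=["the eval suite itself"],
    validation=[
        "run the eval command after each change",
        "inspect the failing cases",
        "keep the prompt minimal and targeted",
    ],
    stop_condition="the eval suite reaches the target score",
)

if __name__ == "__main__":
    print(EVAL_GOAL.render())
```

The rendered text could then be pasted after the goal command or saved to a file the goal points at; the point of the structure is only that "done" is written down before the agent starts.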

6:48

So, what you see here is that you have

6:50

to define very explicitly what done

6:52

and finished mean, so the agent will know

6:54

when to stop. Because without defining

6:56

that, what you see is that the agent will

6:58

again just get lazy and decide, "Okay,

7:00

this task is done," and finish

7:02

very quickly. And there are some similar

7:04

learnings from Vincent, who is one of

7:05

the maintainers of OpenClaw. He has

7:07

been running goal for 3 days on OpenClaw

7:09

across 13 rounds, a gazillion tokens, and

7:11

many, many PRs. One of the learnings he

7:13

had is that you should spend time to

7:15

align with the agent early on. Most of the time, if

7:18

you simply paste a prompt and ask

7:20

it to go, it is likely to give you garbage.

7:22

So, instead, he has a conversation

7:24

with the agent about all the context, like

7:26

what this project is, what are the

7:28

things he cares about, what bad looks

7:29

like for the user, what he already tried and

7:32

ruled out, and the kind of bugs he keeps

7:34

missing. And he asks the model to ask

7:35

anything before it starts. So, the initial

7:37

interview alignment conversation is very

7:39

critical here. Meanwhile, he will also

7:41

try to quantify done, as we mentioned

7:43

before. So, you should not prompt it with

7:45

something like "keep going until

7:47

everything is fixed." Once the definition

7:48

of done is fuzzy like this, the model

7:50

will either quit too early or spiral

7:53

into nonsense. You have to give it some

7:55

quantifiable number. And here's one

7:57

example QA prompt that he has been

7:59

using. It defines the pass and has

8:01

a very clear stop condition, which is

8:04

once it has found 20 discrete new issues.

8:06

And for each issue, produce a repro,

8:08

propose a fix, push the fix to a branch as

8:10

you go, and log the result to the run

8:13

folder. Or this prompt he has been

8:14

using for new projects, like we are

8:16

building X, with reference implementations in

8:18

different repos, which also points to a

8:21

list of files including anti-patterns,

8:23

logs, and design patterns you want it to

8:25

follow, as well as what my user would

8:27

expect. So, here you can see that even

8:29

for a new project, it's very important to

8:31

list out the expected criteria and

8:33

behavior, rather than a loose goal like

8:35

"optimize or improve our app UX." So,

8:38

having a good goal prompt is quite

8:40

critical to whether you get good

8:41

results or not. And there's an open

8:43

source project called Goal Buddy, which is

8:45

basically a skill that helps you

8:47

construct a good prompt. The way it

8:48

works is that you can run Goal Buddy with npx,

8:51

and then run Codex. If you type a dollar

8:53

sign, it will list out all skills and

8:55

plugins. You can type in goal prep. This

8:57

will load up a workflow that gets

8:59

Codex to interview you and

9:01

construct a good prompt folder. So,

9:03

here I gave a pretty vague goal,

9:05

like building a rating type game using

9:07

image gen for image assets and beautiful

9:09

graphics, verified on desktop. And then

9:11

Goal Buddy will start creating some files.

9:13

One is this goal.md file. This will turn

9:16

your goal into a well-written md file

9:19

that clearly describes the requests, the

9:21

constraints, the stop rules, and the detailed

9:23

loop. It also has this state.yaml file

9:25

that lists the tasks based on

9:27

the goal. So, instead of running goal

9:29

with your raw prompt, you can type /goal

9:31

and point to this goal.md file. And then

9:33

Codex will start working on the tasks,

9:35

updating the state.yaml file to keep a record

9:38

and referencing the goal.md file on every

9:40

single loop. With this, in just one

9:42

single prompt, the agent is able to generate

9:44

image assets for the game and stitch

9:46

together a fully functional game. So,

9:48

this is the Codex and Hermes Agent goal

9:49

feature. It is really good for those

9:51

complex coding work that will require

9:53

not just minutes but hours to complete.

9:55

And it addresses some models' default

9:57

behavior, where they stop work

9:58

prematurely. However, this feature still

10:01

has limitations based on my testing. For

10:03

example, it's mainly designed for longer

10:05

coding sessions that run for hours. But

10:07

if there are things that you want to

10:09

do for weeks or months, like improve

10:11

your SEO or GEO strategy, optimize

10:13

return on ad spend, it doesn't quite

10:15

work, especially in scenarios that don't

10:17

have immediate verifiable results or

10:19

feedback. And my team has been

10:21

experimenting with this concept called

10:22

mission. The way this mission works is

10:24

actually quite straightforward. We'll

10:26

basically capture those long-running

10:28

goals or missions into a mission.md that

10:30

clearly defines the metrics to optimize,

10:32

and trigger an agent run, where the agent

10:34

will try to form a hypothesis about a

10:36

few strategies it should try to complete

10:38

the mission, do one step, and output its

10:41

work as artifacts. And at the end of

10:43

that, instead of just continuing to run a for

10:45

loop, it will schedule the next run. It could

10:47

be in hours or days or even weeks,

10:49

depending on the situation. And every time

10:51

the next run is triggered, the new

10:52

session will receive the mission.md as

10:55

well as a summary of the previous steps. So,

10:56

it can always iterate and improve and

10:58

learn from previous steps. And for these

11:00

types of really long-running missions, we

11:02

also found it's quite useful to have

11:04

a human-in-the-loop experience. If

11:06

the agent wants to try something

11:08

really dramatic, or realizes a goal or

11:10

mission is unclear or not verifiable, it

11:13

can send a message to the human and change the

11:15

mission status. We've been experimenting

11:17

with long-running missions like "grow

11:19

Twitter followers to 10,000," letting the agent

11:21

iteratively take actions and run

11:23

experiments. And at each step, it can

11:25

output artifacts like a certain type of

11:28

post, as well as a specific analysis

11:30

report, and based on that, form a certain

11:32

hypothesis and schedule the next action. It

11:35

has already delivered some quite interesting

11:37

results. Like for Coolest own Twitter

11:38

account, it initially made a first tweet

11:40

like this, which got kind of average

11:42

performance. But then based on the

11:44

observation, it decided that for the next tweet we

11:46

should probably do a strat and use kind

11:48

of a founder voice. And the next one

11:50

immediately got pretty good performance.

11:51

And based on that observation, it

11:53

decided to double down on this type of

11:55

content and post the next one, which is

11:56

not explosive, but already the

11:58

performance is much higher than the original

12:00

baseline. And we found the same type of

12:02

setup can be used for things like optimizing

12:04

ad campaigns, SEO,

12:06

and even product growth. We're currently

12:08

in closed beta, so if you're

12:09

interested, you can join the early access

12:11

program. I have put the link in the

12:12

description below for you to join. I

12:14

hope you enjoyed this video. Thank you

12:15

and I'll see you next time.

Interactive Summary

This video explores the new 'Goal' feature in AI coding agents like Codex and Hermes. It explains how these features allow agents to work continuously on complex, long-running projects by using an LLM to verify task completion, preventing premature termination. The video details best practices for prompt engineering to define clear goals and stop conditions, introduces the 'Goal Buddy' tool for creating effective goal documentation, and discusses the limitations of current agents for long-term missions, while introducing a 'mission' framework for extended, multi-stage, human-in-the-loop projects.