Ralph-loop 2.0? The real autonomous coder is coming...
So OpenAI released this goal feature in
Codex, which allows the agent to keep
working for hours and hours on bigger
and more complex projects. And we have
already seen people getting pretty
phenomenal results, from one prompt that
generates a fully functional game to an
agent building and testing an iOS app by
itself for 6 hours. And key developers
from the Codex team also mentioned this
is probably the most consequential thing
they have shipped in Codex this year.
And very quickly, Hermes Agent also
released its persistent goal feature,
a similar type of feature that allows
you to set a standing goal for Hermes
Agent to work on across turns until it
is achieved. So how exactly does this
goal feature work, and what are the best
practices for actually using it? So what
problem is this goal feature actually
trying to solve? Even though models have
gotten significantly better, to the
point where they can consistently finish
well-scoped tickets, as we let them take
over more and more complex projects, you
have probably experienced yourself that
the model can sometimes get lazy and
declare victory too early. For example,
you might ask the agent to just fix all
the failing tests in your repo, and most
likely the agent will do some work,
maybe for 10 or 15 minutes, and then
come back saying it fixed all the
issues, which most likely is not
exhaustive. You need to prompt the agent
again, saying, "Hey, XYZ are still not
done." And you need to
keep doing this for some complex tasks.
And there was a very popular project
earlier this year called the Ralph loop,
which basically runs your coding agent
in a for loop: every time the agent
finishes, it writes the output of its
work to the file system and
programmatically triggers the coding
agent again. And very quickly it was
implemented in Claude Code through a
plugin as well. But fundamentally the
Ralph loop is a pretty simple, kind of
dumb programmatic loop. You just run
Claude with a specific prompt in a
while loop and define the maximum
iterations. And this goal feature in
Codex and Hermes is an evolution of this
Ralph loop, where it is no longer doing
a simple dumb programmatic loop. Instead
it uses a large language model to decide
and judge whether the task has been
satisfied. So when the user sends out a
goal command, it triggers the agent to
do things. And once the agent finishes,
there is one large language model call
to decide whether the goal has been
satisfied. If yes, the session finishes,
but if no, it triggers the agent again
with some prompts. This means that when
the agent prematurely claims the task is
done, this large language model call can
identify and capture that scenario and
guide the agent to keep working until
the goal is actually satisfied.
Meanwhile, it also makes the agent
good at handling ambiguous tasks. So,
instead of a list of tasks that are well
scoped out, it can handle things like
"just cut the Docker image size by 60%",
where at the beginning you probably
don't know exactly how to do this. We
can get the agent to start exploring,
trying out multiple different methods
and approaches, and adding improvements
step by step. This is somewhat similar
to another very popular project from
earlier this year, Andrej Karpathy's
autoresearch, which fundamentally is
also a for loop that keeps the agent
working and saves the resulting state.
So this goal feature is really good for
this type of complex, long-running
coding work like code migrations and
large refactors, as well as
ambiguous goals like experiments, where
the agent can keep making scoped progress.
So, compared with the original Ralph
loop, the stop condition is no longer a
programmatic iteration limit. Instead, a
large language model judges
and decides whether the goal is finished.
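
As a rough sketch of the difference just described, here are the two loop shapes side by side in Python. Everything in it is an assumption made for illustration: run_agent, judge, and Verdict are hypothetical names, not Codex or Hermes APIs, and the toy agent and judge at the bottom stand in for real model calls. The Ralph-style loop also skips the step of writing each result to the file system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    """What the judge returns: a status plus its reasoning."""
    done: bool
    reason: str


def ralph_loop(run_agent: Callable[[str], str], prompt: str,
               max_iterations: int) -> List[str]:
    """Ralph-style loop: the same prompt every time, stopped only by a cap."""
    return [run_agent(prompt) for _ in range(max_iterations)]


def goal_loop(run_agent: Callable[[str], str],
              judge: Callable[[str, str], Verdict],
              goal: str, max_turns: int = 20) -> Tuple[List[str], Verdict]:
    """Judge-gated loop: keep going until the judge says the goal is met."""
    prompt = goal
    history: List[str] = []
    for _ in range(max_turns):
        response = run_agent(prompt)
        history.append(response)
        verdict = judge(goal, response)  # in practice this would be one LLM call
        if verdict.done:
            return history, verdict
        # Not done: re-trigger the agent with the goal and the judge's reasoning.
        prompt = (f"Continuing toward your standing goal: {goal}\n"
                  f"Judge feedback: {verdict.reason}\n"
                  "Take the next concrete step.")
    return history, Verdict(False, "turn budget exhausted")


# Toy stand-ins so the sketch runs without any model: each agent turn
# "fixes" one failing test, and the judge checks that none remain.
remaining = {"failing": 3}


def toy_agent(prompt: str) -> str:
    remaining["failing"] = max(0, remaining["failing"] - 1)
    return f"{remaining['failing']} tests still failing"


def toy_judge(goal: str, response: str) -> Verdict:
    done = response.startswith("0 ")
    return Verdict(done, "all tests pass" if done else response)


history, verdict = goal_loop(toy_agent, toy_judge, "Fix all failing tests")
print(len(history), verdict)
```

The max_turns budget is kept as a safety net even in the judged loop, so a judge that never says done cannot run forever.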
And when a new loop starts, instead of
feeding the identical prompt.md into the
agent loop, this goal feature uses a
continuation prompt that carries the
goal context as well as the state. And
the way it works, as we mentioned
before, is that this loop keeps running.
Once the agent finishes, its response is
sent to this large language model call
with a prompt that defines the
definition of done, the output format,
as well as the goal and the response.
And this large language model call
outputs a status as well as its
reasoning. If it judges that the goal is
not finished, a message is sent back to
the agent. You will see a status like
this, but in reality the agent receives
a message like "Continuing toward your
standing goal" that lists out the goal
and a special prompt: "Continue working
toward this goal. Take the next concrete
step. If you believe the goal is
complete, state so explicitly and stop."
And the Codex
continuation prompt is a little bit more
sophisticated. It has things like "do
not accept proxy signals as completion
by themselves", and it only marks the
goal achieved when the audit shows the
objective has actually been achieved and
no required work remains. Then it uses
this update goal call with the status
complete. So the Codex goal feature
actually asks the agent to mark the goal
as complete by itself, whereas Hermes
Agent has a separate large language
model call to judge the result. And to
activate this goal feature in Codex, you
can run codex features list. This lists
out all the experimental features that
Codex has, and you will see the goals
feature here, with each feature labeled
as stable or under development and
whether it has been toggled on or not.
Then you can just run codex features
enable goals, and you should see a
success message. Now, if you run Codex
and type /goal, you should see this goal
command. Then you just type out the
goal. Here I wrote: help me migrate my
codebase from JavaScript to TypeScript,
making sure all screens stay exactly the
same visually, and use Playwright
interactive to verify the output. Once
you send it, you will receive this goal
active message, and then the agent will
start working. And while
it is working, you can always run this
goal command again to check the status
of the goal, in terms of how long it has
been running and the total number of
tokens used. You can also run /goal
pause or /goal clear to stop the work
anytime. Meanwhile, if you want to,
let's say, ask it some questions or
branch out a conversation while it is
working, you can also run the /side
command, which basically forks the
conversation from this point. With this,
I've been using Codex to run some
migration work for 9 hours overnight,
and it is still going. But there are
some guardrails you should put in place
to make sure the goal actually works. So
the command prompt will look something
like this: complete a certain objective
without stopping until a verifiable end
state. A good goal prompt should be
bigger than one prompt but smaller than
an open-ended backlog. You should define
what Codex should achieve, what it
should not change, how it should
validate progress, and when it should
stop. And the most important part is
that Codex should know what done means
before it starts. For example, if you
ask it to do a migration task, the goal
should look like: migrate this project
from the legacy stack to a new stack,
making sure all screens stay exactly the
same visually, and use Playwright
interactive to verify the output. So,
it can make scoped progress
step by step. And if it is the first
time, you should probably point it to a
plan.md file or PRD file, creating tasks
for each milestone and verifying the
output with Playwright interactive. You
can even include reference screens as
needed, so it can verify whether the
game UI looks exactly the same as the
design. And if you have an evaluation
set, you can even run it like an
autoresearch loop. Give it a goal:
optimize the prompts in the prompt file
until the eval suite reaches your target
score. After each change, run the eval
command, inspect the failing cases, and
keep the prompt changes minimal and
targeted. Stop when the target is met.
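
The eval-driven goal above (change, run the eval, keep what helps, stop at the target) can be sketched as a small loop. This is only an illustration under stated assumptions: optimize_until_target, propose, and evaluate are hypothetical names, and the toy eval suite below simply counts required rules in a prompt in place of a real eval command.

```python
from typing import Callable, Tuple


def optimize_until_target(propose: Callable[[str], str],
                          evaluate: Callable[[str], float],
                          initial: str, target: float,
                          max_rounds: int = 50) -> Tuple[str, float, int]:
    """Change, evaluate, keep improvements, stop once the target is met."""
    best, best_score = initial, evaluate(initial)
    rounds = 0
    while best_score < target and rounds < max_rounds:
        rounds += 1
        candidate = propose(best)      # one small, targeted change
        score = evaluate(candidate)    # stands in for running the eval command
        if score > best_score:         # keep only changes that help
            best, best_score = candidate, score
    return best, best_score, rounds


# Toy eval suite: the score is the fraction of required rules in the prompt.
REQUIRED_RULES = ["cite sources", "answer in JSON", "refuse unsafe requests"]


def toy_evaluate(prompt: str) -> float:
    return sum(rule in prompt for rule in REQUIRED_RULES) / len(REQUIRED_RULES)


def toy_propose(prompt: str) -> str:
    """Add the first missing rule (stands in for the agent's edit)."""
    for rule in REQUIRED_RULES:
        if rule not in prompt:
            return f"{prompt}\n- {rule}"
    return prompt


best_prompt, best_score, rounds = optimize_until_target(
    toy_propose, toy_evaluate, "You are a helpful assistant.", target=1.0)
print(rounds, best_score)
```

The stop rule is a number the loop can check, which is the property the goal prompt needs: the loop ends when the score reaches the target or the round budget runs out, never on a vague sense of being finished.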
So, what you see here is that you have
to define very explicitly what done and
finished mean, so the agent knows when
to stop. Because without defining that,
what you see is that the agent will
again just get lazy, decide, "Okay, this
task is done," and finish very quickly.
And there are some similar learnings
from Vincent, who is one of the
maintainers of OpenClaw. He has been
running /goal for 3 days on OpenClaw
across 13 rounds, a gazillion tokens,
and many, many PRs. One of his learnings
is that you should spend time aligning
with the agent early on. Most of the
time, if you simply paste a prompt and
ask it to do it, it is likely to give
you garbage.
So, instead, he has a conversation
with the agent about all the context:
what this project is, what he cares
about, what bad looks like for the user,
what he has already tried and ruled out,
and the kinds of bugs it keeps missing.
And he asks the model to ask anything
before it starts. So the initial
interview and alignment conversation is
very critical here. Meanwhile, he also
tries to quantify done, as we mentioned
before. You should not prompt it with
something like "keep going until
everything is fixed". Once the
definition of done is fuzzy like this,
the model will either quit too early or
spiral into nonsense. You have to give
it some quantifiable number. And here is
one example QA prompt that he has been
using. It defines the pass and has a
very clear stop condition, which is
once it has found 20 discrete new issues.
And for each issue, it should produce a
repro and a proposed fix, push the fix
to a branch as it goes, and log the
result to the run folder. Another prompt
he has been using is for new projects,
like "we are building X", with reference
implementations in different repos, and
it also points to a list of files
including anti-patterns, logs, and
design patterns you want it to follow,
as well as what the user would expect.
So here you can see that even for a new
project, it is very important to list
out the expected criteria and behavior,
rather than a loose goal like "optimize
or improve our app UX". So having a good
goal prompt is quite critical in
deciding whether you get good
results or not. And there is an open
source project called Goal Buddy, which
is basically a skill that helps you
construct a good prompt. The way it
works is that you run Goal Buddy with
npx and then run Codex. If you type $,
it lists out all skills and plugins, and
you can type in goal prep. This loads a
workflow that triggers Codex to
interview you and construct a good
prompt folder. So, even though I gave a
pretty vague goal, like building a
rating type game using image gen for
image assets, with beautiful graphics,
verified on desktop,
Goal Buddy starts creating some files.
One is this goal.md file, which turns
your goal into a well-written Markdown
file that clearly describes the request,
the constraints, the stop rules, and the
detailed loop. There is also this
state.yaml file that lists out the tasks
based on the goal. So, instead of
running /goal and pasting your prompt,
you can run /goal and point it to this
goal.md file. Then Codex will start
working on the tasks, updating the
state.yaml file to keep a record and
referring back to the goal.md file on
every single loop. With this, from just
one single prompt, the agent is able to
generate image assets for the game and
stitch together a fully functional game.
So, this is the goal feature in Codex
and Hermes Agent. It is really good for
complex coding work that requires
not just minutes but hours to complete.
And it addresses some models' default
behavior of stopping prematurely.
However, this feature still has
limitations, based on my testing. For
example, it is mainly designed for
longer coding sessions that run for
hours. But if there are things that you
want to do over weeks or months, like
improving your SEO or GEO strategy or
optimizing return on ad spend, it
doesn't quite work, especially in
scenarios that don't have immediately
verifiable results or feedback. And my
team has been experimenting with this
concept called
mission. The way this mission works is
actually quite straightforward. We
basically capture those long-running
goals or missions in a mission.md that
clearly defines the metrics to optimize,
and that triggers an agent run where the
agent tries to form hypotheses about a
few strategies it should try in order to
complete the mission, does one step, and
outputs its work as artifacts. At the
end of that, instead of just continuing
to run a for loop, it schedules the next
run, which could be in hours, days, or
even weeks, depending on the situation.
Every time the next run is triggered,
the new session receives the mission.md
as well as a summary of the previous
steps, so it can always iterate,
improve, and learn from previous steps.
And for these really long-running
missions, we also found it quite useful
to have a human in the loop. If the
agent wants to try something really
dramatic, or realizes that a goal or
mission is unclear or not verifiable, it
can send a message to a human and change
the mission status. We've been experimenting
with long-running missions like "grow
Twitter followers to 10,000", letting
the agent iteratively take actions and
run experiments. At each step, it can
output artifacts like a certain type of
post as well as a specific analysis
report, and based on that, form a
certain hypothesis and schedule the next
action. It has already delivered some
quite interesting results. Like for
Coolest own Twitter account, it
initially made a first tweet like this,
which got kind of average performance.
But then, based on that observation, it
decided that for the next tweet we
should probably do a strat and use kind
of a founder voice. And the next one
immediately got pretty good performance.
And based on that observation, it
decided to double down on this type of
content and post the next one, which is
not explosive, but its performance is
already much higher than the original
baseline. And we found the same type of
setup can be used for optimizing ad
campaigns, SEO, and even product growth.
We're currently in closed beta, so if
you're interested, you can join the
early access program. I have put the
link in the description below for you to
join. I hope you enjoyed this video.
Thank you and I'll see you next time.
This video explores the new 'Goal' feature in AI coding agents like Codex and Hermes. It explains how these features allow agents to work continuously on complex, long-running projects by using an LLM to verify task completion, preventing premature termination. The video details best practices for prompt engineering to define clear goals and stop conditions, introduces the 'Goal Buddy' tool for creating effective goal documentation, and discusses the limitations of current agents for long-term missions, while introducing a 'mission' framework for extended, multi-stage, human-in-the-loop projects.