AI Agent "Vibe Coding" Breakdown: Code, Tests, Quality, Maintainability
Welcome to the Internet of Bugs.
My name is Carl.
I've been a software professional for 36 years
or so.
And today I'm going to talk to you about
agentic coding and "Vibe Coding",
or at least as close to "Vibe Coding"
as I've ever been, or as I ever get.
I have lived through a number of transitions
in the software industry where new automations
had come in, things that used to be done by
hand
that were now done automatically.
A lot of people freaked out, thought their jobs
were going to go away.
So far that's never happened.
Jobs go temporarily, but we always end up
coming out of it
with more programmers than we had before.
I expect AI is not going to be any different.
To me, this is just another way of taking a
bunch
of the boilerplate and taking a bunch of the
tedious work,
automating it, giving it to a tool
that does a better job of it more quickly
and letting us focus on the things
that are higher level and more important.
So to the extent that that's true, I'm all for
it.
To the extent that people think that it's as
good
at writing code as a programmer
that's been doing this for a while,
I'm not so much for it because people make
enough bugs
and AI is way less secure and way less
competent
when it comes to code writing, at least in my
experience.
I don't see that changing anytime soon,
but we'll see what happens.
So let me kind of walk you through what I've
got here.
I've got three different instances of Chrome.
Each one is running with the debug port
activated.
And then each one of these terminals right here
is going to be a different agent command line
tool
that's going to theoretically connect over the
web socket
to the debug port of that browser
and then drive the browser.
And that way we can see what the browser is
doing.
We can see what the agent is doing.
Also, it means that I don't have to teach the
agent
how to do authentication, which is a giant pain
in the rear.
There's been some talk recently,
I'll link the article below.
There's somebody who was talking about using Claude Code
and said the only MCP that they ever use was Playwright.
I have not had great luck with the Playwright MCP.
Periodically, what happens...
I mean, and I've got a fairly specific setup
where I have a page that's already set up
with where I need it to be.
It's got all of the OAuth authentication
already set up.
And it's not easy or maybe even possible for
the AI to do that.
I mean, in order for me to get stuff configured,
I've got my USB key I have to log in with.
OAuth, all kinds of stuff like that.
And it's just not feasible to get the AI to do
that.
So what I need the AI to do instead of
connecting to a page
and logging in and going through all that thing
is I need it to connect to a page
that I've already set up for it
and then to exercise it from there.
It doesn't always do a great job of that.
And so, instead of using the MCP,
which has a lot of things that happen under the
covers,
if I make it write a Python script or a Node
script
or something like that, that exercises the Playwright
API,
if and when it goes wrong,
I can look at the code that it generated
that's trying to use Playwright and go,
"Oh yeah, this is what you screwed up,"
and I can either tell it what to fix or fix it myself.
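For reference, here is a minimal sketch of the kind of script I'm talking about, assuming Playwright for Python is installed and Chrome was started with a remote debugging port. The port number and the button text are placeholders for illustration, not anything from my actual setup.

```python
from playwright.sync_api import sync_playwright

# Attach to a Chrome that is already running with --remote-debugging-port=9222
# (placeholder port), so the existing logged-in session gets reused.
with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    context = browser.contexts[0]   # the profile that is already authenticated
    page = context.pages[0]         # the tab I set up ahead of time
    print(page.title())
    # Hypothetical interaction; the real selector depends on the page.
    page.click("text=Mark stage as complete")
```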
In theory, in "Vibe Coding" land,
you would just ask it over and over
to keep working on it until it finally fixed it.
I don't have time for that.
I don't have the patience for that.
So after it's beat its head against a wall for
a while,
I'll step in and fix it
'cause I get too annoyed otherwise.
This particular setup, the OAuth stuff
is really important for me personally.
So as you can tell, I'm a YouTuber, among other
things.
And a lot of the things that I spend a lot of
time doing
are dealing with YouTube.
And YouTube has an API,
but the API is fairly limited
and a lot of the things that I need to do as a
YouTuber,
I can only do once I'm authenticated with the OAuth
API.
So if I want to write code
to automate stuff that I have to do manually,
which is stuff that I do periodically that
irritates me
and automating things is one of the main
reasons
I got into programming in the first place back
in the early 80s.
I need to figure out how to write code
that can interact with a very complicated OAuth
setup
that YouTube uses.
And me being paranoid, I've got one-time
passwords
and I've got hardware keys and all kinds of
stuff
'cause I don't want people hacking my channel
or whatever.
Well, I don't want people hacking anything.
I'm just paranoid.
And I have good reason to be paranoid.
I've dealt with more hacks over time
than a lot of people have.
Anyway, so what I'm using here,
I am using a site called CodeCrafters.
I've used them before.
I find them to be very useful.
They have these challenges for developers
and these challenges will explain to a
developer
in a step-by-step process
how to build a piece of open-source software.
And it's, they're very interesting.
It's very, very useful as a developer
to understand how the basics of a lot of the
services
that you're working with function.
When you're trying to debug
some really complicated web thing,
there's lots and lots and lots of stuff going
on
and having spent some time
inside the guts of the basics of what a web
server does
and thoroughly understanding that
means that you don't get lost as badly
or end up unable to see the forest for the trees.
When you're looking at all of the stuff that's
going on
'cause you have an understanding of what the
flow is
and you understand how everything is working
and so all of the different log messages
and all that kind of stuff make more sense to
you
and you have kind of a structure to hang all
that stuff on.
So I definitely recommend that folks understand
the basics of whatever tools that you use,
be that web servers,
if you're a web programmer,
which most of us are these days.
There's a project at CodeCrafters
where you can understand Git,
where you can build your own Git server.
Understanding Git is really handy
for not getting yourself in a position
where you end up losing code
or you have to find somebody who's better than
you
to come in and try to untangle everything.
So that's all really useful.
If you wanna try out some of the stuff yourself,
I'll put a link in the show notes
and I'll put it down here.
If you use that link to sign up there,
I may get an affiliate fee,
which kind of helps the channel out.
So if you're gonna go try it anyway,
but I would appreciate it, I think it's useful.
But for these purposes, I'm using it because
one,
it has a similar OAuth setup to YouTube, which
is cool.
And two, it gives me a good benchmark
for what a human I know can do.
And a lot of these tests will say, out of
however many people that have tried this test,
96% have done this stage correctly.
So if the AI can or can't do the stage
correctly,
that gives me some idea of kind of where that
AI falls
in terms of human programmers,
or at least human programmers that are advanced
enough
that they're willing to take that kind of
challenge.
Today, I'm gonna be working on the simplest,
at least for the AI, the simplest challenge
that they have,
which is build an HTTP server in Python.
That is the thing that as near as I can tell
is most represented in the AI's training data.
So this is not really a test and I've had other
videos
where I've run them through this particular
test
and they do pretty well.
So this is not a test yet of how well the AIs
can code.
This is more a test of how well the AIs can be an
agent
and follow directions and act in a loop
and do, with little to no
supervision,
hopefully, the kinds of things that a normal human programmer
would.
So let me go ahead and get started.
All right, so I've got my three different
Chrome instances
set up, one for each of the command lines.
I've got my three different command lines ready
to go.
And by the way, this is my fork of the OpenAI
Codex,
because if you try to use their version, at least
at the moment,
you get this error that your organization is
not verified.
And then you go to this URL that they want you
to go to
and then you get this page which says,
hey, click here to verify your organization
and then you go, oh, I'm gonna go to With Persona.
It takes you to withpersona.com
and then it requires you to consent
to the processing of your biometric information.
So no.
So I forked my own Codex
and I ripped out the part that requires that.
So I would recommend that you don't use theirs.
I would not be using it at all
if it weren't for the fact that I'm trying to
do a test on it.
I would have hit that and gone, I'm done.
And they would never have heard from me again,
but for you folks,
I went ahead and edited the thing around it
and then went ahead and got it working
so that I could show you what the difference is.
Okay.
So I'm gonna run each of these folks
and this one makes me say yes.
And then I'm gonna come over here
and I'm gonna grab my prompt.
So this is my prompt.
"You're a software engineer.
All of your interactions must be done
either with a Git command line or by using a
script."
I don't want it using the browser use
or the playwright command lines.
I want to see what it's doing.
I want to see the code that it's writing that
it's running.
"Don't install any software dependencies
yourself.
There's a config.json file.
It gives you the URLs that you need to use.
You're gonna need to write a script
to use the web socket to connect to the browser
that's appropriate to you
and do all of your interactions with the web pages
that way.
Read and complete the requirements
of the first stage of the challenge
when you get the first page and then wait."
So that's these first set of instructions.
And then we will see how long it takes each one
of them
to get to the end of the first page.
Generally Claude and Gemini do pretty well
on this.
OpenAI, it's anybody's guess.
Sometimes it goes right to it.
Sometimes it blows up.
The to-do's, I really like the way the to-do's
work in Claude.
It gives me a much better idea
of what it's trying to accomplish.
They're all running through this.
And here,
OpenAI is not following instructions.
It's basically trying to run through all of the
command lines.
It doesn't need to know.
It doesn't need to run pip3.
It doesn't need to run Playwright.
It needs to write a script to do it.
So we'll see if it actually does what I told it
to do
or if it ends up down the rabbit hole.
Why is it needing to run playwright, run server?
That's not of any use at all for what it's
trying to do.
It seems to be really hard for OpenAI
to be able to stay on track.
It bounces all over the place.
The Oops thing is interesting.
That doesn't happen very often.
There is this thing, for some reason,
Claude has a really hard time remembering
what directory it's supposed to be in.
'Cause it's got a step up here.
And then below that, it's got the directory
that it cloned
and it constantly gets confused about
which one it's supposed to be
and it bounces back and forth.
So it looks like Gemini got done fastest this
time.
That isn't always the case.
They're generally, these two are pretty close.
Yeah, and Claude is thinking it's done.
I wonder why this web page is not.
There we go.
I'm not sure how it ended up on the web space.
I'm gonna look at the logs at some point
try to figure out what it was doing.
And meanwhile, we're still sitting here waiting
on
OpenAI.
What is it doing?
Pre-commit config yaml.
What, why is that relevant at all?
It just, the OpenAI one just goes in the weeds.
And it's not the thing that I changed.
You can look at my GitHub.
All I did was rip out some header stuff
that had to do with organizations
and I ripped out some sandbox stuff.
'Cause it doesn't want to let itself just do
things.
I guess because they know it's the crappiest of
them all.
So it's constantly asking you stuff
and I don't want it to do that.
So I went and ripped out the sandbox crap.
Or are you asking me to run it?
[Blank Audio]
What is it doing there?
Valid context zero.
[Blank Audio]
I'm kind of at a loss as to what it's trying to
do.
Git cloning the repository, okay.
I mean, it's on the page.
That's not the command that you need.
[Blank Audio]
What are you doing?
It's reading page content.
[Blank Audio]
All right, well, while we're seeing if it ever
comes back,
I'm gonna come over here.
I'm gonna grab my next set of
commands and I'm gonna give it to these two
guys.
And this is just for the rest of the session.
"When you see 'mark stage as complete,'
click that.
If you see 'move to next stage', click that."
They have a tendency to try to look at the page,
not see the button, try to go on,
and then the button shows up after.
So I told them to wait a little bit.
So let me give them those instructions
and get those started.
[Blank Audio]
If it gets done, are you gonna stop,
or are you just gonna keep going?
What are you doing?
[Blank Audio]
What are you doing?
Give you the next set of instructions.
'Cause like Gemini is ready to go to the
next stage,
if it can arrange to hit the button,
sometimes it gets confused about that,
and it will jump to the next stage instead.
[Blank Audio]
Looks like Gemini is doing better this time.
It didn't get stuck on the,
oh, it didn't seem to go to the next page
though.
Not sure what it did.
[Blank Audio]
Look at this one, a little more space.
I'm not sure what's going on with that.
All right, so Claude Code, where are we?
So it's completed bind to a port.
Gemini is ahead.
It has completed respond with a 200.
[Blank Audio]
So, Claude got off by a page, it looked like.
Well, it looks like it's figured it out.
All right, so Claude and Gemini
are both working on the same issue now.
And OpenAI is still pretty far behind.
[Blank Audio]
Gemini is fast, but when you're reading through
what's actually going on, it's a lot easier for me
to tell what Claude is intending to do.
I really like the way it puts up a to-do list
and crosses things out.
And it's a lot easier for me that way to go:
"Oh wait, you're off in the weeds, don't do that.
Go somewhere else."
Wow, they finished almost exactly the same time.
Now we'll see who can navigate the best.
I have a script running in the background. Wow, they are neck and neck.
The script runs
on the host that hosts the virtual machines
and basically, every few seconds,
it checks in all of the scripts
from all the different AIs.
So when it's all said and done,
I can go back and look, on a minute-by-minute
or every-couple-of-minutes basis, at who had edited what
and all that kind of stuff.
So I can go back and look at the quality of the code.
I can look at what got edited and in what order.
I can look at what errors the AIs created for themselves
and then had to back out of.
So I will do a breakdown of how well they did at some point.
And then I will also, although I might leave
OpenAI out
'cause it's just not doing as well.
And then the next thing to do is to pick
more interesting, more complicated challenges,
now that we know that, in theory, they can
follow directions.
Seeing how well they follow directions
was really the purpose of this exercise.
(silence)
Yeah, see, you see what Claude is doing there,
where it's bouncing back and forth between view next stage
and that. That generally is a sign that it is
not waiting long enough for the button
and it's refreshing the page.
It's not seeing the button.
That used to happen all the time,
and then when I put in the wait five seconds
before you check to see if the button is there,
that stopped happening as often.
I guess it's a race condition,
and eventually it'll generally find it.
On very rare occasions,
I get sick of watching it
and I'll just click the button manually myself.
(silence)
Also, I've got a direction in here
that in addition to any requirements given in
the challenge,
they're each supposed to be writing unit tests
of all of the things that they create.
So I will go through and see how well the unit
tests get written.
That has been a sore spot for me.
Generally, when I ask large language models to
write tests,
the tests that they write are not nearly as good
as I think they ought to be.
Although I'll be honest, that often happens
with junior developers for me too.
You kind of have to, I guess, get to the point
where you've gotten bit a couple of times
by not having good enough tests before you
realize
how important it is to have thorough ones.
Looks like Claude is back to its,
not waiting long enough for the button to pop
up
before it tries to hit the button thing.
Um,
is Gemini done?
(silence)
So it's on the concurrent connections stage now.
(silence)
What is Claude doing?
(silence)
I do not know, it is navigating,
whatever that means.
(silence)
And OpenAI is still working on trying to
respond with a 200.
(silence)
Interesting.
So they are both off to the races.
(silence)
Let's see.
(silence)
Alright, so these have both passed the
concurrent connections stage.
I wonder what's next.
(silence)
Let me go pull it up on my own browser.
(silence)
So return a file is the next one.
You will see when they get around to clicking.
(silence)
And Claude is done with the page, we're done with
the stage, and it's having trouble figuring out
how
to push the button again. It's interesting how
inconsistent they are.
Different ones fall into different patterns
sometimes. Sometimes the Claude Code does this
more
often. Sometimes I've had it go through all the
way through and never hit the bit where it's
trying
to find the stage completed button and not able
to find it. Gemini has a tendency to jump
forward
a page too early and then get stuck and have to
go back a page. But it hasn't done that this
run.
So a lot of it just seems to be what the random
seed that started the whole process ended up
with,
I guess.
Looks like Gemini
is done with the return a file bit.
Claude is now done with the return a file bit.
Now we'll see who marks the stage as complete
first. There's only one more
challenge in the first section of this, which
is where they'll stop, which is reading the request
body. So basically it reads a POST
and you just find the body of the
POST and return it or something. Wow, those
were real close.
So these two might end up being neck and neck.
My bad, Gemini was waiting on me.
I don't know why because it did a couple of
challenges in a row and then it got to the
point
where it's starting to ask for directions again.
So I'm not sure why that is.
But this should be the last one.
Because Claude hasn't gotten stuck and started
asking, "hey, do you want me to go?
Do you want me to keep going or not?" So I'm not
sure why Gemini just randomly
decided that every time it got done with the
stage, it would stop because in the
instructions
I gave it, I explicitly said... Where's that?
In the instructions I gave it, I explicitly
said "for the rest of the session, at the
successful end of each stage, go to the next
stage." So I
don't know why it's suddenly decided it was
going to have to ask every time it got to the
end of a
stage, but we're about done. So shouldn't be
that big a deal.
I've got some other stuff I've been playing
with. I have one that mostly works.
That's a local version that runs on Ollama
against a 16 gigabyte. I forget.
It's a Ti card. I'm trying to remember
exactly which model it is.
60, 40, 90, I don't remember. It's a 16 gigabyte
card and I have a
Mistral model for coding that will run in it.
It's slow compared to these guys,
but it works. So if you're willing to... Wow,
all right, at the same time.
So if you're willing to put up with waiting a
long time, if all you're going to do is say,
hey, go do this, and then you're going to go to
lunch and then come back and hopefully it will
be done, then a lot of this can get done on a local
machine
with an open weight model without having to pay
anything or
have your code go anywhere else. But I'm not
going to demo that live because it takes
forever. It takes several hours to get
not as far as these guys have gotten. But it is
possible to do it on a local model for whatever
that's worth. So Claude and Gemini are both
done. They finished at about the same time,
although Gemini got confused and started
asking me, hey, do you want me to keep going?
The last few stages, which cost it some time.
OpenAI is still stuck. So I think I'm going to
call that there. I will go pull up the...
I will go pull up the code that it wrote, or
that they wrote, and kind of look at that.
And I will also put some stuff together on
basically the configuration that I used to
actually
get all this set up and running and everything.
But for now, I'm going to go ahead and close it
here. I just wanted you all to see what these
models were capable of doing by themselves,
the kinds of things I get stuck on. I want you
to remember that this is literally
the easiest challenge that CodeCrafters has.
There are tons and tons and tons of
implementations
of HTTP daemons in Python out there. So this is
incredibly well-represented in the training set.
And before too long, I'll be doing the same
kind of thing again with more complicated
projects
that they're trying to do. And again, if you're
interested, if you're going to join CodeCrafters,
that's cool. I wouldn't recommend doing what I'm
doing and joining CodeCrafters and just
throwing the AI at it because you're not going
to learn anything. And the value of
CodeCrafters is, at least when a human does it,
when you get done with it, you actually
have some understanding of step-by-step how the
tool that you use every day actually works
under
the covers. But I'm going to call the OpenAI
thing because it's still on stage four and
everybody
else is done. So in the interest of
transparency, let's talk about the changes that
I made to the
OpenAI tool. Here's the repo that I've got.
And all I did is I went through -- what? Okay,
that'll work. So all I did is I went through
and
I found like the organizations and project
stuff. I pulled that out. I found the other
places where
things like organization were referenced. I
pulled those things out.
I ripped out all of this running sandbox stuff
and just told it to return sandbox.none.
And then here's more organization and project
stuff that I took out.
This access to token plane. I'm not exactly
sure what that does.
But once I pulled out the authorizations, the
project and the organization stuff,
I always get an error from it, so I pulled it
out too. So here's a whole chunk of stuff that
I pulled
out. And here's some more stuff I pulled out.
And like I said, you can go look at GitHub
and see this if you want. But it's all just
organization, organization and project.
These are all the things that seem to be
associated with the thing that makes you have
to do the
biometrics. Oh, let me. Oh, so I took out this
stuff and then I changed the setup variable
just to hard-code it to be false and
pulled out the organization thing down here.
So that's all I did is I just ripped out a
bunch of stuff that said organization and
projects and stuff. If that ended up breaking
the ability for it to actually do
good direction following, I don't know what to
tell you. I don't think that could have had
anything to do with it, but whatever. So if you
look at my GitHub, that Codex thing should
still
be there. On to the next thing. So this is the code
that Google Gemini CLI has generated.
I'm going to walk through this one first
because it's small enough to fit on one page.
We'll talk
about the Claude Code one after, because it's
quite a bit bigger. So this
whole
thing is basically just one big long if-then-else
statement. Is it '/'? Is it looking at
'/echo'? Is it looking at '/user-agent'? Remember
this user-agent thing, because we'll see when
we go to
the Claude Code, Claude ended up losing that
block. I'm not sure what happened to it.
I guess it just forgot it needed to be doing
that anymore.
We've got this bit that starts with files. We've
got if the method is get, if the method is post,
it does two different things. And then if it
all falls down, it ends up with a 404 not found.
Some things to note. Notice that GET and POST
are checked for in
the
files block. They're not checked for in the user
agent. They're not checked for in the echo.
They're not
checked for in here. So if you try to POST
something to /user-agent, it will treat it
as if you
did a GET or whatever. So that's not the way
it's supposed to work. It didn't do much error
checking. It didn't check for the wrong methods.
It didn't check to see most of the other kinds
of
things down here. We've got this bit where you
post a file and it writes the file. Notice it
didn't
put a try catch around this. It didn't check to
see if the directory existed, all that kind of
stuff.
If you try to send a file to a bad directory,
it'll probably just crash. Also, there's no
kind of sanitization or anything. It doesn't try
to look at the file system path and see that it's
not ../../etc/passwd
or something that you're posting.
So it's not the brightest thing. In its defense,
the challenge didn't ask it for any of that
stuff.
There's nothing in the challenge that says, oh,
you have to do error detection or any of that
kind of
stuff. A good programmer will put that in. But
technically, it wasn't asked to do that.
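Just to illustrate the kind of checks I mean, here is a rough sketch of that dispatch with method checking and path sanitization added. This is not the code Gemini generated, just an example of what a more careful version of that if/else block might look like; the status codes and directory handling are my own assumptions.

```python
import os

def handle_request(method, path, body, files_dir):
    """Illustrative dispatch only; not the generated code."""
    if path.startswith("/files/"):
        # Refuse anything that escapes the serving directory (e.g. ../../etc/passwd).
        target = os.path.realpath(os.path.join(files_dir, path[len("/files/"):]))
        if not target.startswith(os.path.realpath(files_dir) + os.sep):
            return 404, b"Not Found"
        if method == "GET":
            try:
                with open(target, "rb") as f:
                    return 200, f.read()
            except OSError:
                return 404, b"Not Found"
        if method == "POST":
            try:
                with open(target, "wb") as f:
                    f.write(body)
                return 201, b"Created"
            except OSError:
                return 500, b"Internal Server Error"
        return 405, b"Method Not Allowed"
    if path.startswith("/echo/"):
        if method != "GET":
            return 405, b"Method Not Allowed"
        return 200, path[len("/echo/"):].encode()
    if path == "/":
        if method != "GET":
            return 405, b"Method Not Allowed"
        return 200, b""
    return 404, b"Not Found"
```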
Let me run over here to the tests. I previously
looked at the final version
of the main.py. For the Claude Code one, I'm
going to look at both of the end products.
For the Gemini one, I need to walk you
through the whole thing. This is the output of
Git log.
And it's from oldest to newest, so it's the
opposite of the order Git log usually shows.
So we start off with this test case. And we
check and we try to connect to the server.
And it's a normal kind of, what are we going to
do to check for things, right?
And then the next check-in is for responding
with 200, okay, right?
So what it does is it deletes the bind-to-port
test that it had and replaces it with a respond-with-200
test, right? And then down here, the next
check-in to handle different paths,
it deletes the respond-with-200 test and then
creates a path test, right?
So this is the pattern: every time Gemini is
given another thing
it's supposed to do, it dismantles the previous
version of the test. I guess this one, it didn't,
it wouldn't have added the echo, but the echo
gets taken away later.
And so it just keeps going and it goes, okay, I
don't need echo anymore.
I don't need not-found anymore, because right
now I'm trying to do concurrent connections.
So it just wipes out all of its old tests and
creates a new test every time, or pretty much
every time. So the tests that you
would think
you need to be getting, you're not
getting, because it's not going to keep anything.
You're
only testing the last one or two
things every time it has something else it
needs to do.
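For contrast, this is roughly what a cumulative test file looks like: the earlier stages' tests stay in place as new ones get appended, so a regression in an old stage still fails. The port and paths here are placeholders, not the challenge's actual spec.

```python
import unittest
import urllib.error
import urllib.request

BASE = "http://localhost:4221"  # placeholder for wherever the server listens

class HttpServerTests(unittest.TestCase):
    # Stage 1: still here even after later stages were added.
    def test_root_returns_200(self):
        with urllib.request.urlopen(BASE + "/") as resp:
            self.assertEqual(resp.status, 200)

    # Stage 2: kept alongside stage 1, not written over it.
    def test_unknown_path_returns_404(self):
        with self.assertRaises(urllib.error.HTTPError) as ctx:
            urllib.request.urlopen(BASE + "/does-not-exist")
        self.assertEqual(ctx.exception.code, 404)

    # Stage 3: each new stage only appends another test method.
    def test_echo_returns_body(self):
        with urllib.request.urlopen(BASE + "/echo/hello") as resp:
            self.assertEqual(resp.read(), b"hello")

if __name__ == "__main__":
    unittest.main()
```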
That's a thing that, like I said in the other
video, I'm confident that I can tell it, I can
make it
do the right thing, assuming that I know what
the right thing is and I crack the thing open
and
look at it and say, "Oh no, you need to do this
too and you need to do this too and you need to
do
this too." That kind of violates the spirit of
"Vibe Coding", right? The idea is basically what
can this thing do when you tell it, "do what you
ought to do to satisfy these conditions," right?
So those of you who potentially are going to
accuse me of just prompting it wrong,
yes, I know I could prompt it specifically for
the purpose of getting it to write the tests,
the way the tests need to be written, but the
purpose of this exercise is not,
you know, "if you micromanage it, how well does
it do?" It's "how well does it do,
given the same kind of instructions that you
would normally give to a human programmer?"
Okay, so now we're on to the main for Claude.
This code is much longer, I don't know, 8 or 10 pages,
but it's much better laid out. The parse-HTTP-request
piece is one function, the handle-request piece is
another,
and it makes a lot more sense.
This basically pulls out the very top part,
rips out the headers, and returns the various
different chunks of the request,
which is kind of handy. It's a lot cleaner than
the way that Gemini did it.
Then we come down to this block right here, which is
pretty much the same as the block that was
in the center of the file in Gemini's, right?
It's a lot bigger, there's a lot more,
a lot more white space. It's a lot easier to
read, there are actually comments, which is
nice.
We've got our, you know, path is slash, we've
got echo, we've got files. Notice we don't have
the
user agent thing, it just went away. This is
cool, right? So this is actually a try except
respond with a 500 error there. Nobody told it to
do that, but it's the right thing to do.
It knows about method-not-allowed, which is something
that the
challenge didn't tell it to do, but obviously
it's copied that out of some other
implementation
it's got, so that's kind of useful. So the
implementation that it wrote is much cleaner
and much easier to read. I like it a lot better,
but it did drop the user agent piece that
it was asked to do, and that Gemini left in.
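Roughly the shape of what I'm describing, as a sketch rather than Claude's actual code: a parse step, a handle step, a try/except that falls back to a 500, and a 405 for methods it doesn't know. The details here are my own illustration.

```python
def parse_http_request(raw: bytes):
    """Split a raw HTTP request into method, path, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    request_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    method, path, _version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, headers, body

def handle_connection(conn):
    try:
        method, path, headers, body = parse_http_request(conn.recv(65536))
        if method not in ("GET", "POST"):
            conn.sendall(b"HTTP/1.1 405 Method Not Allowed\r\n\r\n")
        elif path == "/":
            conn.sendall(b"HTTP/1.1 200 OK\r\n\r\n")
        else:
            conn.sendall(b"HTTP/1.1 404 Not Found\r\n\r\n")
    except Exception:
        # Anything unexpected becomes a 500 instead of killing the server.
        conn.sendall(b"HTTP/1.1 500 Internal Server Error\r\n\r\n")
    finally:
        conn.close()
```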
So there is that user-agent issue, right? OK,
so this is the test code from Claude. This is the
end result. So we've got our setups,
we've got our tear downs, that's pretty normal.
We test to make sure that the import runs.
I'm not sure exactly how useful that is,
because if the import fails, the thing will
throw right
at the very beginning. Test to make sure that
the main function exists. Sure. Test to make
sure
that the debug message gets printed. Why do we
need to do that? And like spin up a mock server
and everything, just to make sure that we get a
debug message? That's incredibly silly and not
useful. We test socket creation, that's
great. I guess the test that we accepted the
socket
is a little bit of overkill. The test that we can bind
to a
port isn't actually going to tell you anything,
because just because you can bind to a port,
when you're running your test, doesn't
necessarily mean that port's going to be free
when you try to run it in the real world. So
this test is of limited utility.
We test that the request is valid. We get a
simple test. We make sure that we get the right
answers
back. That's fine. We test to make sure that
the root path returns what it should. That's
great.
We check to see if we get an invalid format, if
we send just invalid requests instead of a proper
'GET path HTTP/1.1', that's fantastic. Nobody asked us
to do that, but it's great that it did.
It checks for incomplete, which is cool. Checks
for empty, which is fine.
Now we're back to the same kind of test. So we're
doing the root path again.
We're doing other path here. This is the return
404. Now we're doing it with mock sockets,
instead of just looking strictly at the parsing
of the thing. We check our valid requests again.
This is with a mock socket instead of just
looking at parsing the strings.
We look at handling empty data. We look at
client connections. We look at concurrent
connections.
And we're done. This is all part of the threading
for the client connections test.
So if you come back here, you'll notice we've
got tests for the root path. We've got tests
for
404. We don't have tests for a user agent,
which would have failed because it pulled that
out.
We don't have tests for echo. We don't have
tests for files. The only thing it's testing
for
is: does it handle the root path correctly? And
does it return 404 for random
crap? It's not testing any of the other things
that it was asked to do.
And not only that, but one of the things, at
least, that it was asked to do,
it stopped doing. But the tests that it
wrote were completely focused around garbage,
like making sure that the debug print statement
happened. The
giant if statement that's the core of the
logic of the program, it didn't bother to write
tests
for that, really. But it cared about the debug
statement. This is the kind of ridiculous
things
that you get when you ask an AI to write tests,
unfortunately. I'm not too surprised. This is
pretty consistent with what I usually see when
I ask them to write tests. My guess is it's
just
because there's so much more example code than
test code in the world, and that's its sample
set, so however bad it might be at writing
real code, it's going to be much, much worse,
orders of magnitude worse, at writing tests. That just
seems to be the case.
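For what it's worth, the kind of tests that would have caught the problem are behavior-level tests against the endpoints the challenge actually asks for, something like the sketch below. The port, file name, and header values are placeholders; the user-agent test is the one that would have flagged Claude dropping that block.

```python
import unittest
import urllib.request

BASE = "http://localhost:4221"  # placeholder

class EndpointTests(unittest.TestCase):
    def test_user_agent_is_echoed_back(self):
        # This would have caught the /user-agent handler disappearing.
        req = urllib.request.Request(BASE + "/user-agent",
                                     headers={"User-Agent": "tester/1.0"})
        with urllib.request.urlopen(req) as resp:
            self.assertEqual(resp.read(), b"tester/1.0")

    def test_files_post_then_get_round_trip(self):
        data = b"hello world"
        post = urllib.request.Request(BASE + "/files/sample.txt",
                                      data=data, method="POST")
        urllib.request.urlopen(post).close()
        with urllib.request.urlopen(BASE + "/files/sample.txt") as resp:
            self.assertEqual(resp.read(), data)

if __name__ == "__main__":
    unittest.main()
```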
All right, let me jump back up here. All right,
so I'm going to go
through
some setup stuff. So this is the prompt
that I used. You could pull that if you were
quick out of one of the other things, but I'll
leave this on screen for a while. But this is
the
first prompt that I use. And then once it
manages to correctly clone the repo,
I paste in the second chunk. And those are the
only prompts that I give it.
So from a consistency standpoint, this is what
I do.
So if you're wondering about what prompt I used,
this is the prompt that I used.
And then a couple other things. This is the
setup script I use. So this runs on
the host that the virtual machines are running
on top of or running inside of.
So basically this bit right here just cleans up
from the previous run.
It creates a new Git repository to track all
the things that the clients are going to do.
Create some directories. It grabs a bunch of
information from the Chrome instances that I
have
running. I'll have a script that starts those.
I'll show you in a second. It pulls out the
WebSocket debugger URL. It creates the user name
and the Git email
from the directory name. It creates
this config.json file, which is what the prompt
tells the agent to go look at. It cleans up
some of the temporary files.
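The shape of that config step, as a sketch and not my actual script: Chrome exposes the browser-level WebSocket URL at its /json/version DevTools endpoint, so the setup script can fetch it and drop it into a config.json per agent. The port numbers and the key names in the JSON are placeholders.

```python
import json
import urllib.request

# Placeholder ports; each Chrome instance was started with its own --remote-debugging-port.
PORTS = {"claude": 9222, "gemini": 9223, "codex": 9224}

config = {}
for name, port in PORTS.items():
    # Chrome reports webSocketDebuggerUrl at this DevTools endpoint.
    with urllib.request.urlopen(f"http://localhost:{port}/json/version") as resp:
        info = json.load(resp)
    config[name] = {"webSocketDebuggerUrl": info["webSocketDebuggerUrl"]}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```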
And then it goes through a loop that basically just sits
there. By the way, these
directories right here, the directories that
get made, are each mounted
into
one of the virtual machines.
So this script sits above that
directory and basically just checks in, every 90
seconds, everything that the clients have done.
And so that way, if one of the machines crashes
or something, I can look at what the state of it
was the last time that it wrote anything to disk.
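The check-in loop itself is nothing fancy; a sketch of the idea, with the repo path as a placeholder:

```python
import subprocess
import time

REPO = "/srv/agent-workspaces"  # placeholder: the tracking repo the VM directories are mounted under

while True:
    # Snapshot whatever the agents have written since the last pass.
    subprocess.run(["git", "-C", REPO, "add", "-A"], check=False)
    # The commit simply fails quietly when nothing has changed.
    subprocess.run(["git", "-C", REPO, "commit", "-m", "periodic check-in", "--quiet"],
                   check=False)
    time.sleep(90)
```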
This is the script that I use to actually launch the VMs. I'm using QEMU to
run them. So I have a big
Ubuntu, pretty simple, Ubuntu 24 desktop box. I'm
using raw format because it's a lot faster,
six gigabytes of memory. There's nothing really
fancy about this. What I'm using for the
virtual
machine, I am using the serial console. If you
don't understand how the serial console works,
if you've got a GUI, just use the GUI. I
just do it so I don't have to walk over there
and
actually look at the console if I don't need to.
And then this is the script that I use to
actually
launch Chrome. I use a thing called screen. I
use screen for everything. Basically, it's a
thing
that creates a virtual terminal. You can run a
process inside of it and then you can detach
from
it and reattach to it later. It's really
convenient. And then basically exec with this arch
-arm64 thing, which makes sure that it doesn't
accidentally run the x64
version
in Rosetta, which it generally doesn't, but
just in case. It launches Chrome, gives it a remote
debugging
port, gives it the directory where its profile
is supposed to live, which is named, this is
all
one line, which is named Claude Code or whatever.
And then it gives it the URL
for the first page for it to go
to.
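Stripped of the screen and arch wrapping, the Chrome launch boils down to something like this; the binary path, port, profile directory, and start URL are all placeholders for whatever your setup uses.

```python
import subprocess

CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"  # placeholder path

subprocess.Popen([
    CHROME,
    "--remote-debugging-port=9222",            # the port the agent's script attaches to
    "--user-data-dir=/tmp/profiles/claude",    # per-agent profile so logins stay separate
    "https://example.com/first-page",          # placeholder start URL
])
```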
And then I have to set things up manually:
I log into CodeCrafters
there, and then I have to get it set up on the
right challenge. And then I can run the setup
script that goes and pulls the information from
it. And then I think that's it. All right. So
those are all the various
scripts that I've used. And the code that I
changed. So that's stuff that might be useful
for you.
Carl, a software engineer with 36 years of experience, explores the current state of 'agentic coding' by testing how Claude, Gemini, and OpenAI models perform as autonomous agents. He tasks them with completing a CodeCrafters challenge to build an HTTP server in Python. While the AI agents successfully navigate the coding task, Carl highlights major weaknesses in their ability to write high-quality unit tests and their tendency to overlook specific requirements. The video also details a technical setup using debug ports and pre-authenticated browser sessions to bypass complex OAuth hurdles that current AI models cannot handle independently.