AI Agent "Vibe Coding" Breakdown: Code, Tests, Quality, Maintainability

Transcript

0:00

Welcome to the Internet of Bugs.

0:02

My name is Carl.

0:03

I've been a software professional for 36 years

0:06

or so.

0:07

And today I'm going to talk to you about

0:09

agentic coding and "Vibe Coding",

0:11

or at least as close to "Vibe Coding"

0:13

as I've ever been, or as I ever get.

0:15

I have lived through a number of transitions

0:21

in the software industry where new automations

0:26

had come in, things that used to be done by

0:29

hand

0:29

that were now done automatically.

0:30

A lot of people freaked out, thought their jobs

0:32

were going to go away.

0:34

So far that's never happened.

0:35

Jobs go away temporarily, but we always end up

0:40

coming out of it

0:41

with more programmers than we had before.

0:43

I expect AI is not going to be any different.

0:46

To me, this is just another way of taking a

0:51

bunch

0:51

of the boilerplate and taking a bunch of the

0:54

tedious work,

0:55

automating it, giving it to a tool

0:58

that does a better job of it more quickly

1:01

and letting us focus on the things

1:03

that are higher level and more important.

1:05

So to the extent that that's true, I'm all for

1:09

it.

1:09

To the extent that people think that it's as

1:11

good

1:12

at writing code as a programmer

1:14

that's been doing this for a while,

1:17

I'm not so much for it because people make

1:19

enough bugs

1:20

and AI is way less secure and way less

1:24

competent

1:24

when it comes to code writing, at least in my

1:27

experience.

1:28

I don't see that changing anytime soon,

1:30

but we'll see what happens.

1:32

So let me kind of walk you through what I've

1:34

got here.

1:35

I've got three different instances of Chrome.

1:38

Each one is running with the debug port

1:41

activated.

1:41

And then each one of these terminals right here

1:46

is going to be a different agent command line

1:50

tool

1:50

that's going to theoretically connect over the

1:53

web socket

1:54

to the debug port of that browser

1:56

and then drive the browser.

1:57

And that way we can see what the browser is

1:59

doing.

1:59

We can see what the agent is doing.

2:03

Also, it means that I don't have to teach the

2:05

agent

2:06

how to do authentication, which is a giant pain

2:09

in the rear.
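To make that concrete, here is a rough sketch (not the actual setup from the video) of driving an already-authenticated Chrome over its debug port with a plain websocket script; the port number 9222 and the example URL are assumptions.

```python
# Minimal sketch: drive a Chrome that was started with
# --remote-debugging-port=9222 (the port is an assumption, not from the video).
import asyncio
import json

import requests
import websockets


async def main():
    # Chrome's debug port lists the open tabs at /json, including the
    # webSocketDebuggerUrl for each one.
    targets = requests.get("http://localhost:9222/json").json()
    ws_url = next(t["webSocketDebuggerUrl"] for t in targets if t["type"] == "page")

    async with websockets.connect(ws_url) as ws:
        # Send a raw Chrome DevTools Protocol command to navigate the
        # already-logged-in tab; no authentication has to be scripted.
        await ws.send(json.dumps({
            "id": 1,
            "method": "Page.navigate",
            "params": {"url": "https://example.com/"},  # placeholder URL
        }))
        print(await ws.recv())


asyncio.run(main())
```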

2:09

There's been some talk recently,

2:15

I'll link the article below.

2:17

There's somebody who was talking about using

2:19

Claude Code

2:20

and said the only MCP that they ever use was

2:23

Playwright.

2:26

I have not had great luck with the Playwright MCP.

2:29

Periodically, what happens...

2:32

I mean, and I've got a fairly specific setup

2:36

where I have a page that's already set up

2:41

where I need it to be.

2:42

It's got all of the OAuth authentication

2:45

already set up.

2:46

And it's not easy or maybe even possible for

2:53

the AI to do that.

2:55

I mean, in order for me to get stuff configured,

2:58

I've got my USB key I have to log in with.

3:00

OAuth, all kinds of stuff like that.

3:04

And it's just not feasible to get the AI to do

3:06

that.

3:07

So what I need the AI to do instead of

3:10

connecting to a page

3:11

and logging in and going through all that thing

3:12

is I need it to connect to a page

3:14

that I've already set up for it

3:16

and then to exercise it from there.

3:17

It doesn't always do a great job of that.

3:22

And so, instead of using the MCP,

3:27

which has a lot of things that happen under the

3:30

covers,

3:30

if I make it write a Python script or a Node

3:33

script

3:33

or something like that, that exercises a Playwright

3:36

API,

3:37

if and when it goes wrong,

3:39

I can look at the code that it generated

3:41

that's trying to use Playwright and go,

3:42

"Oh yeah, this is what you screwed up"

3:43

and I can either tell it what to fix or fix it

3:45

myself.
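A script of the kind being described, one that exercises the Playwright API against the already-running browser so the generated code can be read and fixed by hand, might look roughly like this; the debug port is an assumption, and this is a sketch, not any agent's actual output.

```python
# Sketch of the kind of Playwright script the agent is asked to write,
# assuming Chrome was launched with --remote-debugging-port=9222.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Attach to the existing, already-authenticated browser instead of
    # launching a fresh one.
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    context = browser.contexts[0]   # reuse the logged-in profile
    page = context.pages[0]         # the tab that was set up by hand

    # From here the script can read or exercise the page, and the code is
    # plain Python that a human can inspect and fix when it goes wrong.
    print(page.url)
    print(page.title())
```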

3:46

In theory, in "Vibe Coding" land,

3:50

you would just ask it over and over

3:52

to keep working on it until it finally fixed it.

3:56

I don't have time for that.

3:59

I don't have the patience for that.

4:01

So after it's beat its head against a wall for

4:03

a while,

4:04

I'll step in and fix it

4:05

'cause I get too annoyed otherwise.

4:08

This particular setup, the OAuth stuff

4:15

is really important for me personally.

4:17

So as you can tell, I'm a YouTuber, among other

4:20

things.

4:20

And a lot of the things that I spend a lot of

4:23

time doing

4:24

are dealing with YouTube.

4:27

And YouTube has an API,

4:28

but the API is fairly limited

4:30

and a lot of the things that I need to do as a

4:32

YouTuber,

4:33

I can only do once I'm authenticated with the OAuth

4:36

API.

4:36

So if I want to write code

4:39

to automate stuff that I have to do manually,

4:41

which is stuff that I do periodically that

4:45

irritates me

4:46

and automating things is one of the main

4:48

reasons

4:48

I got into programming in the first place back

4:50

in the early 80s.

4:51

I need to figure out how to write code

4:56

that can interact with a very complicated OAuth

5:00

setup

5:00

that YouTube uses.

5:02

And me being paranoid, I've got one-time

5:06

passwords

5:07

and I've got hardware keys and all kinds of

5:11

stuff

5:11

'cause I don't want people hacking my channel

5:12

or whatever.

5:13

Well, I don't want people hacking anything.

5:15

I'm just paranoid.

5:16

And I have good reason to be paranoid.

5:18

I've dealt with more hacks over time

5:20

than a lot of people have.

5:22

Anyway, so what I'm using here,

5:27

I am using a site called CodeCrafters.

5:32

I've used them before.

5:33

I find them to be very useful.

5:36

They have these challenges for developers

5:40

and these challenges will explain to a

5:44

developer

5:45

in a step-by-step process

5:46

how to build a piece of open-source software.

5:48

And it's, they're very interesting.

5:52

It's very, very useful as a developer

5:56

to understand how the basics of a lot of the

5:59

services

6:00

that you're working with function.

6:02

When you're trying to debug

6:05

some really complicated web thing,

6:10

there's lots and lots and lots of stuff going

6:15

on

6:16

and having spent some time

6:19

inside the guts of the basics of what a web

6:23

server does

6:24

and thoroughly understanding that

6:26

means that you don't get lost as badly

6:30

or end up unable to see the forest for the trees.

6:34

When you're looking at all of the stuff that's

6:36

going on

6:36

'cause you have an understanding of what the

6:39

flow is

6:39

and you understand how everything is working

6:41

and so all of the different log messages

6:44

and all that kind of stuff make more sense to

6:45

you

6:46

and you have kind of a structure to hang all

6:47

that stuff on.

6:48

So I definitely recommend that folks understand

6:52

the basics of whatever tools that you use,

6:57

be that web servers,

7:02

if you're a web programmer,

7:05

which most of us are these days,

7:06

there's a project at CodeCrafters

7:10

where you can understand Git,

7:12

where you can build your own Git server.

7:14

Understanding Git is really handy

7:17

for not getting yourself in a position

7:20

where you end up losing code

7:21

or you have to find somebody who's better than

7:24

you

7:24

to come in and try to untangle everything.

7:27

So that's all really useful.

7:28

If you wanna try out some of the stuff yourself,

7:33

I'll put a link in the show notes

7:36

and I'll put it down here.

7:38

If you use that link to sign up there,

7:41

I may get an affiliate fee,

7:45

which kind of helps the channel out.

7:46

So if you're gonna go try it anyway,

7:48

but I would appreciate it, I think it's useful.

7:50

But for these purposes, I'm using it because

7:53

one,

7:54

it has a similar OAuth setup to YouTube, which

7:56

is cool.

7:57

And two, it gives me a good benchmark

8:01

for what a human I know can do.

8:05

And a lot of these tests will say, of the

8:10

however many people that have tried this test,

8:14

96% have done this stage correctly.

8:17

So if the AI can or can't do the stage

8:20

correctly,

8:21

that gives me some idea of kind of where that

8:23

AI falls

8:24

in terms of human programmers,

8:27

or at least human programmers that are advanced

8:30

enough

8:30

that they're willing to take that kind of

8:32

challenge.

8:33

Today, I'm gonna be working on the simplest,

8:37

at least for the AI, the simplest challenge

8:39

that they have,

8:40

which is build an HTTP server in Python.

8:43

That is the thing that as near as I can tell

8:47

is most represented in the AI's training data.

8:51

So this is not really a test and I've had other

8:55

videos

8:55

where I've run them through this particular

8:56

test

8:57

and they do pretty well.

8:58

So this is not a test yet of how well the AIs

9:03

can code.

9:03

This is more a test of how the AIs can be an

9:10

agent

9:10

and follow directions and act in a loop

9:13

and do the kinds of things with little to no

9:16

supervision,

9:17

hopefully, that a normal human programmer

9:21

would.

9:22

So let me go ahead and get started.

9:24

All right, so I've got my three different

9:29

Chrome instances

9:29

set up, one for each of the command lines.

9:32

I've got my three different command lines ready

9:34

to go.

9:35

And by the way, this is my fork of the OpenAI

9:40

Codex

9:40

because if you try to use theirs, at least

9:45

at the moment,

9:46

you get this error that your organization is

9:50

not verified.

9:51

And then you go to this URL that they want you

9:55

to go to

9:55

and then you get this page which says,

10:00

hey, click here to verify your organization

10:02

and then you go, oh, I'm gonna go with

10:05

Persona.

10:05

It takes you to withpersona.com

10:07

and then it requires you to consent

10:09

to the processing of your biometric information.

10:12

So no.

10:15

So I forked my own Codex

10:18

and I ripped out the part that requires that.

10:21

So I would recommend that you don't use theirs.

10:26

I would not be using it at all

10:30

if it weren't for the fact that I'm trying to

10:32

do a test on it.

10:33

I would have hit that and gone, I'm done.

10:36

And they would never have heard from me again,

10:39

but for you folks,

10:41

I went ahead and edited the thing around it

10:44

and then went ahead and got it working

10:48

so that I could show you what the difference is.

10:50

Okay.

10:51

So I'm gonna run each of these folks

10:52

and this one makes me say yes.

10:58

And then I'm gonna come over here

11:00

and I'm gonna grab my prompt.

11:04

So this is my prompt.

11:06

"You're a software engineer.

11:08

All of your interactions must be done

11:09

either with a Git command line or by using a

11:12

script."

11:14

I don't want it using the browser use

11:16

or the Playwright command lines.

11:18

I want to see what it's doing.

11:19

I want to see the code that it's writing that

11:21

it's running.

11:22

"Don't install any software dependencies

11:25

yourself.

11:25

There's a config.json file.

11:27

It gives you the URLs that you need to use.

11:29

You're gonna need to write a script

11:30

to use the web socket to connect the browser

11:34

that's appropriate to you

11:35

and do all of your interactions with the web pages

11:37

that way.

11:39

Read and complete the requirements

11:41

of the first stage of the challenge

11:44

when you get the first page and then wait."

11:46

So that's the first set of instructions.

11:49

And then we will see how long it takes each one

11:53

of them

11:54

to get to the end of the first page.

12:01

Generally Claude and Gemini do pretty well

12:06

on this.

12:10

OpenAI, it's anybody's guess.

12:12

Sometimes it goes right to it.

12:17

Sometimes it blows up.

12:18

The to-do's, I really like the way the to-do's

12:22

work in Claude.

12:24

It gives me a much better idea

12:25

of what it's trying to accomplish.

12:27

They're all running through this.

12:39

And here,

12:44

OpenAI is not following instructions.

12:50

It's basically trying to run through all of the

12:52

command lines.

12:53

It doesn't need to know.

12:54

It doesn't need to run pip3.

12:55

It doesn't need to run Playwright.

12:57

It needs to write a script about it.

12:59

So we'll see if it actually does what I told it

13:02

to do

13:02

or if it ends up down the rabbit hole.

13:07

Why is it needing to run playwright run-server?

13:10

That's not of any use at all for what it's

13:13

trying to do.

13:14

It seems to be really hard for OpenAI

13:21

to be able to stay on track.

13:23

It bounces all over the place.

13:26

The Oops thing is interesting.

13:32

That doesn't happen very often.

13:35

There is this thing, for some reason,

13:37

Claude has a really hard time remembering

13:43

what directory it's supposed to be in.

13:45

'Cause it's got a step up here.

13:48

And then below that, it's got the directory

13:50

that it cloned

13:51

and it constantly gets confused about

13:54

which one it's supposed to be

13:55

and it bounces back and forth.

13:56

So it's like Gemini got done fastest this

13:59

time.

13:59

That isn't always the case.

14:00

They're generally, these two are pretty close.

14:03

Yeah, and Claude is thinking it's done.

14:08

I wonder why this web page is not.

14:25

There we go.

14:26

I'm not sure how it ended up on the web space.

14:30

I'm gonna look at the logs at some point

14:31

try to figure out what it was doing.

14:33

And meanwhile, we're still sitting here waiting

14:36

on

14:36

OpenAI.

14:40

What is it doing?

14:42

.pre-commit-config.yaml.

14:48

What, why is that relevant at all?

14:59

It just, the OpenAI one just goes in the weeds.

15:04

And it's not the thing that I changed.

15:10

You can look at my GitHub.

15:13

All I did was rip out some header stuff

15:15

that had to do with organizations

15:20

and I ripped out some sandbox stuff.

15:24

'Cause it doesn't want to let itself just do

15:28

things.

15:29

I guess because they know it's the crappiest of

15:31

them all.

15:32

So it's constantly asking you stuff

15:34

and I don't want it to do that.

15:35

So I went and ripped out the sandbox crap.

15:58

Or are you asking me to run it?

15:59

[Blank Audio]

16:37

What is it doing there?

16:38

Valid context zero.

16:43

[Blank Audio]

17:16

I'm kind of at a loss as to what it's trying to

17:19

do.

17:19

Git cloning the repository, okay.

17:33

I mean, it's on the page.

17:40

That's not the command that you need.

17:43

[Blank Audio]

17:46

What are you doing?

17:51

It's reading page content.

18:01

[Blank Audio]

18:04

All right, well, while we're seeing if it ever

18:22

comes back,

18:24

I'm gonna come over here.

18:27

I'm gonna grab my next set of

18:29

commands and I'm gonna give it to these two

18:34

guys.

18:34

And this is just for the rest of the session.

18:40

"When you see 'mark stage as complete,'

18:42

click that.

18:42

If you see 'move to next stage', click that."

18:44

They have a tendency to try to look at the page,

18:52

not see the button, try to go on,

18:57

and then the button shows up after.

18:59

So I told them to wait a little bit.

19:01

So let me give them those instructions

19:13

and get those started.

19:18

[Blank Audio]

19:46

If it gets done, are you gonna stop,

19:54

or are you just gonna keep going?

19:55

What are you doing?

20:06

[Blank Audio]

20:08

What are you doing?

20:35

Give you the next set of instructions.

20:39

'Cause like Gemini is ready to go to the

20:52

next stage,

20:57

if it can arrange to hit the button,

20:58

sometimes it gets confused about that,

21:00

and it will jump to the next stage instead.

21:02

[Blank Audio]

21:05

Looks like Gemini is doing better this time.

21:28

It didn't get stuck on the,

21:32

oh, it didn't seem to go to the next page

21:34

though.

21:34

Not sure what it did.

21:39

[Blank Audio]

22:22

Look at this one, a little more space.

22:23

I'm not sure what's going on with that.

22:25

All right, so Claude Code, where are we?

22:38

So it's completed "bind to a port."

22:41

Gemini is ahead.

22:44

It has completed "respond with 200."

22:46

[Blank Audio]

23:21

So, Claude got off by a page, it looked like.

23:26

Well, it looks like it's figured it out.

23:31

All right, so Claude and Gemini

23:38

are both working on the same issue now.

23:42

And OpenAI is still pretty far behind.

23:47

[Blank Audio]

23:50

Gemini is fast, but when you're

24:04

reading through

24:04

what's actually going on, it's a lot easier for

24:09

me

24:09

to tell what Claude is intending to do.

24:14

I really like the way it puts up a to-do list

24:18

and crosses things out.

24:19

And it's a lot easier for me that way to go:

24:24

"Oh wait, you're off in the weeds, don't do that.

24:27

Go somewhere else."

24:28

Wow, they finished almost exactly the same time.

24:31

Now we'll see who can navigate the best.

24:35

I have a script running in the background.

24:43

Wow, they are neck and neck.

24:46

I have a script running in the background

24:48

on the host that hosts the virtual machines

24:53

that is basically, every

24:57

few seconds,

24:58

checking in all of the scripts

25:02

from all the different AIs.

25:04

So when it's all said and done,

25:05

I can go back and look, not on a second-by-second basis,

25:08

but like a minute-by-minute

25:10

or every-couple-of-minutes basis of who had

25:13

edited

25:14

and all that kind of stuff.

25:15

So I can go back and look at the quality of the

25:17

code.

25:18

I can look at what got edited and in what order.

25:20

I can look at what errors the AIs created for

25:25

themselves

25:25

and then had to go back out of.

25:29

So I will do a breakdown of how well they did.

25:38

How well they did at some point.

25:41

And then I will also, although I might leave

25:46

OpenAI out

25:47

'cause it's just not doing as well,

25:51

And then the next thing to do is to pick

25:56

more interesting, more complicated challenges.

26:02

Now that we know that in theory, they can

26:04

follow directions,

26:05

which is really the, seeing how well they

26:10

follow directions

26:10

was really the purpose of this exercise.

26:16

(silence)

26:18

Yeah, see, you see what Claude is doing there,

26:38

where it's bouncing back and forth between view

26:39

next stage

26:40

and that, that generally is a sign that it is,

26:46

not waiting long enough for the button

26:48

and it's refreshing the page.

26:50

It's not seeing the button.

26:52

That used to happen all the time

26:56

and then when I put in the wait five seconds

26:59

before you check to see if the button is there,

27:03

that stopped happening as often.
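The workaround being described, pausing before looking for the button, would look something like this in a Playwright script; the button label is the one quoted in the prompt, and the port and timeouts are assumptions.

```python
# Sketch of the "wait before you look for the button" fix described above.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.connect_over_cdp("http://localhost:9222").contexts[0].pages[0]

    # The button renders late, so pause first instead of refreshing the page.
    page.wait_for_timeout(5000)  # roughly five seconds, as in the instructions

    button = page.get_by_text("Mark stage as complete")
    button.wait_for(state="visible", timeout=30_000)
    button.click()
```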

27:07

But generally, I guess it's a race condition

27:11

and eventually, generally, it'll find it. On

27:15

very rare occasions,

27:16

I get sick of watching it

27:19

and I'll just click the button itself

27:21

or click the button manually myself.

27:25

(silence)

27:55

Also, I've got a direction in here

27:57

that in addition to any requirements given in

28:09

the challenge,

28:09

they're each supposed to be writing unit tests

28:11

of all of the things that they create.

28:15

So I will go through and see how well the unit

28:17

tests get written.

28:19

That's a, that has been a sore spot for me.

28:25

Generally when I ask, well,

28:29

generally when I ask large language models to

28:32

write tests,

28:33

the tests that they write are not nearly as

28:35

good

28:35

as I think they ought to be.

28:38

Although I'll be honest, that often happens

28:40

with junior developers for me too.

28:42

You kind of have to, I guess, get to the point

28:48

where you've gotten bit a couple of times

28:50

by not having good enough tests before you

28:53

realize

28:53

how important it is to have thorough ones.

28:55

Looks like Claude is back to its,

28:58

not waiting long enough for the button to pop

29:04

up

29:04

before it tries to hit the button thing.

29:06

Um,

29:13

is Gemini done?

29:18

(silence)

29:20

So it stopped at concurrent connections.

29:34

(silence)

30:40

What is Claude doing?

30:42

(silence)

30:50

I do not know, it is navigating,

30:52

whatever that means.

30:54

(silence)

31:12

And OpenAI is still working on trying to

31:14

respond with a 200.

31:14

(silence)

31:44

Interesting.

31:46

So they are both off to the races.

31:50

(silence)

32:10

Let's see.

32:12

(silence)

32:30

Alright, so these have both passed the

32:33

concurrent connections stage.

32:34

I wonder what's next.

32:36

(silence)

32:38

Let me go pull it up on my own browser.

32:42

(silence)

32:44

So return a file is the next one.

32:48

You will see when they get around to clicking.

32:52

(silence)

33:18

And Claude is done with the page.

33:20

Claude is done with the page, we're done with

33:23

the stage, and it's having trouble figuring out

33:26

how

33:26

to push the button again. It's interesting how

33:37

inconsistent they are.

33:48

Different ones fall into different patterns

33:54

sometimes. Sometimes the Claude Code does this

34:01

more

34:01

often. Sometimes I've had it go through all the

34:03

way through and never hit the bit where it's

34:07

trying

34:08

to find the stage completed button and not able

34:09

to find it. Gemini has a tendency to jump

34:14

forward

34:15

a page too early and then get stuck and have to

34:18

go back a page. But it hasn't done that this

34:22

run.

34:22

So a lot of it just seems to be what the random

34:29

seed that started the whole process ended up

34:33

with,

34:33

I guess.

34:46

Looks like Gemini

35:16

is done with the return a file bit.

35:23

Claude is now done with the return a file bit.

35:28

Now we'll see who marks the stage as complete

35:33

first. There's only one more

35:37

challenge in the first section of this, which

35:40

is where they'll stop, which is reading the

35:43

request

35:44

body. So basically it reads

35:48

a POST and you just find the body of

35:49

the

35:49

post and return it or something. Wow, those

35:54

were real close.

35:55

So these two might end up being neck and neck.

36:05

My bad, Gemini was waiting on me.

36:13

I don't know why because it did a couple of

36:15

challenges in a row and then it got to the

36:17

point

36:17

where it's starting to ask for directions again.

36:21

So I'm not sure why that is.

36:22

But this should be the last one.

36:37

Because Claude hasn't gotten stuck and started

36:39

asking, "hey, do you want me to go?

36:41

Do you want me to keep going or not?" So I'm not

36:46

sure why Gemini just randomly

36:49

decided that every time it got done with the

36:50

stage, it would stop because in the

36:53

instructions

36:53

I gave it, I explicitly said... Where's that?

37:02

In the instructions I gave it, I explicitly

37:05

said "for the rest of the session, at the

37:07

successful end of each stage, go to the next

37:11

stage." So I

37:12

don't know why it's suddenly decided it was

37:15

going to have to ask every time it got to the

37:18

end of a

37:18

stage, but we're about done. So shouldn't be

37:23

that big a deal.

37:25

I've got some other stuff I've been playing

37:37

with. I have one that mostly works.

37:45

That's a local version that runs on Ollama

37:48

against a 16 gigabyte. I forget.

37:55

It's a Ti card. I'm trying to remember

38:01

exactly which model it is.

38:02

60, 40, 90, I don't remember. It's a 16 gigabyte

38:06

card and I have a

38:10

Mistral model for coding that will run in it.
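For reference, talking to a model served locally by Ollama is just an HTTP call to its local API; this is a generic sketch with a placeholder model name, not the exact setup used here.

```python
# Generic sketch of hitting a local Ollama server; the model tag is a
# placeholder, not necessarily the one used in the video.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",   # Ollama's default local endpoint
    json={
        "model": "mistral",                   # placeholder model tag
        "prompt": "Write a minimal HTTP server in Python.",
        "stream": False,
    },
    timeout=600,                              # local generation can be slow
)
print(resp.json()["response"])
```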

38:13

It's slow compared to these guys,

38:19

but it works. So if you're willing to... Wow,

38:25

all right, at the same time.

38:27

So if you're willing to put up with waiting a

38:34

long time, if all you're going to do is say,

38:37

hey, go do this, and then you're going to go to

38:39

lunch and then come back and hopefully it will

38:41

be done, then a lot of this can get done on

38:45

local. A lot of this can get done on a local

39:03

machine

39:04

with an open weight model without having to pay

39:07

anything or

39:08

have your code go anywhere else. But I'm not

39:15

going to demo that live because it takes

39:19

forever. It takes several hours to get

39:26

not as far as these guys have gotten. But it is

39:32

possible to do it on a local model for whatever

39:36

that's worth. So Claude and Gemini are both

39:47

done. They finished at about the same time,

39:50

although Gemini got confused and started

39:53

asking me, hey, do you want me to keep going?

39:56

The last few stages, which cost it some time.

40:01

OpenAI is still stuck. So I think I'm going to

40:05

call that there. I will go pull up the...

40:20

I will go pull up the code that it wrote, or

40:22

that they wrote, and kind of look at that.

40:26

And I will also put some stuff together on

40:31

basically the configuration that I used to

40:35

actually

40:36

get all this set up and running and everything.

40:38

But for now, I'm going to go ahead and close it

40:41

here. I just wanted you all to see what these

40:45

models were capable of doing by themselves,

40:50

the kinds of things they get stuck on. I want you

40:53

to remember that this is literally

40:55

the easiest challenge that CodeCrafters has.

40:59

There are tons and tons and tons of

41:02

implementations

41:02

of HTTP daemons in Python out there. So this is

41:07

incredibly well-represented in their training set.

41:11

And before too long, I'll be doing the same

41:15

kind of thing again with more complicated

41:19

projects

41:20

that they're trying to do. And again, if you're

41:22

interested, if you're going to join CodeCrafters,

41:26

that's cool. I wouldn't recommend doing what I'm

41:29

doing and joining CodeCrafters and just

41:31

throwing the AI at it because you're not going

41:34

to learn anything. And the value of

41:36

CodeCrafters is, at least when a human does it,

41:38

when you get done with it, you actually

41:40

have some understanding of step-by-step how the

41:43

tool that you use every day actually works

41:47

under

41:47

the covers. But I'm going to call the OpenAI

41:50

thing because it's still on stage four and

41:52

everybody

41:52

else is done. So in the interest of

41:55

transparency, let's talk about the changes that

41:58

I made to the

41:59

OpenAI tool. Here's the repo that I've got.

42:06

And all I did is I went through -- what? Okay,

42:15

that'll work. So all I did is I went through

42:20

and

42:20

I found like the organizations and project

42:22

stuff. I pulled that out. I found the other

42:26

places where

42:27

things like organization were referenced. I

42:29

pulled those things out.

42:30

I ripped out all of this running sandbox stuff

42:39

and just told it to return sandbox.none.

42:43

And then here's more organization and project

42:52

stuff that I took out.

42:54

This access to token plane. I'm not exactly

42:57

sure what that does.

42:58

But once I pulled out the authorizations, the

43:01

project and the organization stuff,

43:03

I always get an error from it, so I pulled it

43:06

out too. So here's a whole chunk of stuff that

43:11

I pulled

43:12

out. And here's some more stuff I pulled out.

43:16

And like I said, you can go look at GitHub

43:24

and see this if you want. But it's all just

43:28

organization, organization and project.

43:32

These are all the things that seem to be

43:35

associated with the thing that makes you have

43:43

to do the

43:46

biometrics. Oh, let me. Oh, so I took out this

43:50

stuff and then I changed the setup variable

43:56

to just hard-code it to be false and

44:00

pulled out the organization thing down here.

44:03

So that's all I did is I just ripped out a

44:04

bunch of stuff that said organization and

44:10

projects and stuff. If that ended up breaking

44:17

the ability for it to actually do

44:20

good direction following, I don't know what to

44:26

tell you. I don't think that could have had

44:31

anything to do with it, but whatever. So if you

44:38

look at my GitHub, that Codex thing should

44:42

still

44:42

be there. On to the next thing. So this is the code

44:48

that Google Gemini CLI has generated.

44:53

I'm going to walk through this one first

44:55

because it's small enough to fit on one page.

44:57

We'll talk

44:58

about the Claude Code after because it's

45:01

quite a bit bigger. So this

45:06

whole

45:07

thing is basically just one big long if then

45:09

else statement. Is it '/'? Is it looking at

45:19

'/echo'? Is it looking at '/user-agent'? Remember

45:21

this user agent thing, because we'll see when

45:25

we go to

45:27

the Claude Code, Claude ended up losing that

45:30

block. I'm not sure what happened to it.

45:36

I guess it just forgot it needed to be doing

45:38

that anymore.

45:39

We've got this bit that starts with files. We've

45:43

got if the method is get, if the method is post,

45:47

it does two different things. And then if it

45:51

all falls down, it ends up with a 404 not found.

45:56

Some things to note. Notice that get and post

46:00

are checked for in

46:04

the

46:04

files. They're not checked for in the user

46:06

agent. They're not checked for in the echo.

46:09

They're not

46:09

checked for in here. So if you try to post

46:13

something to slash user agent, it will treat it

46:17

as if you

46:17

did a get or whatever. So that's not the way

46:21

it's supposed to work. It didn't do much error

46:27

checking. It didn't check for the wrong methods.

46:30

It didn't check to see most of the other kinds

46:36

of

46:36

things down here. We've got this bit where you

46:41

post a file and it writes the file. Notice it

46:46

didn't

46:46

put a try catch around this. It didn't check to

46:50

see if the directory existed, all that kind of

46:52

stuff.

46:52

If you try to send a file to a bad directory,

46:56

it'll probably just crash. Also, there's no

47:00

kind of sanitization or anything. It doesn't try

47:02

to look at the file system path and see that it's

47:04

not '../../etc/passwd'

47:06

or something that you're posting.

47:08

So it's not the brightest thing. In its defense,

47:14

the challenge didn't ask it for any of that

47:17

stuff.

47:18

There's nothing in the challenge that says, oh,

47:20

you have to do error detection or any of that

47:22

kind of

47:22

stuff. A good programmer will put that in. But

47:27

technically, it wasn't asked to do that.
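To make the description concrete, the shape being criticized looks roughly like the following; this is an illustrative reconstruction, not the code Gemini actually generated, and it deliberately reproduces the same gaps (most routes ignore the HTTP method, there is no error handling around file I/O, and no path sanitization). The port and files directory are placeholders.

```python
# Illustrative reconstruction of the "one big if/else" shape described above;
# NOT the actual generated code.
import socket

FILES_DIR = "/tmp/files"  # placeholder directory


def handle(conn):
    data = conn.recv(65536).decode()
    request_line = data.split("\r\n")[0]
    method, path, _ = request_line.split(" ")
    body = data.split("\r\n\r\n", 1)[-1]

    if path == "/":
        resp = "HTTP/1.1 200 OK\r\n\r\n"
    elif path.startswith("/echo/"):
        msg = path[len("/echo/"):]
        resp = f"HTTP/1.1 200 OK\r\nContent-Length: {len(msg)}\r\n\r\n{msg}"
    elif path == "/user-agent":
        ua = next(l.split(":", 1)[1].strip() for l in data.split("\r\n")
                  if l.lower().startswith("user-agent:"))
        resp = f"HTTP/1.1 200 OK\r\nContent-Length: {len(ua)}\r\n\r\n{ua}"
    elif path.startswith("/files/") and method == "GET":
        with open(FILES_DIR + "/" + path[len("/files/"):]) as f:   # no existence check
            content = f.read()
        resp = f"HTTP/1.1 200 OK\r\nContent-Length: {len(content)}\r\n\r\n{content}"
    elif path.startswith("/files/") and method == "POST":
        with open(FILES_DIR + "/" + path[len("/files/"):], "w") as f:  # no sanitization
            f.write(body)
        resp = "HTTP/1.1 201 Created\r\n\r\n"
    else:
        resp = "HTTP/1.1 404 Not Found\r\n\r\n"
    conn.sendall(resp.encode())


def main():
    srv = socket.create_server(("localhost", 4221))  # port is an assumption
    while True:
        conn, _ = srv.accept()
        handle(conn)
        conn.close()
```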

47:34

Let me run over here to the tests. I previously

47:41

looked at just the end state, the last

47:44

file for the main.py. In the Claude Code, I'm

47:48

going to look at both the end products.

47:52

For the Gemini one, I need to walk you

47:55

through the whole thing. This is the output of

48:00

Git log.

48:03

And it's from oldest to newest. So it's the

48:05

opposite of the direction Git log usually goes.

48:07

So we start off with this test case. And we

48:10

check and we try to connect to the server.

48:13

And it's a normal kind of, what are we going to

48:15

do to check for things, right?

48:17

And then the next check-in is for responding

48:22

with 200, okay, right?

48:25

So what it does is it deletes the bind to port

48:28

test that it had and replaces it with a respond

48:31

to 200

48:32

test, right? And then down here, the next

48:37

check-in to handle different paths,

48:39

it deletes the respond with 200 test and then

48:46

creates a root path test, right?

48:51

So this is the pattern: every time the Google bot is

48:55

given or the Gemini is given another thing

48:58

it's supposed to do, it dismantles the previous

49:01

version of the test. I guess this one, it didn't,

49:05

it wouldn't have added the echo, but the echo

49:06

gets taken away later.

49:08

And so it just keeps going and it goes, okay, I

49:18

don't need echo anymore.

49:19

I don't need not found anymore because right

49:23

now I'm trying to do concurrent connections.

49:26

So it just wipes out all of its old tests and

49:29

creates a new test every time, or pretty much

49:32

every time. So it's not the test that you

49:36

thought you were getting, the test that you

49:39

would think

49:39

you would need to be getting. You're not

49:41

getting because it's not going to keep anything.

49:43

You're

49:44

only testing the last one or two

49:46

things every time it has something else it

49:48

needs to do.
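For contrast, a test file that accumulates instead of churning might look something like this; send_request is a hypothetical helper and the port is an assumption, so this is a sketch of the idea, not what either agent wrote.

```python
# Sketch of a cumulative test file: each stage adds tests, none get deleted.
import socket


def send_request(raw: bytes, host="localhost", port=4221) -> str:
    # Hypothetical helper; opens a socket to the server under test and
    # returns the raw response text. The port is an assumption.
    with socket.create_connection((host, port)) as s:
        s.sendall(raw)
        return s.recv(65536).decode()


def test_root_returns_200():          # stage: respond with 200
    assert send_request(b"GET / HTTP/1.1\r\n\r\n").startswith("HTTP/1.1 200")


def test_unknown_path_returns_404():  # stage: handle different paths
    assert send_request(b"GET /nope HTTP/1.1\r\n\r\n").startswith("HTTP/1.1 404")


def test_echo_returns_body():         # stage: echo
    assert send_request(b"GET /echo/abc HTTP/1.1\r\n\r\n").endswith("abc")


def test_user_agent_is_echoed():      # stage: user-agent
    resp = send_request(b"GET /user-agent HTTP/1.1\r\nUser-Agent: foo/1.0\r\n\r\n")
    assert resp.endswith("foo/1.0")
```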

49:51

That's a thing that, like I said in the other

49:54

video, I'm confident that I can tell it, I can

49:57

make it

49:58

do the right thing, assuming that I know what

50:04

the right thing is and I crack the thing open

50:06

and

50:07

look at it and say, "Oh no, you need to do this

50:08

too and you need to do this too and you need to

50:10

do

50:10

this too." That kind of violates the spirit of

50:15

"Vibe Coding", right? The idea is basically what

50:21

can this thing do when you tell it, "do what you

50:26

ought to do to satisfy these conditions," right?

50:30

So those of you who potentially are going to

50:34

accuse me of just prompting it wrong,

50:37

yes, I know I could prompt it specifically for

50:42

the purpose of getting it to write the tests,

50:47

the way the tests need to be written, but the

50:50

purpose of this exercise is not,

50:53

you know, "if you micromanage it, how well does

50:57

it do?" It's "how well does it do,

50:59

given the same kind of instructions that you

51:01

would normally give to a human programmer?"

51:03

Okay, so now we're on to the main for Claude.

51:08

This code is much, I don't know, 8 or 10 pages

51:12

long,

51:12

it's much better laid out. The Parse HTTP

51:15

request is one piece, the handle request is

51:19

another piece,

51:20

it makes a lot more sense.

51:24

This basically pulls out the very top part,

51:27

ripped out the headers, returns the various

51:31

different chunks of the request,

51:35

which is kind of handy. It's a lot cleaner than

51:39

the way that Gemini did it.

51:41

Then we come down to this block right here, which is

51:47

pretty much the same as the block that was

51:52

in the center of the file in Gemini's, right?

51:54

It's a lot bigger, there's a lot more,

52:00

a lot more white space. It's a lot easier to

52:03

read, there are actually comments, which is

52:05

nice.

52:05

We've got our, you know, path is slash, we've

52:12

got echo, we've got files. Notice we don't have

52:15

the

52:16

user agent thing, it just went away. This is

52:20

cool, right? So this is actually a try/except that

52:27

responds with a 500 error there. Nobody told it to

52:30

do that, but it's the right thing to do.

52:32

It knows method not allowed, which is something

52:36

that the

52:40

challenge didn't tell it to do, but obviously

52:44

it's copied that out of some other

52:46

implementation

52:46

it's got, so that's kind of useful. So the

52:50

implementation that it wrote is much cleaner

52:55

and much easier to read. I like it a lot better,

52:59

but it did drop the user agent piece that

53:03

it was asked to do, and that Gemini left in.
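The structure being praised here, a separate parser plus a dispatcher wrapped in error handling that falls back to a 500, looks roughly like this; again, an illustrative sketch of the shape, not Claude's actual file.

```python
# Illustrative sketch of the parse/handle split described above; not the
# actual Claude-generated code.
def parse_http_request(raw: str) -> dict:
    """Split a raw request into method, path, version, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {"method": method, "path": path, "version": version,
            "headers": headers, "body": body}


def handle_request(raw: str) -> str:
    try:
        req = parse_http_request(raw)
        if req["method"] not in ("GET", "POST"):
            return "HTTP/1.1 405 Method Not Allowed\r\n\r\n"
        if req["path"] == "/":
            return "HTTP/1.1 200 OK\r\n\r\n"
        # ... echo, user-agent, and files routing would go here ...
        return "HTTP/1.1 404 Not Found\r\n\r\n"
    except Exception:
        # Catch-all so a malformed request gets a 500 instead of crashing.
        return "HTTP/1.1 500 Internal Server Error\r\n\r\n"
```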

53:08

So there is that issue, right? So

53:12

this is the test code from Claude. This is the

53:17

end result. So we've got our setups,

53:20

we've got our tear downs, that's pretty normal.

53:24

We test to make sure that the import runs.

53:27

I'm not sure exactly how useful that is,

53:31

because if the import fails, the thing will

53:34

throw right

53:35

at the very beginning. Test to make sure that

53:38

the main function exists. Sure. Test to make

53:42

sure

53:42

that the debug message gets printed. Why do we

53:46

need to do that? And like spin up a mock server

53:52

and everything, just to make sure that we get a

53:53

debug message? That's incredibly silly and not

53:56

useful. We test socket creation, that's

54:02

great. I guess the test that we accepted the

54:06

socket,

54:07

it's a little bit overkill. Test that we can

54:09

bind to a port. That, the test that we can bind

54:12

to a

54:12

port isn't actually going to tell you anything,

54:14

because just because you can bind to a port,

54:17

when you're running your test, doesn't

54:19

necessarily mean that port's going to be free

54:21

when you try to run it in the real world. So

54:23

this test is of limited utility.

54:32

We test that the request is valid. We get a

54:39

simple test. We make sure that we get the right

54:43

answers

54:43

back. That's fine. We test to make sure that

54:46

the root path returns what it should. That's

54:51

great.

54:52

We check to see if we get an invalid format, if

54:55

we send just invalid requests instead of 'GET

54:58

path HTTP', that's fantastic. Nobody asked us

55:01

to do that, but it's great that it did.

55:03

It checks for incomplete, which is cool. Checks

55:06

for empty, which is fine.

55:08

Now we're back to the same kind of test. So we're

55:14

doing the root path again.

55:16

We're doing other path here. This is the return

55:20

404. Now we're doing it with mock sockets,

55:23

instead of just looking strictly at the parsing

55:26

of the thing. We check our valid requests again.

55:30

This is with a mock socket instead of just

55:32

looking at the parsing the strings.

55:34

We look at handling empty data. We look at

55:39

client connections. We look at current

55:41

connections.

55:42

And we're done. This is all part of the threading

55:47

for the client connections test.

55:51

So if you come back here, you'll notice we've

55:54

got tests for the root path. We've got tests

55:57

for

55:57

404. We don't have tests for a user agent,

56:01

which would have failed because it pulled that

56:03

out.

56:04

We don't have tests for echo. We don't have

56:06

tests for files. The only thing it's testing

56:08

for

56:08

is does it handle the root path correctly? And

56:15

does it return 404 for random

56:18

crap? It's not testing any of the other things

56:22

that it was asked to do.

56:23

And not only that, but one of the things, at

56:27

least, that it was asked to do,

56:30

it stopped doing. But because the tests that it

56:35

wrote were completely focused around garbage,

56:38

like making sure that the debug print statement

56:40

happened, instead of making sure that the

56:44

giant if statement, which is the core of the

56:48

logic of the program, works, it didn't bother to write

56:52

tests

56:52

for that, really. But it cared about the debug

56:54

statement. This is the kind of ridiculous

56:56

things

56:56

that you get when you ask an AI to write tests,

57:00

unfortunately. I'm not too surprised. This is

57:06

pretty consistent with what I usually see when

57:08

I ask them to write tests. My guess is it's

57:12

just

57:12

because there's so much more example code than

57:15

test code in the world. And it's just its

57:19

sample

57:19

set that, however bad it might be at writing

57:23

real code, it's going to be much, much worse,

57:27

orders of magnitude worse, at writing tests. That just

57:30

seems to be the case.

57:32

All right, let me jump back up here. All right,

57:35

so I'm going to go

57:37

through

57:38

some setup stuff. So this is the prompt

57:44

that I used. You could pull that if you were

57:49

quick out of one of the other things, but I'll

57:50

leave this on screen for a while. But this is

57:53

the

57:53

first prompt that I use. And then once it

57:59

manages to correctly clone the repo,

58:02

I paste in the second chunk. And those are the

58:04

only prompts that I give it.

58:06

So from a consistency standpoint, this is what

58:09

I do.

58:09

So if you're wondering about what prompt I used,

58:15

this is the prompt that I used.

58:16

And then a couple other things. This is the

58:20

setup script I use. So this runs on

58:24

the host that the virtual machines are running

58:28

on top of or running inside of.

58:34

So basically this bit right here just cleans up

58:37

from the previous run.

58:39

It creates a new Git repository to track all

58:42

the things that the clients are going to do.

58:45

Create some directories. It grabs a bunch of

58:49

information from the Chrome instances that I

58:52

have

58:52

running. I'll have a script that starts those.

58:55

I'll show you in a second. It pulls out the

58:59

socket debugger URL. It creates the user name

59:03

and the Git email

59:05

from what the directory name is. It creates

59:14

this config.json file, which is what the prompt

59:17

tells

59:18

the agent to go look at. It cleans up

59:22

some of the temporary files. And then it goes

59:25

through this loop that basically just sits

59:29

there in every 90 seconds. It adds, by the way,

59:33

these

59:33

directories right here, these directories that

59:37

get made, those directories are each mounted

59:41

onto

59:42

each one of those, one of the virtual machines

59:44

mounts. So then this script sits above that

59:48

directory and basically just checks in every 90

59:51

seconds, everything that the clients have done.
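The watcher being described can be as simple as a loop around git; this is a sketch with a placeholder repository path and commit message, not the actual setup script.

```python
# Sketch of the periodic check-in loop described above; the repo path and
# commit message are placeholders, not taken from the actual script.
import subprocess
import time

REPO = "/srv/agent-workspaces"   # placeholder: directory the VMs mount into

while True:
    subprocess.run(["git", "-C", REPO, "add", "-A"], check=False)
    # Commit only when something actually changed; git exits non-zero otherwise.
    subprocess.run(
        ["git", "-C", REPO, "commit", "-m", "periodic snapshot"],
        check=False,
    )
    time.sleep(90)   # every 90 seconds, as described
```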

59:53

And so that way, if one of the machines crashes

59:54

or something, I can look at what the state of

59:57

it

59:57

was the last time that it wrote anything to

60:01

disk. This is the script that I use

60:05

to actually launch the VM. I'm using QEMU to

60:10

actually run the VMs. So I have a big

60:16

Ubuntu, pretty simple, Ubuntu 24 desktop box. I'm

60:21

using raw format because it's a lot faster,

60:24

six gigabytes of memory. There's nothing really

60:27

fancy about this. What I'm using for the

60:31

virtual

60:32

machine, I am using the serial console. If you

60:41

don't understand how the serial console works,

60:47

if you've got a GUI, just use the GUI. I

60:49

just do it so I don't have to walk over there

60:51

and

60:52

actually look at the console if I don't need to.

60:54

And then this is the script that I use to

60:56

actually

60:56

launch Chrome. I use a thing called screen. I

61:01

use screen for everything. Basically, it's a

61:05

thing

61:06

that creates a virtual terminal. You can run a

61:08

process inside of it and then you can detach

61:10

from

61:10

it and reattach to it later. It's really

61:15

convenient. And then basically exec, this arch,

61:18

ARM64 thing basically makes sure that it doesn't

61:21

accidentally run the x64

61:24

version

61:25

in Rosetta, which it generally doesn't, but

61:27

just in case. It launches Chrome, gives it a remote

61:30

debugging

61:30

port, gives it the directory where its profile

61:33

is supposed to live, which is named, this is

61:35

all

61:35

one line, which is named, Claude Code, whatever.

61:40

And then it gives it the URL,

61:44

for its route, for the first page for it to go

61:47

to. And then I have to manually set up,

61:49

after I'll manually log into the CodeCrafters

61:52

there, and then I have to get it set up on the

61:57

right challenge. And then I can run the setup

61:58

script that goes and pulls the information from

62:00

it. And then I think that's it. All right. So

62:03

those are all of the, those are all the various

62:05

scripts that I've used. And the code that I

62:08

changed. So that's stuff that might be useful

62:12

for you.

Summary

Carl, a software engineer with 36 years of experience, explores the current state of 'agentic coding' by testing how Claude, Gemini, and OpenAI models perform as autonomous agents. He tasks them with completing a CodeCrafters challenge to build an HTTP server in Python. While the AI agents successfully navigate the coding task, Carl highlights major weaknesses in their ability to write high-quality unit tests and their tendency to overlook specific requirements. The video also details a technical setup using debug ports and pre-authenticated browser sessions to bypass complex OAuth hurdles that current AI models cannot handle independently.
