HomeVideos

Technical Breakdown: How AI Agents Ignore 40 Years of Security Progress

Now Playing

Technical Breakdown: How AI Agents Ignore 40 Years of Security Progress

Transcript

361 segments

0:00

Welcome to today's video on the security

0:02

nightmare inherent to AI agents.

0:04

This is going to be kind of technical.

0:06

So if you're not a technical person or you want

0:07

an on technical version of this information

0:09

to give to a non-technical person in your life,

0:11

I have a greatly simplified version of this

0:12

video on my main channel and I've linked that

0:15

video below.

0:15

The last few weeks, everyone has been all abuzz

0:18

about AI agents.

0:19

Microsoft declared 2026 the year of the agent.

0:21

There was a recent tweet from a senior

0:23

developer at Google praising Claude Code that

0:25

went viral.

0:26

The Claude Code team has been releasing a bunch

0:28

of "how we use Claude Code agents ourselves"

0:30

posts.

0:31

It's just been a thing.

0:32

So let's talk briefly about the technology that

0:33

underlies virtually all modern computers,

0:35

which is called the "Von Newmann architecture."

0:37

This design is incredibly flexible and it

0:38

allows for easier construction of computer

0:40

hardware

0:41

than some alternative designs, but it contains

0:43

what has been referred to as the "original sin"

0:45

of computer security, which is that the code

0:47

the computer is supposed to be executing

0:48

and the data the computer is keeping in memory

0:50

are both stored in the same memory in the

0:52

same way and the CPU has no distinguishing

0:55

between instructions that it's supposed to

0:57

be following and the instructions that came

0:59

from malicious data acquired from an untrusted

1:01

source that really shouldn't be executed ever.

1:04

This original sin has led to pretty much every

1:05

remote code execution vulnerability since

1:07

the history of computers and as well as a lot

1:10

of other security weaknesses that have

1:12

plagued our industry over the decades.

1:14

There have been numerous attempts to build on

1:16

top of this architecture to try to mitigate

1:18

this problem, it can't be fixed, but it can be

1:20

made much much safer and harder for attackers

1:23

to exploit.

1:24

One of these mitigations built into modern

1:26

programming languages like Go or Rust, those

1:28

languages built in type safety to prevent data

1:30

of one type from being interpreted as

1:31

a different type.

1:32

They have strict built in bounds checking to

1:34

try to prevent overflows.

1:35

They have ownership and concurrency primitives

1:37

to try to prevent race conditions.

1:39

There are other features to help with this like

1:40

data execution prevention, which marks

1:42

pages and memories being non-executable.

1:44

There's address space layout randomization,

1:45

which means that attackers won't be able

1:47

to predict where code will be in memory to jump

1:49

to it, which makes it a lot harder for

1:50

them to know what to try to overwrite.

1:52

There are stack canaries, which are special

1:54

values placed on the stack that are verified

1:55

prior to executions so that the execution will

1:58

be stopped if the stack gets overwritten.

2:00

But I want to make sure you understand, those

2:01

mitigations, as sophisticated as they are and

2:03

as much work has gone into them over the last

2:05

60 years, are just mitigations, not fixes.

2:08

We still have frequent remote code execution

2:10

vulnerabilities being exploited in the wild.

2:12

People across the entire computing industry

2:14

have been working tirelessly for decades to

2:17

try to reduce the risk of bad things happening,

2:19

and yet our best security minds, even after

2:22

more than a half century of work, haven't been

2:23

able to alleviate this problem.

2:25

They've just managed to make it mostly tolerable

2:27

for most people most of the time.

2:29

Enter the AI companies who, in their infinite

2:31

wisdom, and I mean that in the most sarcastic

2:33

way possible

2:34

in case that wasn't clear, decided that the Von

2:36

Neumann architecture's lack of memory

2:38

safety was far, far too secure and structured

2:40

for what they had in mind, said that the

2:42

security industry "hold my beer" and then

2:43

implemented an architecture that not only makes

2:45

no distinction between instructions and data,

2:48

but requires the code and the data to be

2:49

combined into a single embedding matrix that

2:52

neither contains nor retains any information

2:54

about what started out as a prompt, what was

2:56

additional context and what was previously

2:58

output, so that during neural network

3:00

processing at the core of the architecture

3:02

there's literally

3:02

no difference between the way the instructions

3:04

and the data are treated, and they're all

3:06

run through one after the other.

3:09

To be clear, this design knowingly exempts

3:11

itself from all of the security advances since

3:13

the 1980s, and then it takes the initial

3:15

original sin of computing and it deliberately

3:18

makes

3:18

it even worse by erasing even the tiny

3:20

insufficient amount of distinction there used

3:22

to be between

3:23

code and data.

3:24

The way L.L.M.'s work, and this is not an L.L.M.

3:26

internals video, so I'm not going

3:27

to get too deep, but the L.L.M.'s take the

3:28

prompt and they run it through a bunch of

3:29

matrix math to decide what the next token

3:31

should be, and then it depends on that token

3:33

to what it operated on last time, or as that

3:34

concatenation through again to select the

3:36

next token and a piece of steps over and over.

3:38

When things get complicated is when there's not

3:40

only a prompt in the output, but a bunch

3:42

of other contexts as well.

3:44

So example, when your prompt says 'summarize'

3:45

this web page, that web page gets grabbed

3:47

from the internet and added to the context it

3:49

has run through the matrix operator, along

3:51

with the prompt.

3:52

When that happens, there's no distinction

3:54

between what started out as a prompt and what

3:55

is the context it's operating on, and things

3:57

get scary when that web page or email or

3:59

whatever

4:00

other content is looking at, that the L.L.M. is

4:02

summarizing or otherwise operating one

4:04

came from a source that isn't trusted, because

4:06

if that content contains instructions that

4:08

may be interpreted by the L.L.M., then the L.L.M.

4:10

has no inherent way to know the difference

4:12

between the prompt it was asked to follow in a

4:13

prompt-like language embedded in the content

4:15

it's processing.

4:16

When the context the L.L.M. is processing

4:18

contains prompt-sounding language that the L.L.M.

4:21

might execute, we call that a prompt injection.

4:23

And when the prompt being injected was pulled

4:25

in, not directly because the user gave it to

4:27

the L.L.M. themselves, but as a result of some

4:29

operation that the L.L.M. took, like

4:30

grabbing a web page, that's called an indirect

4:33

prompt injection.

4:34

And you know how untrusted content gets pulled

4:36

into the current context and combined with

4:38

the prompt that the user wants to have executed?

4:39

AI agents who can fetch web pages and pull

4:41

them in on demand, operate on other untrusted

4:43

data like emails or code dependencies. And

4:45

how do malicious prompt hidden in content, pull

4:48

in from untrusted sources, manage to

4:49

harm the user? Again, AI agents, who can leak

4:52

the user's private information to the internet,

4:54

write malware into the user's disks to delete

4:56

and encrypt the user's files to facilitate

4:57

a ransomware attack.

4:58

Basically, anything that the agent has the

5:00

power to do, a malicious prompt can force

5:02

the agent to use to act against you for benefit

5:04

of the hacker.

5:05

And now, knowing full well that this is the

5:06

security situation they built into the

5:08

technology

5:09

that they're stacked on, and even having been

5:10

forced to admit buried in paragraph 10

5:12

of a long, unpleasant jargon-heavy blog post,

5:15

that they expect this problem to be unsolved

5:17

for years,

5:18

The AI vendors like OpenAI are pushing ahead

5:20

with trying to get the public to adopt AI

5:22

agents and AI-enabled browsers with embedded

5:24

agents in them.

5:25

This tech knowingly and deliberately pulls

5:27

content in from third-party sites, visited

5:29

by the browser or accessed by the agent, and

5:31

combines that untrusted third-party content

5:34

with the agent's previous instructions in such

5:36

a way that the underlying tech has no

5:37

way of knowing which words turned vectors came

5:39

from which kind of source.

5:41

I've lived through a nightmare like this before.

5:43

In the days of the early web, when attackers

5:45

were discovering what the new browsers and

5:47

browser features like JavaScript allowed them

5:49

to get away with, and how they could

5:50

use that to steal data and money from users of

5:52

those browsers, and steal data and money

5:55

they did, lots of it for years.

5:57

I've seen what happens when someone pushes an

5:59

insecure architecture on the world before.

6:01

In the 1990s, the bad guys created a thing we

6:03

called "malvertizing," where an advertisement

6:06

gets included on a website and injects malware

6:08

onto unsuspecting users' computers when they

6:10

visit inside sites just the New York Times,

6:12

believe it or not.

6:13

Since then, a lot of infrastructure has been

6:14

created to detect ad content that might be

6:16

malicious and might take malicious actions and

6:18

prevent them from being spread through

6:20

the ad networks, and yet, the malvertizing

6:22

still exists.

6:23

Here's an article from a couple of weeks ago

6:25

describing a malvertizing scene found in

6:27

the wild, although it is much, much less common

6:29

than it used to be.

6:31

But verifying the integrity of ads in a web

6:33

page context is fairly well understood at

6:35

this point, and it has pretty clear warning

6:36

flags that you can look for.

6:38

JavaScript codes pretty much have semicolons in

6:40

it, so you can look for those, likewise,

6:41

HTML tags have angle brackets, aka less than

6:44

and greater than signs, you can look for those.

6:46

That way, you can see where you need to pay

6:48

more attention.

6:48

It's, I'm oversimplifying, but there are

6:50

markers that you can look at.

6:52

But there's no easy indications like that for

6:54

LLM prompt injection.

6:55

It's much harder to detect, since the malicious

6:57

content is just made of ordinary language

6:58

phrases.

7:00

As more and more people start using agents, the

7:02

malware makers are going to have a field

7:03

day coming up with various ways to steal users'

7:06

data and infect their machines.

7:08

How many times have you heard warnings about

7:10

how clicking on links can infect your computer?

7:12

Well, with indirect prompt injection, an agentic

7:15

web and skilled malware creators,

7:16

no clicking will be necessary.

7:19

This is unsafe by design, and they're pushing

7:20

it anyway.

7:21

I've put links to a bunch of different exploits

7:23

that have already been found below.

7:24

I predict that the next few years will be more

7:26

and more exports found, and the AI companies

7:28

will just try to patch them up, and then the

7:30

attackers will figure out how to work around

7:31

the patches and go find more.

7:34

So let's talk about some defenses that articles

7:36

claim will help.

7:37

So here's an article from Google.

7:38

It lists these defenses: prompt injection content

7:40

classifiers, security thought reinforcement,

7:43

mark down sanitation and suspicious URL

7:45

reduction, user confirmation frameworks, and

7:48

end user

7:48

security mitigation notifications.

7:50

So let's go through these one at a time.

7:53

Prompt injection classifiers means trying to

7:55

find phrases in content that might be malicious.

7:58

Good luck with that.

7:59

I've already explained how much harder it is to

8:00

detect bad prompts than bad JavaScript.

8:02

And history from watching how attackers have

8:03

exploited buffer overflow attacks tells us

8:05

that the bad guys will find ways around that

8:07

over and over again.

8:09

Security thought reinforcement means telling

8:11

the AI not to let itself get tricked by bad

8:13

content.

8:14

If you could trust the AI not to be tricked by

8:16

bad content, then we wouldn't be here

8:17

talking about this at all.

8:18

So good luck with that one too.

8:20

Markdown sanitation and suspicious URL

8:22

reduction is just a specific type of content

8:25

classifier.

8:26

User confirmation framework means putting pop-ups

8:28

in the workflow to tell the user c"lick here

8:29

if it's OK."

8:30

First off, this is just putting the blame on

8:32

the user.

8:32

And besides that, the attackers will be working

8:34

on tricking the AI to think it doesn't need

8:36

to ask the user.

8:37

And besides that, users just tend to stop

8:38

paying attention after all those "are-you-sure?"

8:41

dialogue boxes and just click them.

8:43

End user security mitigation notifications is

8:45

also just making the user responsible for

8:47

getting hacked.

8:48

So here's OpenAI's recommendations.

8:51

Quote, limit logged in access were possible,

8:53

carefully review confirmation requests, and

8:56

give agents explicit instructions when possible,

8:58

which is blame the user, blame the user, and

9:00

blame the user.

9:03

They just don't have any good way of preventing

9:05

this, but they're not going to let that stop

9:07

them from pushing it on every user they can.

9:09

There's another fundamental computing issue at

9:11

play here, which is called the halting problem.

9:14

I assume if you're watching this, you're

9:15

familiar with the halting problem, but to

9:16

refresh

9:16

your memory, it's not possible for one program

9:18

to predict whether or not a particular set

9:20

of inputs will cause another program to halt or

9:22

not.

9:23

Likewise, prompt injection is not a problem

9:25

you can use AI to fix. Asking one AI to read

9:27

through potentially malicious content before

9:29

that content gets given to another AI just won't

9:32

help, because the first AI is also vulnerable

9:34

to the same kind of prompt injection attacks

9:37

that malicious content might contain.

9:39

Now there are specific versions of this problem

9:41

that you and I as programmers have to deal

9:43

with that normal users do not, which is

9:44

malicious prompt injection hidden in code,

9:47

specifically

9:47

hidden in open source libraries that our code

9:50

pulls in.

9:50

As your Claud Code or whatever reached through

9:52

your whole code base, including whatever

9:54

dependencies

9:54

your code pulls in, Claud Code is vulnerable to

9:57

any sufficiently bad prompts embedded in

10:00

the code comments or even in the read me files.

10:02

It's a huge problem.

10:04

So what I do, and I made a video talking about

10:06

my setup, I'll put a link here.

10:08

I have a dedicated machine that I run my code

10:10

agents on.

10:11

It's an Intel based Mac mini that was my

10:12

desktop once upon a time.

10:14

I removed Mac OS completely put Linux on it,

10:16

and I use QEMU to run virtual machine, and

10:19

I run each AI agent in its own virtual machine.

10:22

I don't give agents my GitHub credentials or

10:24

any other credentials, I have all the agents

10:25

right to local clones of GitHub repositories,

10:28

and then I grab the code that the agents

10:30

generated

10:30

to review it manually, and then I push whatever

10:33

up to GitHub that I want from my machine.

10:36

And if anything goes wrong with the agent, I

10:37

can just revert the virtual machine back

10:39

to the last known good state.

10:41

It's a pain in the ass, no doubt, but it's not

10:43

nearly as much of a pain in the ass as

10:44

cleaning up a machine after it gets hacked,

10:46

trust me, I had to do that many times.

10:49

I'm sure I will be talking about this more.

10:51

This is going to be haunting us for years, but

10:53

I'm going to wrap this video up now.

10:54

I could probably keep complaining about this

10:56

for hours, but I want to get this published.

10:58

So until next time, thanks for watching, let's

11:01

be careful out there.

Interactive Summary

This video delves into the critical security vulnerabilities inherent in AI agents, tracing the issue back to the "original sin" of the Von Neumann architecture where code and data are stored indistinguishably. The speaker explains how Large Language Models (LLMs) exacerbate this problem by merging instructions and context into a single embedding matrix, leading to risks like indirect prompt injection. After critiquing industry-standard mitigations as largely ineffective or user-blaming, the video provides a technical guide on using isolated virtual machines to safely run AI agents in a development environment.

Suggested questions

5 ready-made prompts