How Data Warehouses and Web Search Show AI Is Not Unprecedented
Hi, I'm Carl from Internet of Bugs, and today's
books, oh, I could pick a lot of them.
Let's try these: Building Search Applications, Lucene in Action, and Foundations of Statistical Natural Language Processing.
All of those books are fairly old, so I don't know that I would particularly recommend any of them for someone who's just trying to get into the field.
But that's kind of the point of this video: a lot of the stuff we're talking about today is actually really old.
So for those of you that are new here, I've
been a software professional for 35 plus years
now, and this is my more technical channel
where I talk about understanding and applying
the technology and profession of software.
This video is about large language models as a
new form of big data search.
And just to clarify, when I say large language
models in this video, I'm just talking about
the inference part of large language models.
In other words, how large language models
behave when they're responding to a prompt.
There are a bunch of other aspects of LLMs,
like the way they're trained and agentic
loops and a bunch of stuff, that's all beyond
the scope of this video.
I'm just talking about inference here. Also, this is not an LLM internals video, so I'm going to be oversimplifying and talking at a very high level about what the LLMs actually do under the covers, to keep things moving.
If you want an explanation of how they work, see the LLM playlist from 3Blue1Brown; I'll put a link to it below.
So let me anchor this video with a quote from
the CEO of Anthropic, which is, "People
outside the field are often surprised and
alarmed to learn that we do not understand
how our own AI creations work.
They are right to be concerned.
This lack of understanding is essentially
unprecedented in the history of technology."
So this quote, as far as I'm concerned, is
utter BS.
This quote is from an essay, I'll put a link to
that below, and the quote surfaced recently
in an AI doomer video that was full of this
kind of garbage, I might end up doing a
response
video to that at some point.
It might not be clear, but there are actually
two somewhat unrelated piles of BS in this
quote.
The first is that "we don't understand" part, and
the second is the "essentially unprecedented
in the history of technology" part.
So let's talk about the "Do Not Understand"
part first.
So there are two high-level interpretations of
their "AI creations" in this quote, as he
calls them.
He and a lot of people in the AI space from
what I can tell pick a particular
interpretation
that seems overly complicated and unnecessary
to me.
I find that there's a much, much simpler
explanation.
The dispute can be looked at as a disagreement
about what does or doesn't count as AI
creations.
To them, it's a really, really unknowably
complicated process that can't be understood.
To me, it's a really, really simple process
that's executed against gigabytes or terabytes
of data that can be predicted.
We know how transformers work.
We know what the algorithms do.
We know that at a simple level, a prompt is
encoded into a bunch of vectors, those vectors
are multiplied against lots and lots of matrices
that were created during the training process
to produce a result, and then that result is
decoded back into words or images or whatever.
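Just to make that concrete, here's a minimal toy sketch of that encode, multiply, decode loop in Python. Everything in it is invented for illustration (the shapes, the two weight matrices, the tanh); it is not how any real model is structured, but it has the same basic shape of computation: vectors in, matrix multiplications against trained weights, a next-token pick out.

```python
import numpy as np

# Toy sketch only: random "trained" matrices standing in for the weights a
# real model would have learned. Shapes and layers are invented for clarity.
rng = np.random.default_rng(0)
VOCAB, DIM = 1000, 64
embed = rng.normal(size=(VOCAB, DIM))   # encode tokens into vectors
w1 = rng.normal(size=(DIM, DIM))        # "lots and lots of matrices"
w2 = rng.normal(size=(DIM, VOCAB))      # project back toward the vocabulary

def next_token_logits(token_ids):
    x = embed[token_ids]                # prompt encoded into vectors
    h = np.tanh(x @ w1)                 # multiplied against trained matrices
    return (h @ w2)[-1]                 # scores for every possible next token

logits = next_token_logits([5, 42, 7])
print(int(np.argmax(logits)))           # decode: pick the highest-scoring token
```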
The part that we supposedly don't understand is how the various matrix multiplications create the result, except we do, kind of, until we don't.
An additional thing you need to understand is that there's a step inside LLM inference where they literally insert randomness, although they call it temperature. Saying there's no way we can understand why the LLM output did what it did, when you know that you told it to make the output random, is just deliberately disingenuous.
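Here's a rough sketch of what that temperature step looks like. The exact sampling details vary by model and serving stack; this is just the general idea of deliberately injected randomness: the same scores can produce different tokens on different runs.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng()):
    # Higher temperature flattens the distribution, so identical prompts can
    # yield different outputs; temperature near zero approaches the argmax.
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())   # softmax, shifted for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

scores = [2.0, 1.0, 0.5]
print([sample_with_temperature(scores, temperature=1.5) for _ in range(5)])
```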
The problem isn't that we don't know the
mechanism of how this works.
The problem is that there's just way, way too
many of these matrices to keep track of.
If we make a very small, toy-sized neural network, we can figure out the whole thing. It's not very useful, but we can.
Anthropic has even been able to find the "Golden Gate Bridge" neurons in one of its older, smaller models; I'll put a link to that below.
The problem is not a lack of understanding,
despite what they say.
It's a lack of time.
It's not impossible to trace through how a particular prompt was altered at every step along the way as it passed through a trillion-parameter model. It would probably just take so long that, by the time we figured it out for one prompt, the model we just traced would have been deprecated already.
There's nothing unknowable, impossible, or
magical here.
It's just so large that it's impractical to do.
So another quote from the same article is, "If
an ordinary software program does something,
for example, a character in a video game says a
line of dialogue, or my food delivery
app allows me to tip my driver, it does those
things because a human specifically programmed
them in."
Now, that is a mind-bogglingly dumb example, and it bears no resemblance to what LLMs do.
A more honest analogy would be, "if 10,000
players are all connected to the same World of
Warcraft
server, and one of those players hears some
line of dialogue, there are potentially
thousands
of different dialogue lines that could have
been heard at that moment, depending on what
is happening, who was where, et cetera, and the
specific line of dialogue wasn't predetermined.
It was chosen as the result of a series of interacting algorithms operating on all the data relevant to the player's game state, and couldn't have been predicted in advance."
Likewise, "tipping your delivery driver" is very
straightforward, but for example, the process
by which the delivery app determines which
particular driver is going to be selected to
be the one that's assigned to your order, and
which particular location of the restaurant you picked your order is going to go to, can depend on dozens of factors, including where each
driver is, what other assignments they've been
given, what their rating is, what your
rating is, how long each driver takes to respond to the new order alert in their app, et cetera, et cetera.
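For the sake of illustration, a driver-assignment decision might boil down to something like the sketch below. The fields and weights are completely made up, not any real delivery app's logic; the point is only that a "simple" tap on the order button fans out into a scoring function over a lot of moving inputs.

```python
# Hypothetical scoring sketch; the fields and weights are invented.
def score_driver(driver):
    return (
        -2.0 * driver["distance_km"]        # closer to the restaurant is better
        - 1.0 * driver["active_orders"]     # less loaded is better
        + 0.5 * driver["rating"]            # higher-rated drivers preferred
        - 0.1 * driver["avg_response_secs"] # quicker to accept alerts
    )

drivers = [
    {"id": "a", "distance_km": 1.2, "active_orders": 2, "rating": 4.9, "avg_response_secs": 8},
    {"id": "b", "distance_km": 0.4, "active_orders": 3, "rating": 4.6, "avg_response_secs": 20},
]
print(max(drivers, key=score_driver)["id"])  # the driver who gets the order
```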
Once you get to a large-scale, complex software
system, with lots of moving parts, many things
just aren't predictable anymore.
There are far too many variables that go into
any given output, and a human just isn't
capable of tracking all of that data in
anything close to a reasonable time frame.
Which leads us to the second issue, which is
the "unprecedented" part.
They say this because they want to interpret
the trillions of parameters in the model as
processes or code that we don't understand,
whereas I say that those trillions of
parameters
are just a huge data set that's too big for a human to track in anything close to a reasonable time frame.
This kind of thing happens all the time.
I've done several big data-type projects, but I'm
going to talk vaguely without going
into any company secrets, about three of them
today.
You don't need to memorize these, this isn't a
test at the end, but just in case you're
curious, here's kind of a brief overview of the
different projects I'm going to talk
about a little.
The first one was for a company that remotely managed networks; we had a data warehouse that had network metric data, stuff like packets in, packets out, error rates, bytes, interface resets, that kind of thing, collected every five minutes or so for every one of millions of network interfaces across hundreds of customer companies.
The next one was a data lake project, more on
the difference in a bit, for a group that
basically did supply chain management that
tracked products and vendors and parts and
shipments and such.
And the last one was kind of a custom-built thing, not strictly a data warehouse, not even a SQL database, but just a ton of full-text indexes built with Apache Lucene, which is a full-text search library; it was used for classifying web content.
This is from a long time ago.
The goal here is to show you how a lot of the features in large language models that people talk about like they're magic are really just extensions or evolutions of stuff we've been doing for years, if not decades. Specifically, the inference step of current LLMs is basically just searching against a big, lossy copy of the information found on the World Wide Web.
I'm not going to go into that here because I already talked about it on my other channel; there's a link below if you haven't seen that video yet.
So first off, let's talk about lossy
compression, or more importantly, what happens
when you
need to present data that was stored in a lossy
way?
I'm talking about taking the information
encoded in an mp3 file, for example, and
turning it
into signals to send to a speaker, or taking a
jpeg file and making up the data for the
pixels that weren't written down when the file
was created, that kind of thing.
For example, there are a number of steps that happen to turn an mp3 file into speaker output, including Huffman decoding, re-quantization, the inverse modified discrete cosine transform, and the inverse polyphase filter bank; you don't even need to know what any of that crap is, because those steps don't matter for our purposes today.
It's enough to understand that there are a set
of steps that involve taking the relevant
chunk of data from the mp3, running a bunch of
algorithms and mathematical transformations
on it, and then outputting the result.
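If you want a tiny, hand-wavy picture of what "lossy" means, here's a sketch: quantize a signal coarsely when you store it, and what you reconstruct later is a plausible approximation, not the original. Real codecs like mp3 and jpeg are far more sophisticated than this, but the principle is the same.

```python
import numpy as np

signal = np.sin(np.linspace(0, 2 * np.pi, 16))  # the "original" data
stored = np.round(signal * 4) / 4               # keep only coarse steps to save space
error = np.max(np.abs(signal - stored))
print(error)  # the detail that was thrown away and can't be recovered
```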
The thing that's missing in the mp3, or jpeg, or mp4 case is that these formats are not really intended to be queried; you can just jump around in time, and you can zoom in on particular parts of the frame, but that's about it.
So let's talk about a warehouse full of network
metric data.
So as you can imagine, if you're collecting a
few numbers for each of tens or hundreds of thousands of routers and switch interfaces
every five minutes, and stuffing them all
into a big database, the size of that database
grows pretty big pretty quickly, and it gets
really expensive to store all of it.
So what we do is create what are called roll-up
tables or aggregate tables, where instead
of storing each data point, we store some
summary, like an average or a count or maximum
of the points for a given time range.
And the older the data are, the more you
summarize.
For example, six-month-old data might be stored
at one data point per hour, year-old
data might be stored at one data point every four hours, five-year-old data might be one data point per day, et cetera, et cetera.
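A roll-up like that is conceptually just a group-and-aggregate. Here's a minimal sketch with invented column names, turning five-minute samples into hourly summary rows; the real warehouse did this against enormous tables, but the shape of the operation is the same.

```python
import numpy as np
import pandas as pd

# Invented sample data: one day of five-minute samples for one interface.
rng = np.random.default_rng(1)
raw = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=288, freq="5min"),
    "packets_in": rng.integers(0, 10_000, size=288),
})

# The roll-up: 288 raw rows become 24 hourly summary rows.
hourly = raw.set_index("timestamp")["packets_in"].resample("1h").agg(["mean", "max", "count"])
print(hourly.head())
```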
Then when it's time to display or report on
that data, there are various options about
what transformations or formulas can be used to
do that.
Oddly enough, on that particular project, the software package that we used to display and create reports on that data was from a company called MicroStrategy, which today is better known for its position on cryptocurrency, but they actually used to make business intelligence software, believe it or not.
Sometimes the reconstruction was
straightforward, you know, show me the network
traffic from
last December.
There were a lot of more complicated things, like capacity planning, or "show me the network changes that are predicted to produce the best return on investment."
We would frequently get bug reports from
customers that they got a report that didn't
make sense
or that was truncated and by the time we went
and looked, it was working fine.
There are lots of things that could have gone wrong, from MicroStrategy bugs to a database table that was locked or corrupted at the time the report ran but was unlocked or overwritten by newly inserted data by the time we got to the bug ticket.
It was impossible to predict what it was that
the report was going to say and it was
impossible
to figure out what went wrong unless we stored
a whole lot more data than we could afford
to.
There was just no way we could have done the
calculations by hand to create any of those
reports manually, the amount of data was just
far, far too much for that.
The data lake project, which is kind of like a
data warehouse except the data is constrained
at query read time instead of at insert write
time, led to a similar but much worse problem.
To track the supply chain, you end up with
thousands of suppliers who each have their
own reports about their products, their receivables,
their statuses, each one potentially in a
unique format.
These all get uploaded into the data lake on a semi-regular, but not always predictable, basis.
Then you have to write code to tie all that
data together to make real reports out of
it.
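That "tie it all together" code is mostly schema mapping at read time. A minimal sketch, with invented supplier names and column mappings, might look like this; the real thing had thousands of suppliers and far messier formats.

```python
# Hypothetical column mappings; every supplier uploads its own format.
COLUMN_MAPS = {
    "supplier_a": {"part_no": "part_id", "qty_shipped": "quantity"},
    "supplier_b": {"PartNumber": "part_id", "Units": "quantity"},
}

def normalize(supplier, rows):
    # Rename each supplier's columns into the common reporting schema.
    mapping = COLUMN_MAPS[supplier]
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]

report = (normalize("supplier_a", [{"part_no": "X-1", "qty_shipped": 40}])
          + normalize("supplier_b", [{"PartNumber": "X-1", "Units": 25}]))
print(sum(r["quantity"] for r in report))  # an upstream format change quietly changes or breaks this total
```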
And any of those suppliers could have had any
number of bugs or upload problems or software
updates or changes that we hadn't been informed
of.
And a report that worked fine yesterday might
get you completely different results today
because somewhere the underlying data changed.
Maybe it changed formats, maybe it was a redaction,
maybe an error causing an upload
to get truncated.
And in theory, you could track down exactly what happened, but by the time you did, several other suppliers would probably have uploaded their new reports, and what you had found might not have been relevant anymore.
The full-text search project had a giant index that was trained on lots and lots of content that was categorized.
For example, all of Wikipedia had been run through the indexer.
Each page was classified by the hierarchy of
Wikipedia categories associated with those
pages.
Some of the other sites were either specific to a particular category, like news sites being classified by the section heading for each article, or single-subject sites, like a video game review site being classified entirely as gaming; hopefully you get the idea.
The idea was to take an arbitrary website that we had no knowledge about, without any human having to look at it, scrape the top few pages of the website, and use that as a query to see what category that website would most likely belong to in our index and what text it was most similar to.
Technically, it was to find what categories
content in the website in question was most
similar to.
We were running this trying to classify hundreds or thousands of websites every day without a human ever looking at them.
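Stripped way down, that classify-by-search approach looks something like the toy below. It scores an unknown page by word overlap with each category's indexed text; the real system used Lucene with proper relevance scoring, not this crude overlap count, and the categories and vocabulary here are invented.

```python
from collections import Counter

# Toy "index" of previously categorized content (invented examples).
index = {
    "gaming": Counter("boss raid loot respawn patch notes controller".split()),
    "finance": Counter("stock bond yield dividend earnings portfolio".split()),
}

def classify(page_text):
    # Use the scraped page text as a query and pick the best-matching category.
    words = Counter(page_text.lower().split())
    scores = {cat: sum((words & vocab).values()) for cat, vocab in index.items()}
    return max(scores, key=scores.get)

print(classify("Patch notes: the new raid boss drops better loot"))  # -> gaming
```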
This was in 2007, maybe 2008, so 10-ish years before the current transformer and attention technology happened.
So it understood words, kind of, and to some extent phrases, but it didn't have any way to get the idea of context.
In other words, the vector embeddings of ChatGPT now use the attention mechanism to tell the difference between whether the word "right" is meant in the sense of not incorrect or the sense of not left.
Back then we had no such mechanism, so it would
get thrown off a lot.
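Here's a contrived example of the kind of thing that would throw it off: with nothing but bag-of-words matching, two documents that use "right" in completely different senses look equally relevant to a query, because there's no notion of context.

```python
docs = {
    "directions": "turn right at the light then right again",
    "quiz answers": "the right answer is the right one on the sheet",
}
query = "the right choice"

for name, text in docs.items():
    overlap = len(set(query.split()) & set(text.split()))
    print(name, overlap)  # both score 2: "right" as a direction vs. "right" as correct
```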
It didn't matter how well you thought you had
the thing tuned, there was still a good
chance that the result that you got would be
garbage, because there was no way to predict
what the query content was going to be, and
again, you in theory could go step by step
and figure it out, but by the time you did that,
you'd have a whole new batch of errors
you had to deal with.
One thing we never quite got right was bands
and music.
You take pretty much any block of text, and
there's a decent chance that at some point,
somewhere in the world, some combination of a
band name, song title and album title,
and song lyrics would have a very similar set
of words to that block of text.
It was a completely unpredictable nightmare.
For all of these projects, we understood what
our code did, but there was just too much data changing too quickly to predict the results.
We could make the argument that we didn't
understand what it was that was happening,
but that just wasn't true.
We just couldn't predict, given the huge amount
of data we were working with, what the code
was going to turn up.
There are tons of other examples I could bring
up, like high-frequency trading algorithms
that operate on millisecond signals that are
pouring in way too fast for humans to
understand,
but the thing to understand is this isn't
unprecedented at all.
In fact, one of the biggest challenges in
software is figuring out how to deal with
situations
that are highly, highly data-dependent at
internet scale.
So quite often, I see situations like this
where the people in one specialty of the tech
industry just seem stunningly ignorant of the
obvious similarities between what they're
working on now and what is or has been going on
in other parts of the industry for a long
time.
And it's a real problem because it causes
people to reinvent the wheel when instead
they could just make use of the solutions that already exist elsewhere in the industry.
Let's be honest, in this particular case, it
might not be that they don't know about
the existing precedents for what they're doing,
so much as they just want to pretend they're
working on something world-changingly new to
get more money, but they just aren't.
This is a really hard problem, but it's just
not unprecedented.
It's not magic.
And it's certainly not an indication that
anything in the system is alive or thinking
or any of that crap.
I feel comfortable insisting that there's
nothing actually intelligent in these systems
because of four things.
The first is Occam's razor: the simplest explanation that covers all the observations is most likely the correct one.
We have a decades-old mechanism by which the
unpredictability of the output of LLMs can
be explained without the need for anything
unprecedented.
The second is what's called the Sagan standard.
It's that extraordinary claims require
extraordinary evidence.
The claim that LLMs are unprecedented in the
history of computing is extraordinary, and
the evidence for that claim, at least so far,
is not extraordinary at all.
The third is that the people pushing this
narrative have a huge financial incentive to
make the
public believe that these things are actually
worth trillions of dollars.
And lastly, because the AI industry has a
history of just lying about stuff, I made a
video
on that here.
So, we will see if I'm right.
I could be wrong, which you'll rarely hear the folks in the AI industry ever say.
But so far, I'm just not buying the
unprecedented argument.
I don't see anything here that can't be explained by just searching a lossily compressed image of the web and randomizing the output.
So don't let them fool you.
If they want to claim their work is wholly revolutionary, they need to be made to prove it; just pointing you at a chatbot is not good enough.