How Data Warehouses and Web Search Show AI Is Not Unprecedented
Hi, I'm Carl from Internet of Bugs, and today's
books, oh, I could pick a lot of them.
Let's try these: Building Search Applications, Lucene in Action, and Foundations of Statistical Natural Language Processing.
All of those books are fairly old, so I don't know that I would particularly recommend any of them for someone who's just trying to get into the field.
But that's kind of the point of this video: a lot of the stuff we're talking about today is actually really old.
So for those of you that are new here, I've
been a software professional for 35 plus years
now, and this is my more technical channel
where I talk about understanding and applying
the technology and profession of software.
This video is about large language models as a
new form of big data search.
And just to clarify, when I say large language
models in this video, I'm just talking about
the inference part of large language models.
In other words, how large language models
behave when they're responding to a prompt.
There are a bunch of other aspects of LLMs,
like the way they're trained and agentic
loops and a bunch of stuff, that's all beyond
the scope of this video.
I'm just talking about inference here. Also, this is not an LLM internals video, so I'm going to be oversimplifying and talking at a very high level about what the LLMs actually do under the covers, to keep things moving.
If you want an explanation of how they work, see the LLM playlist from 3Blue1Brown; I'll put a link to it below.
So let me anchor this video with a quote from
the CEO of Anthropic, which is, "People
outside the field are often surprised and
alarmed to learn that we do not understand
how our own AI creations work.
They are right to be concerned.
This lack of understanding is essentially
unprecedented in the history of technology."
So this quote, as far as I'm concerned, is
utter BS.
This quote is from an essay, I'll put a link to
that below, and the quote surfaced recently
in an AI doomer video that was full of this
kind of garbage, I might end up doing a
response
video to that at some point.
It might not be clear, but there are actually
two somewhat unrelated piles of BS in this
quote.
The first is that "we don't understand" part, and
the second is the "essentially unprecedented
in the history of technology" part.
So let's talk about the "Do Not Understand"
part first.
So there are two high-level interpretations of
their "AI creations" in this quote, as he
calls them.
He and a lot of people in the AI space from
what I can tell pick a particular
interpretation
that seems overly complicated and unnecessary
to me.
I find that there's a much, much simpler
explanation.
The dispute can be looked at as a disagreement
about what does or doesn't count as AI
creations.
To them, it's a really, really unknowably
complicated process that can't be understood.
To me, it's a really, really simple process
that's executed against gigabytes or terabytes
of data that can be predicted.
We know how transformers work.
We know what the algorithms do.
We know that at a simple level, a prompt is
encoded into a bunch of vectors, those vectors
are multiplied against lots and lots of matrices
that were created during the training process
to produce a result, and then that result is
decoded back into words or images or whatever.
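Just to make that concrete, here's a minimal toy sketch of that encode, multiply, decode loop in Python. Everything in it is invented for illustration (the shapes, the two weight matrices, the tanh); it is not how any real model is structured, but it has the same basic shape of computation: vectors in, matrix multiplications against trained weights, a next-token pick out.

```python
import numpy as np

# Toy sketch only: random "trained" matrices standing in for the weights a
# real model would have learned. Shapes and layers are invented for clarity.
rng = np.random.default_rng(0)
VOCAB, DIM = 1000, 64
embed = rng.normal(size=(VOCAB, DIM))   # encode tokens into vectors
w1 = rng.normal(size=(DIM, DIM))        # "lots and lots of matrices"
w2 = rng.normal(size=(DIM, VOCAB))      # project back toward the vocabulary

def next_token_logits(token_ids):
    x = embed[token_ids]                # prompt encoded into vectors
    h = np.tanh(x @ w1)                 # multiplied against trained matrices
    return (h @ w2)[-1]                 # scores for every possible next token

logits = next_token_logits([5, 42, 7])
print(int(np.argmax(logits)))           # decode: pick the highest-scoring token
```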
The part that we supposedly don't understand is how the various matrix multiplications create the result, except we do, kind of, until we don't.
An additional thing you need to understand is that there's a step inside LLM inference where they literally insert randomness, although they call it temperature. Saying there's no way we can understand why the LLM output did what it did, when you know that you told it to make the output random, is just deliberately disingenuous.
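Here's a rough sketch of what that temperature step looks like. The exact sampling details vary by model and serving stack; this is just the general idea of deliberately injected randomness: the same scores can produce different tokens on different runs.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng()):
    # Higher temperature flattens the distribution, so identical prompts can
    # yield different outputs; temperature near zero approaches the argmax.
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())   # softmax, shifted for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

scores = [2.0, 1.0, 0.5]
print([sample_with_temperature(scores, temperature=1.5) for _ in range(5)])
```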
The problem isn't that we don't know the
mechanism of how this works.
The problem is that there's just way, way too
many of these matrices to keep track of.
If we make a very small, toy-sized neural network, we can figure out the whole thing. It's not very useful, but we can.
Anthropic has even been able to find the "Golden Gate Bridge" neurons in one of its older, smaller models; I'll put a link to that below.
The problem is not a lack of understanding,
despite what they say.
It's a lack of time.
It's not impossible to trace through how a particular prompt was altered at every step along the way as it passed through a trillion-parameter model. It would probably just take so long that, by the time we figured it out for one prompt, the model we just traced would have been deprecated already.
There's nothing unknowable, impossible, or
magical here.
It's just so large that it's impractical to do.
So another quote from the same article is, "If
an ordinary software program does something,
for example, a character in a video game says a
line of dialogue, or my food delivery
app allows me to tip my driver, it does those
things because a human specifically programmed
them in."
Now, that is a mind-bogglingly dumb example, and it bears no resemblance to what LLMs do.
A more honest analogy would be, "if 10,000
players are all connected to the same World of
Warcraft
server, and one of those players hears some
line of dialogue, there are potentially
thousands
of different dialogue lines that could have
been heard at that moment, depending on what
is happening, who was where, et cetera, and the
specific line of dialogue wasn't predetermined.
It was chosen as the result of a series of interacting algorithms operating on all the data relevant to the player's game state, and couldn't have been predicted in advance."
Likewise, "tipping your delivery driver" is very
straightforward, but for example, the process
by which the delivery app determines which
particular driver is going to be selected to
be the one that's assigned to your order, and
which particular location of the restaurant you picked your order is going to go to, can depend on dozens of factors, including where each
driver is, what other assignments they've been
given, what their rating is, what your
rating is, how long each driver takes to respond to the new order alert in their app, et cetera, et cetera.
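For the sake of illustration, a driver-assignment decision might boil down to something like the sketch below. The fields and weights are completely made up, not any real delivery app's logic; the point is only that a "simple" tap on the order button fans out into a scoring function over a lot of moving inputs.

```python
# Hypothetical scoring sketch; the fields and weights are invented.
def score_driver(driver):
    return (
        -2.0 * driver["distance_km"]        # closer to the restaurant is better
        - 1.0 * driver["active_orders"]     # less loaded is better
        + 0.5 * driver["rating"]            # higher-rated drivers preferred
        - 0.1 * driver["avg_response_secs"] # quicker to accept alerts
    )

drivers = [
    {"id": "a", "distance_km": 1.2, "active_orders": 2, "rating": 4.9, "avg_response_secs": 8},
    {"id": "b", "distance_km": 0.4, "active_orders": 3, "rating": 4.6, "avg_response_secs": 20},
]
print(max(drivers, key=score_driver)["id"])  # the driver who gets the order
```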
Once you get to a large-scale, complex software
system, with lots of moving parts, many things
just aren't predictable anymore.
There are far too many variables that go into
any given output, and a human just isn't
capable of tracking all of that data in
anything close to a reasonable time frame.
Which leads us to the second issue, which is
the "unprecedented" part.
They say this because they want to interpret
the trillions of parameters in the model as
processes or code that we don't understand,
whereas I say that those trillions of
parameters
are just a huge data set that's too big for a human to track in anything close to a reasonable time frame.
This kind of thing happens all the time.
I've done several big data-type projects, but I'm
going to talk vaguely without going
into any company secrets, about three of them
today.
You don't need to memorize these, this isn't a
test at the end, but just in case you're
curious, here's kind of a brief overview of the
different projects I'm going to talk
about a little.
The first one was for a company that remotely managed networks; we had a data warehouse that had network metric data, stuff like packets in, packets out, error rates, bytes, interface resets, that kind of thing, collected every five minutes or so for every one of millions of network interfaces across hundreds of customer companies.
The next one was a data lake project, more on
the difference in a bit, for a group that
basically did supply chain management that
tracked products and vendors and parts and
shipments and such.
And the last one was kind of a custom-built thing, not strictly a data warehouse, not even a SQL database, but just a ton of full-text indexes built with Apache Lucene, which is a full-text search library; it was used for classifying web content.
This is from a long time ago.
The goal here is to show you how a lot of the features in large language models that people talk about like they're magic are really just extensions or evolutions of stuff we've been doing for years, if not decades. Specifically, the inference step of current LLMs is basically just searching against a big, lossy copy of the information found on the World Wide Web.
I'm not going to go into that here because I already talked about it on my other channel; there's a link below if you haven't seen that video yet.
So first off, let's talk about lossy
compression, or more importantly, what happens
when you
need to present data that was stored in a lossy
way?
I'm talking about taking the information
encoded in an mp3 file, for example, and
turning it
into signals to send to a speaker, or taking a
jpeg file and making up the data for the
pixels that weren't written down when the file
was created, that kind of thing.
For example, there are a number of steps that happen to turn an mp3 file into speaker output, including Huffman decoding, re-quantization, the inverse modified discrete cosine transform, and the inverse polyphase filter bank; you don't even need to know what any of that crap is, because those steps don't matter for our purposes today.
It's enough to understand that there are a set
of steps that involve taking the relevant
chunk of data from the mp3, running a bunch of
algorithms and mathematical transformations
on it, and then outputting the result.
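If you want a tiny, hand-wavy picture of what "lossy" means, here's a sketch: quantize a signal coarsely when you store it, and what you reconstruct later is a plausible approximation, not the original. Real codecs like mp3 and jpeg are far more sophisticated than this, but the principle is the same.

```python
import numpy as np

signal = np.sin(np.linspace(0, 2 * np.pi, 16))  # the "original" data
stored = np.round(signal * 4) / 4               # keep only coarse steps to save space
error = np.max(np.abs(signal - stored))
print(error)  # the detail that was thrown away and can't be recovered
```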
The thing that's missing in the mp3, or jpeg, or mp4 case is that these formats are not really intended to be queried; you can just jump around in time, and you can zoom in on particular parts of the frame, but that's about it.
So let's talk about a warehouse full of network
metric data.
So as you can imagine, if you're collecting a
few numbers for each of tens or hundreds of thousands of routers and switch interfaces
every five minutes, and stuffing them all
into a big database, the size of that database
grows pretty big pretty quickly, and it gets
really expensive to store all of it.
So what we do is create what are called roll-up
tables or aggregate tables, where instead
of storing each data point, we store some
summary, like an average or a count or maximum
of the points for a given time range.
And the older the data are, the more you
summarize.
For example, six-month-old data might be stored
at one data point per hour, year-old
data might be stored at one data point every four hours, five-year-old data might be one data point per day, et cetera, et cetera.
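A roll-up like that is conceptually just a group-and-aggregate. Here's a minimal sketch with invented column names, turning five-minute samples into hourly summary rows; the real warehouse did this against enormous tables, but the shape of the operation is the same.

```python
import numpy as np
import pandas as pd

# Invented sample data: one day of five-minute samples for one interface.
rng = np.random.default_rng(1)
raw = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=288, freq="5min"),
    "packets_in": rng.integers(0, 10_000, size=288),
})

# The roll-up: 288 raw rows become 24 hourly summary rows.
hourly = raw.set_index("timestamp")["packets_in"].resample("1h").agg(["mean", "max", "count"])
print(hourly.head())
```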
Then when it's time to display or report on
that data, there are various options about
what transformations or formulas can be used to
do that.
Oddly enough, on that particular project, the software package that we used to display and create reports on that data was from a company called MicroStrategy, which today is better known for its position on cryptocurrency, but they actually used to make business intelligence software, believe it or not.
Sometimes the reconstruction was
straightforward, you know, show me the network
traffic from
last December.
There were a lot of more complicated things, like capacity planning, or "show me the network changes that are predicted to produce the best return on investment."
We would frequently get bug reports from
customers that they got a report that didn't
make sense
or that was truncated and by the time we went
and looked, it was working fine.
There are lots of things that could have gone wrong, from MicroStrategy bugs to a database table that was locked or corrupted at the time the report ran but was unlocked or overwritten by newly inserted data by the time we got to the bug ticket.
It was impossible to predict what it was that
the report was going to say and it was
impossible
to figure out what went wrong unless we stored
a whole lot more data than we could afford
to.
There was just no way we could have done the
calculations by hand to create any of those
reports manually, the amount of data was just
far, far too much for that.
The data lake project, which is kind of like a
data warehouse except the data is constrained
at query read time instead of at insert write
time, led to a similar but much worse problem.
To track the supply chain, you end up with
thousands of suppliers who each have their
own reports about their products, their receivables,
their statuses, each one potentially in a
unique format.
These all get uploaded into the data lake on a semi-regular, but not always predictable, basis.
Then you have to write code to tie all that
data together to make real reports out of
it.
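That "tie it all together" code is mostly schema mapping at read time. A minimal sketch, with invented supplier names and column mappings, might look like this; the real thing had thousands of suppliers and far messier formats.

```python
# Hypothetical column mappings; every supplier uploads its own format.
COLUMN_MAPS = {
    "supplier_a": {"part_no": "part_id", "qty_shipped": "quantity"},
    "supplier_b": {"PartNumber": "part_id", "Units": "quantity"},
}

def normalize(supplier, rows):
    # Rename each supplier's columns into the common reporting schema.
    mapping = COLUMN_MAPS[supplier]
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]

report = (normalize("supplier_a", [{"part_no": "X-1", "qty_shipped": 40}])
          + normalize("supplier_b", [{"PartNumber": "X-1", "Units": 25}]))
print(sum(r["quantity"] for r in report))  # an upstream format change quietly changes or breaks this total
```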
And any of those suppliers could have had any
number of bugs or upload problems or software
updates or changes that we hadn't been informed
of.
And a report that worked fine yesterday might
get you completely different results today
because somewhere the underlying data changed.
Maybe it changed formats, maybe it was a redaction,
maybe an error causing an upload
to get truncated.
And in theory, you could track down exactly what happened, but by the time you did, several other suppliers would probably have uploaded their new reports, and what you had found might not have been relevant anymore.
The full-text search project had a giant index that was trained on lots and lots of content that was categorized.
For example, all of Wikipedia had been run through the indexer.
Each page was classified by the hierarchy of
Wikipedia categories associated with those
pages.
Some of the other sites were either specific to a particular category, like news sites being classified by the section heading for each article, or single-subject sites, like a video game review site being classified entirely as gaming; hopefully you get the idea.
The idea was to take an arbitrary website that we had no knowledge about, without any human having to look at it, scrape the top few pages of the website, and use that as a query to see what category that website would most likely belong to in our index and what text it was most similar to.
Technically, it was to find what categories
content in the website in question was most
similar to.
We were running this trying to classify hundreds or thousands of websites every day without a human ever looking at them.
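Stripped way down, that classify-by-search approach looks something like the toy below. It scores an unknown page by word overlap with each category's indexed text; the real system used Lucene with proper relevance scoring, not this crude overlap count, and the categories and vocabulary here are invented.

```python
from collections import Counter

# Toy "index" of previously categorized content (invented examples).
index = {
    "gaming": Counter("boss raid loot respawn patch notes controller".split()),
    "finance": Counter("stock bond yield dividend earnings portfolio".split()),
}

def classify(page_text):
    # Use the scraped page text as a query and pick the best-matching category.
    words = Counter(page_text.lower().split())
    scores = {cat: sum((words & vocab).values()) for cat, vocab in index.items()}
    return max(scores, key=scores.get)

print(classify("Patch notes: the new raid boss drops better loot"))  # -> gaming
```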
This was in 2007, maybe 2008, so 10-ish years before the current transformer and attention technology happened.
So it understood words, kind of, and to some extent phrases, but it didn't have any way to get the idea of context.
In other words, the vector embeddings of ChatGPT now use the attention mechanism to tell the difference between whether the word "right" is meant in the sense of not incorrect or the sense of not left.
Back then we had no such mechanism, so it would
get thrown off a lot.
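Here's a contrived example of the kind of thing that would throw it off: with nothing but bag-of-words matching, two documents that use "right" in completely different senses look equally relevant to a query, because there's no notion of context.

```python
docs = {
    "directions": "turn right at the light then right again",
    "quiz answers": "the right answer is the right one on the sheet",
}
query = "the right choice"

for name, text in docs.items():
    overlap = len(set(query.split()) & set(text.split()))
    print(name, overlap)  # both score 2: "right" as a direction vs. "right" as correct
```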
It didn't matter how well you thought you had
the thing tuned, there was still a good
chance that the result that you got would be
garbage, because there was no way to predict
what the query content was going to be, and
again, you in theory could go step by step
and figure it out, but by the time you did that,
you'd have a whole new batch of errors
you had to deal with.
One thing we never quite got right was bands
and music.
You take pretty much any block of text, and
there's a decent chance that at some point,
somewhere in the world, some combination of a
band name, song title and album title,
and song lyrics would have a very similar set
of words to that block of text.
It was a completely unpredictable nightmare.
For all of these projects, we understood what
our code did, but there was just too much data changing too quickly to predict the results.
We could make the argument that we didn't
understand what it was that was happening,
but that just wasn't true.
We just couldn't predict, given the huge amount
of data we were working with, what the code
was going to turn up.
There are tons of other examples I could bring
up, like high-frequency trading algorithms
that operate on millisecond signals that are
pouring in way too fast for humans to
understand,
but the thing to understand is this isn't
unprecedented at all.
In fact, one of the biggest challenges in
software is figuring out how to deal with
situations
that are highly, highly data-dependent at
internet scale.
So quite often, I see situations like this
where the people in one specialty of the tech
industry just seem stunningly ignorant of the
obvious similarities between what they're
working on now and what is or has been going on
in other parts of the industry for a long
time.
And it's a real problem because it causes
people to reinvent the wheel when instead
they could just make use of the solutions that already exist elsewhere in the industry.
Let's be honest, in this particular case, it
might not be that they don't know about
the existing precedents for what they're doing,
so much as they just want to pretend they're
working on something world-changingly new to
get more money, but they just aren't.
This is a really hard problem, but it's just
not unprecedented.
It's not magic.
And it's certainly not an indication that
anything in the system is alive or thinking
or any of that crap.
I feel comfortable insisting that there's
nothing actually intelligent in these systems
because of four things.
The first is Occam's razor: the simplest explanation that covers all the observations is most likely the correct one.
We have a decades-old mechanism by which the
unpredictability of the output of LLMs can
be explained without the need for anything
unprecedented.
The second is what's called the Sagan standard.
It's that extraordinary claims require
extraordinary evidence.
The claim that LLMs are unprecedented in the
history of computing is extraordinary, and
the evidence for that claim, at least so far,
is not extraordinary at all.
The third is that the people pushing this
narrative have a huge financial incentive to
make the
public believe that these things are actually
worth trillions of dollars.
And lastly, because the AI industry has a
history of just lying about stuff, I made a
video
on that here.
So, we will see if I'm right.
I could be wrong, which you'll rarely hear the folks in the AI industry ever say.
But so far, I'm just not buying the
unprecedented argument.
I don't see anything here that can't be explained by just searching a lossily compressed image of the web and randomizing the output.
So don't let them fool you.
If they want to claim their work is wholly revolutionary, they need to be made to prove it; just pointing you at a chatbot is not good enough.