Memory in LLMs: Weights and Activations - Jack Morris, Cornell

Transcript

0:13

[music]

0:20

Let's talk about ChatGPT. I think like

0:22

ChatGPT knows a lot of things. It's

0:25

actually extremely impressive. I use it

0:27

all the time. I used it to help prepare

0:29

for the presentation. I used it to cook

0:31

last night. Um, you know, I'm

0:34

growing increasingly dependent. And yet,

0:37

there's a lot that ChatGPT doesn't know.

0:39

Like, um, it didn't know why my speaker

0:41

pass wasn't working when I was trying to

0:43

get into the building and it uh, if you

0:46

ask it, did the Blue Jays win the World

0:48

Series? The answer is no. And I know

0:49

that because I watched the World Series,

0:51

but ChatGPT doesn't know that if you

0:52

don't enable web search because it has

0:54

something called a knowledge cutoff. So

0:55

all the training data is kind of

0:58

segmented by date and things after a

1:00

certain date are not known by ChatGPT

1:03

like unilaterally. Uh, if you ask ChatGPT,

1:06

help me optimize this kernel I wrote for

1:08

AMD GPUs, it's so bad at it. And I think

1:12

there are a few reasons for this. One, it's

1:13

really hard. Two uh there's not a lot of

1:16

data for it. But three I think it's more

1:19

that the data that does exist is such a

1:22

small portion of its training data that

1:23

it just like can't do it very well. And

1:25

so a lot of tasks like this, which I

1:27

would guess a lot of you face in your

1:29

jobs like the things that are more niche

1:31

or, as I call them here, long-tail, are really hard

1:34

for ChatGPT to do even if you say

1:36

please like please [laughter] or like I

1:39

want you to learn more about this or

1:41

practice like it can't learn more about

1:42

this, it can't practice it, it doesn't

1:44

know uh what to do when you ask it that

1:47

and uh yeah if you ask what are the

1:49

terms of our partnership agreement for

1:50

BlackRock. It doesn't know about your

1:51

company. Which shirts should I order

1:53

from Amazon? Implement a new feature

1:56

uh, in our company monorepo. Write an

1:59

email in my style. Diagnose this patient

2:02

given their history. What arguments did

2:04

the opposing counsel use in the Martinez

2:05

settlement negotiations? Uh is this

2:08

question already answered on our company

2:10

internal wiki? Like none of these things

2:12

are

2:13

possibly answered by ChatGPT because

2:15

they're not in the training data or

2:16

they're too niche or they require some

2:18

data that's not available to it. So I

2:21

think like the question I want to talk

2:22

about today is like what's the

2:24

[clears throat] right way to solve this

2:25

problem? Like if we want to build new

2:26

systems that actually know the things we

2:29

want them to know. Uh, how should we

2:30

build them? And I think like the way I

2:33

want to think about it is like how do we

2:35

take some knowledge and inject it into

2:38

the parameters of the model? Like what's

2:40

the right way to do this? And like the

2:42

way that I think about it and I think

2:44

the way this manifests in my research

2:45

and other people's research is there's

2:47

three ways. There's full context. you

2:50

can take as much stuff as you can and

2:52

cram it into the language model. There's

2:54

RAG, or retrieval-augmented generation,

2:56

where you have so many things that you

2:58

can't fit them all in and so you

2:59

retrieve the most useful ones and then

3:03

feed them in. And then there's this

3:05

third thing which I think is like really

3:07

new and no one is doing it yet which is

3:09

training things into weights. And I want

3:10

what I mostly want to talk about today

3:12

is like why I think we should be

3:14

training things into weights. But I'm

3:16

going to start with the other two. And

3:18

also, I guess like along the way, about

3:20

10% of the time, I'm going to be

3:22

shilling my own research, but I'm gonna

3:24

like try to be honest about it. And you

3:26

can just tune me out if you want.

3:28

So, I think like the easiest way to

3:30

solve these problems is to put

3:31

everything into context. It's like if

3:33

you work at a small company or um all

3:36

you care about is like maybe the 100

3:39

world series that have occurred, you can

3:41

kind of copy all the data and paste it

3:43

into ChatGPT or paste it into Grok or

3:45

whatever model you use. And that's

3:48

finite enough that the model can

3:50

understand.
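
To make the "full context" approach concrete, here is a minimal sketch (my own illustration, not from the talk): stuff every document into one prompt and count the tokens you will pay for on every query. The documents, the question, and the tiktoken encoding name are placeholders.

```python
# Minimal sketch of the "full context" approach: concatenate every document into
# one prompt. Works for small corpora, but the prompt (and therefore cost and
# latency) grows with the corpus. Encoding name and documents are placeholders.
import tiktoken

docs = [
    "World Series 2024 recap: ...",     # placeholder documents
    "World Series 2025 recap: ...",
]
question = "Did the Blue Jays win the World Series?"
prompt = "Answer using only the documents below.\n\n" + "\n\n".join(docs) + "\n\nQ: " + question

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(prompt)), "prompt tokens")  # paid (and re-processed) on every query
```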

3:51

And this like works pretty well. I

3:54

think that this is something that got

3:56

people really excited for a while a few

3:57

years ago. I have this example of like a

3:59

doctor answering a question from a

4:01

medical record. A medical record is

4:03

small enough that it can presumably be

4:05

like input into the context of the

4:07

model and the model can do pretty well.

4:09

I think there's a few problems with

4:10

this. Maybe the main one is just that

4:13

it's so expensive. Like if you do

4:15

anything like this in your day-to-day

4:16

workflow, you put like a ton of tokens

4:18

into context and start generating. I

4:20

mean, one, it's going to cost a lot of

4:22

money, like US dollars, but two, it's

4:25

just so slow. like um you know a few

4:29

months ago I was writing my thesis and I

4:31

wrote it myself but I did ask for some

4:34

feedback a few times from Claude and

4:36

like the second you paste it in, I don't

4:38

know, it's like

4:39

maybe 80 pages of text or something. Like,

4:42

as documents go, it's medium length. I

4:46

paste it into Claude, and the second you paste

4:47

into Claude, everything slows down by 10x

4:49

or something. I have this stat here that

4:51

if you have 1,000 tokens of context

4:54

we can output 10,000 tokens per second.

4:57

If you have 128k tokens of

5:01

context, we can output 130 tokens per

5:03

second. So that's like several orders of

5:05

magnitude slowdown and I think we've all

5:06

faced this. So it's very annoying and

5:08

it's hard to imagine how we can get

5:10

around this. Um I'll give you like the

5:13

quick background from the research world

5:16

which maybe people know which is this

5:18

inherent limitation of the models we use.

5:20

The models we use are transformers.

5:21

Transformers look like this. The real

5:23

problem with transformers comes in this

5:27

one little uh box right here called self

5:29

attention. The problem is that all of

5:32

the words that go into the transformer

5:33

need to look at each other. And this has

5:35

a quadratic dependency. So if there's

5:37

four words, four tokens, maybe the

5:39

matrix has 16 entries. If there are 12

5:41

tokens, there are 144 entries. And we

5:44

can manage this for a while, but at some

5:46

point it becomes infeasible. Like

5:48

especially from a memory perspective, we

5:50

can't

5:50

>> hold the mic. From a memory perspective,

5:53

we can't keep all these things in

5:54

context.
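
As a rough illustration of the quadratic cost just described, here is a minimal sketch (my own, not from the talk; the head dimension is arbitrary):

```python
# Self-attention scores form an n-by-n matrix: 16 entries for 4 tokens,
# 144 for 12, and it keeps growing quadratically from there.
import numpy as np

d = 64                                   # arbitrary head dimension
q = np.random.randn(12, d)
k = np.random.randn(12, d)
print((q @ k.T).shape)                   # (12, 12) -> 144 entries

for n in (4, 12, 1_000, 128_000):
    entries = n * n
    print(f"{n:>7} tokens -> {entries:,} score entries (~{entries * 4 / 1e9:.2f} GB in fp32, per head)")
```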

5:56

You might say, well, Jack, Grok 4 has

5:58

two million token context window. Yeah,

6:02

2 million token context window. It's

6:04

it's a very large number. Gemini 3

6:06

dropped uh during this conference and

6:08

Gemini 3 has 1 million token context

6:10

window. You also might ask why did

6:13

Gemini 3 not do a larger context window

6:15

even though it came after Grok? And I

6:18

think the reason is because there's

6:19

[clears throat] a difference between the

6:21

model not breaking when you put in that

6:23

many tokens and the model actually like

6:26

properly reasoning across many large

6:29

chunks of tokens. And I think the second

6:32

part we're still figuring out. I think

6:34

people have realized how to train models

6:36

that don't break with more and more

6:38

tokens, but we haven't really gotten to

6:40

the point where we can train models that

6:42

truly work as well on a million tokens

6:44

as they do on a thousand tokens. And if

6:47

you're more curious about this, there's

6:48

this really good report from Chroma

6:50

called context rot, um, about

6:54

how performance degrades when you add

6:57

just like other stuff into the context.

6:59

So this graph shows like the larger the

7:02

context grows even with the same finite

7:04

amount of relevant information, the LLMs

7:06

get worse and worse. And I think like

7:09

two things to observe here that I think

7:10

are interesting. One, Claude is the best

7:12

by far. I like graphs like this because

7:14

I feel like if you talk to people, a lot

7:16

of people think Claude is the best, but if

7:18

you measure on a lot of standard

7:20

benchmarks, it actually is worse. But

7:22

then you use it and you're like, "Oh,

7:23

something's better here." So, I like

7:24

this because it captures what people

7:26

actually say to me. But I also like it

7:28

because once you get here, the

7:29

performance is horrible. So, like,

7:31

if they enter a bunch of relevant stuff

7:34

that doesn't actually help you solve the

7:36

problem, once you get to 10 to the 4

7:38

tokens, which is 10,000, like the models

7:40

don't work at all. And even though

7:42

they're not breaking like they're

7:43

outputting

7:45

things that make sense and are

7:47

grammatical, they're not actually

7:49

solving the problem. So context rot is

7:51

a huge issue. Um

7:53

maybe like just anecdotally if you look

7:56

up there's a ton of people saying stuff

7:58

like this, like, oh, the context

7:59

window is so long why does it not

8:01

actually work? Or people think Claude

8:02

Code, when it fills up the context window,

8:04

sort of like stops working. Um there's a

8:07

ton of people working on these efficient

8:08

architectures that you might hear about

8:10

like, uh, Mamba, state space

8:12

models, linear attention, uh hybrid

8:14

attention, sparse attention, sliding

8:16

window. They're all more efficient, but

8:19

they basically have the same properties

8:20

of transformers. Like even if they can

8:23

operate uh in a faster time or with a

8:26

lower memory requirement, there's some

8:28

trade-off in terms of the performance

8:29

they give you. So even if you build a

8:31

linear attention model that can fit

8:34

infinite context, it's not good. Like

8:36

it's not going to be able to solve the

8:39

problem you have, which is how do I

8:40

actually like reason and get smarter

8:44

when I input more tokens into the model.

8:47

There's so many examples of this. I saw

8:50

this recent post. If you're like kind of

8:52

deep in the model architecture world,

8:54

maybe you've seen this. This is like a

8:56

couple weeks ago. There's new Chinese

8:57

model MiniMax M2. It's one of the

8:59

state-of-the-art open models. And a

9:02

bunch of the other Chinese labs have

9:03

been pushing these new hybrid

9:05

architectures that are like more

9:06

efficient and can take longer context.

9:09

And MiniMax M2 just didn't do that. They

9:10

just use sort of like the regular

9:12

quadratic attention that I was showing

9:13

you. And they have this really long

9:15

story about how they tried and tried and

9:18

it's basically just not worth it.

9:19

There's like an inherent trade-off between

9:21

how much computation you use and how

9:23

good the models are. And so even if you

9:26

can technically build a model that

9:27

doesn't break at millions of tokens,

9:30

it's not actually better for any of the

9:31

tasks they care about. So no one is

9:34

really doing this. And I think to

9:36

conclude, we think that like we're

9:38

pretty limited by the context window in

9:40

full context. There's like one systems

9:42

problem that you can't put millions of

9:44

tokens into the model. And then there's

9:46

another reasoning problem that even if

9:47

you can, the models don't actually get

9:49

better. So it's probably not practical.

9:52

And I think if you work in industry, I'm

9:55

sure you see document sets that are much

9:58

much larger, like on the order of I

10:00

don't know, billions to trillions of

10:02

tokens. And even though we're getting

10:04

better at training the models and the

10:06

system side, we're getting much better

10:07

at running them more efficiently,

10:09

faster, cheaper, we're not near fitting

10:13

trillions of tokens into a model. I

10:15

think like that's pretty far off. So I

10:17

would guess a lot of you are doing rag.

10:19

How many people in this room use or work

10:22

on a rag system on like a weekly basis?

10:25

That's actually pretty crazy. Okay, so

10:27

over half for sure. So now we're going

10:30

to talk about RAG. I'm going to talk

10:32

about why it's good and then I'll talk

10:33

about why I think um it's fundamentally

10:36

limited and the products of the future

10:40

will use something better than RAG.

10:44

So if you use RAG, you probably use a

10:46

vector database. There are many vector

10:47

databases. I think I know some of these.

10:51

Turbopuffer, now they're on S3, that's

10:55

Chroma. I made this slide. Uh,

10:59

Uh, there are many vector

11:01

databases. They all offer you like

11:02

slightly different trade-offs. They give

11:04

you your vectors for cheaper, faster.

11:07

Um, vector databases are the way that

11:09

memory works in production. If you're

11:11

using a company internal question

11:13

answering system, it's definitely

11:15

running on rag which is powered by a

11:17

vector database which stores embeddings.

11:20

ChatGPT memory, uh, uses embeddings. Uh, Andrej

11:25

Karpathy has this diagram from last year

11:28

two years ago actually of what an

11:31

operating system that runs on language

11:32

models would look like and he called

11:34

embeddings the file system of LLMs. Um,

11:38

I think that's true in today's terms.

11:39

Like today, November 22nd, 2025,

11:43

probably like if you think of what

11:45

you're working on as an operating

11:46

system, the file system is embeddings.

11:48

But I think embeddings are the file

11:50

system of today. And they're not the

11:51

file system of the future. And that's

11:53

what I'm going to talk about today.

11:56

I also want to point out that they're

11:58

extremely easy to use. Like any of the

12:00

tools I'm going to talk about at the end

12:01

of the talk that are like related to

12:03

training things into models are just

12:05

fundamentally harder. But this is just

12:07

really nice and we can all take a moment

12:09

to appreciate it. You just sort of take

12:11

your text and then you like run this and

12:14

that's all. It's five lines of

12:16

code. That's really, really good.
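
The slide's code isn't visible in the transcript, so here is a minimal stand-in for what an embed-and-search loop typically looks like (model name and documents are placeholders; a real vector database replaces the NumPy dot product):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")        # example embedding model
docs = ["refund policy ...", "onboarding guide ...", "World Series recap ..."]
doc_emb = model.encode(docs, normalize_embeddings=True)

query_emb = model.encode(["how do refunds work?"], normalize_embeddings=True)
scores = doc_emb @ query_emb.T                         # cosine similarity (vectors are normalized)
print(docs[int(scores.argmax())])                      # top-1 retrieved chunk
```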

12:18

Um, the problem is they just aren't

12:21

that good and they have a lot of

12:24

problems I think. Um, which I think

12:26

also, okay, how many people work on rag

12:28

or experience

12:30

a rag system and are satisfied

12:33

completely with [laughter] like

12:38

Okay, that's great. So, I think we're

12:39

all kind of in agreement here that maybe

12:40

there could be something more. Like,

12:43

even if we don't know exactly what it

12:44

is, there must be something else out

12:46

there. Um, I'll talk about a few

12:48

problems that I've run into in my own

12:49

research. So, let's like start with this

12:52

abstraction. So this is the vector

12:53

database that powers RAG. Every dot here

12:57

is supposed to be a document. So the

12:59

document goes through the LLM. The LLM

13:02

is trained to give you just this one

13:04

vector that represents the document. I

13:06

projected them down to two dimensions

13:07

for the slide, but each document is

13:10

one dot. Um if you actually look at

13:13

what's in the vector database, it looks

13:14

like this. So there are lots of numbers.

13:19

There's no one in the world who

13:21

can tell you what this means. Um,

13:24

one thing that I think is interesting is

13:27

that even though they look random and no

13:29

one can actually read them, if you build

13:31

a system to read them, it works pretty

13:33

well. So like if you're working with RAG

13:36

and you're sending someone embeddings,

13:37

you're actually sending them something

13:39

analogous to text. And I think this is

13:42

important because a lot of the actual

13:44

architectures like Turbopuffer, Pinecone,

13:46

what have you, they store only

13:49

embeddings. And so like maybe there's

13:50

this false premise that if you just send

13:52

them embeddings, there's no security

13:54

flaws. But actually an even slightly

13:57

motivated person can build this system

13:59

here, this white arrow on the right,

14:01

which takes the embedding and produces

14:03

maybe not the exact same text, but

14:05

something extremely close to it. This is

14:07

what I worked on for like about a year

14:09

of my PhD. This is an animation of like

14:13

so I type in this sentence it goes into

14:15

the embedding model it gets stored in

14:17

vector database and then we run this

14:19

it's like a multi-round correction

14:20

thing and then by the end we actually

14:22

can get most of it. I think our research shows that at

14:25

a certain length we can get 90% of text

14:27

back exactly from vector databases. So

14:29

the takeaway here is that there are no, uh,

14:32

security benefits to using a vector

14:34

database and also they're very hard to

14:36

run at scale. So this is like an

14:38

inherent problem for people with

14:39

sensitive data. That's the paper. Um

14:43

I think a second problem that I

14:45

personally have with embeddings is that

14:46

they're not adaptive. Like there's this

14:49

one universal sense of what the world

14:51

looks like that's captured in these

14:52

vectors and it's not adjustable based on

14:55

what you work on. So like to give you a

14:58

concrete example,

15:00

we embedded a bunch of databases or we

15:03

created a database of a bunch of

15:04

embeddings of credit card related

15:07

documents. I think we had half of them

15:09

that were from Mastercard and half of

15:11

them that were from Visa. But if you

15:13

actually look at where the embeddings

15:15

get stored, um I guess it's not in this

15:17

picture, but it's like only right here.

15:19

So even then there's this like really

15:21

large space of kind of all possible

15:23

semantics, and embeddings only represent like

15:26

one universal one if that makes sense.

15:28

So credit cards are actually clustered

15:30

in this like really small area and this

15:32

means search works badly. So like to give

15:37

you a concrete example, if you take

15:38

these two documents, one's from Visa,

15:40

one's from Mastercard, at least in the

15:42

system we were designing, like if you

15:44

search something that's about a Visa

15:45

query, you should never receive

15:47

Mastercard, but they're all so close to

15:49

each other that they're actually like

15:50

completely all jumbled together. And

15:52

this is just like a problem with all

15:54

conventional embedding mechanisms. So we

15:56

built this new model that lets you feed

15:59

in some like surrounding documents. So

16:01

like to give you an example, this is

16:03

kind of the first half of our model. We

16:05

would feed in a bunch of credit cards. I

16:07

guess I put Amex, but there actually was

16:09

no Amex when we did it. And, um, the

16:13

model kind of works like this. Like when

16:14

it produces the embedding for the text,

16:16

which is here, it also looks at a bunch

16:18

of surrounding documents. So it can kind

16:20

of know like okay, this text is about

16:22

Visa, but also all the other documents

16:24

are about either Visa or Mastercard. and

16:26

it gets trained so that it can like

16:28

dynamically adjust the embeddings based

16:30

on like the surrounding context. So I

16:33

thought this was cool and it works

16:35

better. So like in this Visa Mastercard

16:37

case the similarity between a Visa and

16:40

Mastercard is now 0.144, and I think

16:42

anything containing Visa has a much

16:44

higher similarity. So that's like maybe

16:46

correcting one small thing. Um it works

16:50

better on like out of domain stuff. So

16:51

we have, I forgot what the climate data

16:54

set is, a data set of arguments, a

16:56

data set of financial questions, and

16:59

then I think like scientific articles.

17:02

And I guess the point I'm making here is

17:04

that if you do this contextual thing,

17:05

embeddings work a bit better. So like if

17:07

you build them in a way that they can

17:09

dynamically adapt to the domain, they

17:11

can solve some problems, but I think at

17:14

the end of the day, they're still

17:15

embeddings. And so

17:18

>> yeah. Yeah.

17:19

>> Uh was this approach picked up by anyone

17:22

else? Do you know?

>> Yeah, I think we know

17:25

they're using it at OpenAI and Anthropic

17:27

like behind the scenes now the embedding

17:29

models are contextual. It's a pretty

17:31

it's kind of a free lunch like you add

17:33

these extra tokens. Uh, I guess it's

17:37

kind of hard to build like you have to

17:39

build this two-stage model and then uh

17:41

when you embed something you have to

17:42

grab some embeddings from the

17:44

surrounding documents. But once you

17:46

build it, it just works you know better

17:48

on like especially on longtail stuff. I

17:50

think if you look at um like MS Marco,

17:53

which is this large web-scale

17:55

embedding task, it really doesn't get

17:58

much better when you add surrounding

17:59

stuff because like it's already pretty

18:02

global if that makes sense. But if you

18:03

look at like really niche things, the

18:06

embeddings work a lot better. So yeah, I

18:07

know it's productionized at some other

18:10

companies. Um I think if you're actually

18:11

building an embedding model at your

18:13

company and you want to put effort into

18:15

making it better, this is probably like

18:17

the easiest way besides data. probably

18:19

the first way is data. Um
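
To make the corpus-conditioning idea concrete, here is a crude stand-in (my own toy, not the trained two-stage model described above): center each embedding on the mean of the surrounding corpus so that what all the credit-card documents share stops dominating the similarity. Model name and documents are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")        # example off-the-shelf embedder
corpus = [                                             # placeholder credit-card documents
    "Visa cardholder agreement: dispute resolution terms ...",
    "Visa interchange fee schedule for merchants ...",
    "Mastercard cardholder agreement: dispute resolution terms ...",
    "Mastercard interchange fee schedule for merchants ...",
]
emb = model.encode(corpus)                             # "global" embeddings

centroid = emb.mean(axis=0, keepdims=True)             # what every document in this corpus shares
ctx = emb - centroid                                   # keep only what distinguishes them
ctx = ctx / np.linalg.norm(ctx, axis=1, keepdims=True)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("global Visa-vs-Mastercard similarity:   ", round(cos(emb[0], emb[2]), 3))
print("corpus-centered Visa-vs-Mastercard sim: ", round(cos(ctx[0], ctx[2]), 3))
```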

18:22

there's some recent work that I think is

18:24

worth mentioning about like fundamental

18:26

limitations of embeddings and vector

18:28

databases and rag which says that like

18:30

if you it's not even really worth

18:33

explaining but there's like some uh

18:37

there are some relationships that

18:39

cannot be captured in a fixed

18:40

dimensional vector like you have to

18:42

reason about things to answer all

18:44

possible tasks. And this is this kind of

18:45

combinatorial setup where there are so

18:47

many possible relationships that the

18:49

embeddings simply can't store them. And

18:52

so like in theory embeddings are

18:54

obviously

18:56

not the best way to do all possible

18:58

relationships between text, but I think

19:01

everyone knows that rag has issues. Like

19:03

I'm glad that no one raised their hand

19:05

when I asked if anyone was going to like

19:07

really stand up and speak for rag. And

19:10

like, we can... I actually think this is a

19:12

hard point to make. Like everyone kind

19:14

of knows this, but it's hard to come up

19:16

with examples that retrieval can't solve

19:18

in practice. Like speaking as someone

19:20

who's recently sat down and tried to

19:22

make benchmarks for tasks that I care

19:24

about, it's hard to express questions

19:27

that require kind of this like latent

19:29

reasoning over multiple documents in a

19:31

way that rag doesn't solve, but they do

19:34

appear like um anything that kind of

19:37

requires association between multiple

19:39

things, or questions that are

19:41

like sort of implied but not explicitly

19:43

answered by the documents are just not

19:46

solvable by current techniques. And also

19:48

if you have interesting examples of this, I

19:50

would love to hear them after the

19:52

presentation. Um

19:55

hopefully I made my case that I think

19:58

RAG... Oh yeah, yeah, go ahead.

20:00

>> I'm curious if you would classify

20:02

agentic search as rag as well.

20:04

>> Yeah that's a good question. So I guess

20:06

the way I think about agentic search is it's like

20:09

a model that can grab and it makes a

20:11

bunch of queries in a row and then it

20:13

responds. Um

20:16

yeah that's that's a really good

20:17

question. I think

20:20

I think I wouldn't classify it as rag,

20:22

but I think it has different fundamental

20:25

limitations that are also tough to

20:27

overcome. Like what you what you would

20:29

really want is like a model that reads

20:31

the entire thing and reasons about every

20:34

possible relationship and then answers.

20:36

And I think in theory maybe you could

20:38

build an agentic rag system that does

20:40

that, but it would be very expensive.

20:43

>> Yeah. Because [clears throat]

20:44

isn't deep research in

20:48

the direction of that where it like goes

20:49

through and it pulls like hundreds or

20:50

thousands of sources but then what ends

20:52

up in context is only like a small

20:54

subset of those.

20:55

>> Yeah. Yeah. I actually think deep

20:57

research is like really in the right

20:58

direction. Like they're trying to do

21:01

something that's a little bit higher

21:02

level and requires a lot of compute.

21:05

Like I think um anything that works

21:07

better than rag is going to be more

21:09

expensive. And so like just the property

21:12

that it takes a while and it makes a lot

21:14

of searches and it thinks a lot is like

21:16

good. I think that there's probably a

21:20

more elegant way to train like a really

21:23

big kind of researchesque system, but I

21:26

think that's that's actually a a good

21:28

way of doing this and and not the one

21:30

that I'm talking about today, but it's

21:32

very promising as well. Like maybe the

21:34

question is like are you willing to

21:36

spend a lot of money at training time or

21:38

at inference time and deep research is

21:40

like kind of they don't spend a lot of

21:41

money to train it but it's willing to

21:42

wait for a long time at inference and I

21:44

think the things I'm going to talk about

21:46

today are more like if you're willing to

21:47

spend a lot of money up front and you

21:49

get a really smart model that knows all

21:51

your data already um and it's really

21:54

cheap to do inference. So it's like kind

21:56

of different sides of the same

21:57

trade-off. And I think like a good way

21:59

of thinking about these things is like

22:01

to get better models, you're going to

22:02

need to pay somewhere, you know, like

22:04

you're either going to need to like

22:06

generate better data and spend more time

22:07

on the data, you're going to need to

22:09

spend time on training, or you're going

22:10

to need to spend time on inference. And

22:12

a nice thing about RAG is it kind of

22:13

just works, but anything better will

22:15

cost more.

22:16

>> Yeah.

22:17

>> Getting back to your example of

22:18

Mastercard versus Visa. I don't know if

22:22

that's in your presentation later, but

22:23

what are your thoughts on using

22:24

knowledge graph for that as kind of

22:26

augmenting

22:28

>> It's a good question. Maybe ask me

22:30

after. I have to think about knowledge

22:32

graphs. It's been a while. Um, so let's

22:35

talk about how to learn things in

22:36

weights. Um, I think like the question

22:39

that we want to get at is like, okay, so

22:42

say we have the example I showed earlier

22:44

or like you have a small data set you

22:46

collected from your own personal work

22:48

and you want to teach it to the model.

22:49

It's one thing to put it into context

22:52

and that's a good way to get started and

22:54

if you don't have that much data,

22:55

that'll get you pretty far. But I think

22:57

we can do more. Like there's some

22:59

questions that even when your data is in

23:01

context, the model can't answer. And so

23:03

what I want us to think about is like

23:05

how can we inject things into a model uh

23:08

such that it learns better than in

23:09

context and also that it doesn't forget

23:11

everything that it already knows. Um I

23:14

want to point out something from my own

23:16

research which is that there is a fixed

23:17

capacity to language models. Like one

23:19

way to think about this is ChatGPT has like

23:22

only so many parameters. We have this

23:24

measurement that it can store 3.6 bits

23:27

per parameter. So like uh I think a

23:30

billion parameter model is like at 3.6

23:34

bits is maybe like four terabytes. Is

23:38

that right? 4 gigabytes what? Yeah,

23:41

thank you. Thank you. Um this is like

23:44

some information, but it's actually not that much.
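
A quick back-of-envelope using the ~3.6 bits-per-parameter figure (my own arithmetic; the model sizes are just examples):

```python
# Rough storage capacity implied by ~3.6 bits per parameter.
BITS_PER_PARAM = 3.6

for params in (1e9, 8e9, 70e9):
    capacity_bytes = params * BITS_PER_PARAM / 8
    print(f"{params / 1e9:>4.0f}B params -> ~{capacity_bytes / 1e9:.2f} GB of stored information")
```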

23:45

So the models basically

23:48

do their best to fit the training

23:50

distribution and they throw everything

23:52

else out. So like to give you a concrete

23:54

example this morning I was putting this

23:56

together. I asked Claude, "What is the

23:58

capital of the smallest province in

23:59

Tajjikstan?"

24:01

And it gave me a very detailed answer.

24:03

It's actually very impressive. No web

24:05

search. The model just knows this in its

24:07

parameters. I guess I'm arguing that

24:09

this is bad. Like if you want to build a

24:11

system that can answer really detailed

24:14

documentation questions for your

24:16

company, you don't need it to know what

24:19

the capital of the smallest province in

24:20

Tajikistan is. And since we know these

24:23

models have fixed capacity, I think that

24:25

this is bad. Like what we really want is

24:27

to know how to like find this kind of

24:29

thing and just like delete it and

24:30

replace it with the things we care

24:32

about. And I think that's like what

24:33

we're getting towards, but we don't 100%

24:34

know how to do that again. Sorry. So

24:37

when I originally put this talk

24:38

together, the way I was thinking of

24:39

explaining it is calling it a neural

24:41

file system. And then I decided to just

24:44

call it weights. I think it's easier to

24:45

understand, but this slide still says

24:47

neural file systems. Um so I think

24:51

there's a few questions here like we

24:52

want to train all our data into the

24:54

model. One question is like how do we

24:55

train it? Do we do RL? Do we do SFT? Uh

24:58

what's what even is the data? Um another

25:01

question is like out of uh all the

25:04

possible data what do we use? Do we just

25:07

like fine-tune directly on our data? Do

25:09

we try to generate more? I think my

25:11

argument is that we should try to

25:13

generate more and I'll show you why. And

25:15

then there's an architectural question.

25:17

Like I think for a long time, people

25:19

really cared in the machine learning

25:21

deep learning community about like what

25:23

architectures we should use. And then

25:24

for like what 8 years, everyone who

25:28

knows what they're doing has really just

25:29

been using transformers unless they're

25:30

trying to make them better. And I think

25:33

now in this world where we're trying to

25:35

train stuff into models like like if you

25:38

think of, okay, a world where each of us

25:41

has our own model, or maybe multiple

25:41

models and those models are getting

25:43

updated a lot. I think we start to care

25:45

about architecture again and I'll and

25:47

I'll tell you why and like what I think

25:48

the options are. [clears throat] So

25:50

first let's talk about learning.

25:53

Um

25:56

so I think like the mental [snorts]

25:57

model here which I mentioned before is

26:00

like we're trying to train the model to

26:03

learn the data as best as it possibly

26:05

can and it's going to be expensive. So

26:08

like, we didn't like RAG, but also RAG

26:10

didn't cost us very much money. I think

26:12

to do better than rag, we're gonna have

26:14

to like pay some GPU points and that's

26:18

just like the state of the world. Okay,

26:20

fine. So, this is our model. It's like

26:22

this homogeneous blob of data and this

26:25

is our data. So, like maybe we have the

26:27

Mastercard data set or maybe we collected

26:29

data about ourselves or maybe I uh

26:32

collected all my traces from coding in

26:34

November and December and I want to like

26:36

train the the model to learn my problems

26:38

better. What do I do? How do I actually

26:40

do this? Um

26:43

let's, like, start with the dumbest

26:45

possible approach and just like see what

26:47

happens. So say uh we start with a data

26:50

set and we just train on it.

26:54

Um like using I guess next token

26:56

prediction. So we actually ran this

27:00

little experiment. This is like uh 3M.

27:03

It's a company, they make duct tape, and, um,

27:08

this is like some financial reports. So

27:10

maybe like you're working there and you

27:12

really don't want to read all of this.

27:14

So you just want to ask the model to

27:16

like really understand this and be able

27:18

to answer questions and like rag isn't

27:20

really working cuz it's like this weird

27:22

structure and there's a lot of ways the

27:23

documents interrelate. Okay, cool. So

27:25

we're just going to like train the model

27:27

using next token prediction. See what

27:30

happens. You know what? Actually, even

27:32

if you don't train the whole model, um

27:35

you you still get zero loss. So the

27:37

model can perfectly memorize this entire

27:40

uh 3M 10K financial report. Um it's

27:44

extremely impressive.

27:46

Okay. So now let's talk to it. So so we

27:48

did this and then we didn't want to ask

27:50

anything that's like exactly present in

27:52

the document because we want to see if

27:53

the model's actually good. So we started

27:55

you know like everyone loves to test

27:56

poems. So we started with a poem. We

27:58

said can you write a poem about 3M in

28:01

fiscal year 2025?

28:04

So, register your bets. And what do you

28:06

think happened?

28:07

>> It's terrible.

28:09

>> It's terrible. Someone said it. It says

28:12

the passage of a passage is a poem. End

28:15

of sentence.

28:17

It's crazy. [laughter]

28:19

Yeah. So, now maybe we ask like why does

28:21

this happen and how do we fix it? So,

28:23

unfortunately, this doesn't work. And I

28:25

actually think this is like one of the

28:26

reasons why people haven't been doing

28:27

this yet is because the dumbest possible

28:29

approach usually does work in machine

28:31

learning. But in this case, we have to

28:33

do something a little bit more

28:34

sophisticated. Um,

28:37

so maybe take a second and think about

28:38

like what you would do. You're facing

28:39

this problem at work or in a side

28:41

project. Um, I think there's like two

28:44

things we need to fix. One is that um

28:48

the data is not exactly what we

28:52

want to train on, I think. And two is

28:55

that we probably don't want to update

28:57

the entire model because what we did

28:59

there was basically overwrite all the

29:02

you know stuff about Tajikistan and

29:03

everything else that's in the model with

29:05

just like this 3M knowledge and I think

29:08

that's like too specific and then the

29:09

model is just obsessed with 3M and it'll

29:12

only produce exact copy sentences from

29:15

the document. That's that's clearly too

29:17

much. So I think we need a better way to

29:19

update the model and we need a better

29:21

way to change the data.

29:23

Um, there's this pretty relevant work. I

29:26

don't know if you follow this like LLM

29:27

chat thing from Andrej Karpathy. Shout

29:30

out. I think it's very educational and

29:32

he had a really good question which is

29:34

like, he built this small LLM and trained

29:36

it from scratch and everything and then

29:38

he wanted to teach it about himself and

29:41

okay maybe the first thing you would try

29:43

is rag. You put like a little database

29:45

of information about yourself but that's

29:47

only scalable to a certain amount and

29:50

then the model can't really like combine

29:51

things. it can only kind of regurgitate

29:54

facts. And so he wants to actually teach

29:57

it properly, he says, meaning in

29:59

weights. And so notice he doesn't just

30:02

like take one example and and train the

30:04

model using next token prediction. He

30:06

does something a bit more complicated.

30:08

He like generates this task or you don't

30:11

have to care about the specifics, but

30:12

there's like basically he makes a

30:13

diverse training data set of examples

30:16

that look like the thing he cares about

30:18

and then trains on it. And if you go,

30:20

you can find this. It actually does work

30:21

pretty well, which is cool. So, he's

30:23

able to teach a novel behavior to a

30:25

model by like generating a lot of

30:27

synthetic data that looks like the

30:28

example he cares about and then

30:30

fine-tuning the model for a little bit

30:32

and it and it learns. There's a paper

30:34

that's really good uh that's from last

30:37

year from some folks at Stanford called

30:39

synthetic continued pre-training and

30:41

they have the same problem. So they have

30:42

like a really small data set and they

30:44

want to teach the model the data set

30:46

without like bricking the model

30:48

essentially and they have this kind of

30:51

fancy way of generating synthetic data

30:53

by extracting entities. But I think the

30:56

important part is that they take a small

30:58

data set and they generate like a very

31:00

large more diverse data set

31:03

representative of the thing that they

31:04

care about. And this is something that

31:06

like breaks the whole like conventional

31:08

machine learning paradigm. Like they

31:11

only have a small training data set. So

31:14

uh what you learn in school would tell

31:16

you that you would just like overfit and

31:17

there's nothing you can do. You just

31:18

have to go back and collect more data.

31:20

But actually because LLMs are so good

31:22

now we can do this second thing where we

31:25

generate like a much larger training

31:26

data set. It really contains only the

31:29

like facts that were present in the

31:31

original data but it's so large that you

31:33

can train a model on it. It's like very

31:34

strange. It only recently started

31:36

working, but it does work. I'll show you

31:38

some evidence. Um, the green line is

31:41

what happens when you do the dumb thing

31:43

before. So, you just like fine-tune the

31:45

model on the data. It actually starts at

31:47

the black line. [clears throat] So,

31:48

surprisingly, it actually gets worse.

31:50

So, it like memorizes the data so well

31:52

that it can't answer any slightly

31:54

different questions about it. Um the

31:56

thing they do they have like two

31:58

different ways of doing it but it's

31:59

basically like generating lots of

32:00

synthetic data that describes the things

32:02

in the original data set. It works very

32:05

well like at some scale I guess 100

32:08

million tokens close to a billion they

32:10

can actually outperform GPT-4 on this

32:12

data set which is really cool. So I

32:14

think like the takeaway here is

32:17

even though you don't have a lot of

32:18

data, if you're willing to generate like

32:20

a large synthetic data set that

32:22

describes the data you have, you can

32:24

actually train a model on it and it

32:26

works really well.
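
A minimal sketch of the general recipe (my own simplification, not the exact method of any of the papers mentioned): for each chunk of the small source corpus, ask a capable LLM for diverse rewrites and question-answer pairs, then collect the outputs into a much larger fine-tuning set. The client, model name, file path, and prompts are placeholders.

```python
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    "Rewrite the passage below in your own words.",
    "Write three question-answer pairs grounded only in the passage below.",
    "Explain the passage below to a colleague who hasn't read the document.",
]

def expand(chunk: str) -> list[str]:
    out = []
    for p in PROMPTS:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",          # any capable generator model
            messages=[{"role": "user", "content": f"{p}\n\n{chunk}"}],
        )
        out.append(resp.choices[0].message.content)
    return out

source_chunks = open("3m_10k.txt").read().split("\n\n")   # hypothetical small corpus
synthetic_corpus = [s for chunk in source_chunks for s in expand(chunk)]
# synthetic_corpus is the (much larger) text you actually fine-tune on.
```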

32:28

There's a bunch of other papers that do

32:29

this. One is called active reading. Um

32:32

they basically ask the LLM what

32:35

types of things should we generate? Then

32:36

they generate from it. There is

32:38

self-study which is from this cartridges

32:40

paper which is more like question

32:41

answering like asking the model to like

32:43

quiz itself. And then there's this

32:45

rephrasing the web thing. I didn't

32:47

realize my...

32:50

whatever. A rephrasing-the-web thing

32:52

where they kind of like rephrase an

32:53

entire pre-training data set. So this

32:55

actually works at scale in kind of a

32:57

surprising way. Um and there's a lot

32:59

more work in this direction. So I'm

33:00

really excited about this like and I'm

33:02

kind of monitoring it. There's a company

33:03

called Datology that's doing this

33:05

really well. They're like generating

33:07

really highquality synthetic data. It's

33:09

just like not something that used to be

33:11

possible until very recently when LLMs

33:14

crossed some threshold that they're like

33:16

able to generate data that's good enough

33:18

to actually train themselves on. Oh,

33:20

there's actually something pretty cool.

33:21

It's not in the slide. It's called self-

33:23

adapting language models, self-edit.

33:27

It's called SEAL. S E A L. And they uh

33:30

ask the model what data to generate to

33:33

make itself better. And under some, like,

33:35

constrained scenarios, this is actually

33:36

working. So that's like actually quite

33:38

bizarre. Um, and like obviously doesn't

33:41

work infinitely or else they would have

33:43

caused an intelligence explosion. But

33:45

the fact that it works at all is like

33:47

really remarkable and I think like worth

33:49

monitoring. So

33:52

in conclusion for this section, we want

33:54

to train things into weights. We can

33:55

generate large synthetic data sets that

33:57

describe pretty small data sets, and

34:00

it works fine. Um, now I think the money

34:04

question here is like how do we inject

34:06

the information into the model? I think

34:08

before I mentioned we were training all

34:09

the parameters and we tried it and it

34:11

worked really badly. And this is a

34:14

problem that's been around for a long

34:17

time. It's called like catastrophic

34:18

forgetting. Um, even in old school

34:20

machine learning like you train a model

34:22

to recognize handwritten digits and then

34:24

you train a model to recognize house

34:26

numbers and it's no longer able to

34:27

recognize handwritten digits. This is

34:29

like a very well-known problem. There's

34:31

a lot of like theory and like approaches

34:33

proposed to solve it, but no one really

34:35

knows how to solve it. It's very very

34:36

hard. Um,

34:38

but I think there are some easy ways we

34:41

can get around it in the conventional

34:43

paradigm where we have like this big

34:45

pre-trained ChatGPT transformer. Uh,

34:48

instead of retraining the entire model,

34:50

there's a few different ways we can do

34:51

it. I mean, the first one is retraining

34:54

the entire model. So, the things we're

34:55

training I'm highlighting in blue here.

34:57

That's like if we take our transformer

34:58

and we update all the parameters, we're

35:01

probably going to forget stuff. Um,

35:03

there's another one that's pretty cool

35:05

called prefix tuning where you just

35:06

train the KV cache. Um, I mean, I'll

35:10

like skip the details for now, but ask

35:11

me if you have questions. Prefix tuning

35:13

is cool. Um, another way is since a lot

35:16

of these models are like mixture-of-

35:17

experts and they have this MLP layer in

35:20

them, you can add another part to the

35:22

MLP that is optionally routed to and

35:25

used and that's like pretty scalable. I

35:27

think people try this. Um, there's

35:30

another approach where, instead of

35:32

like another MLP, you build

35:33

this thing called a memory layer which

35:35

is like a big lookup table. I think

35:37

memory layers are really good. And let

35:39

me pause and say now this part of the

35:41

talk is getting close to purely

35:43

speculative. This is like the things

35:45

that are like they exist and like

35:47

someone's going to do this and someone's

35:49

going to use like one of them but I

35:50

really don't know what the right answer

35:51

is. Um, another one is called LoRA, low-

35:54

rank adaptation. You probably heard of

35:56

this very like hot topic. Um they kind

35:59

of, like, train a small matrix or a

36:02

few small matrices to adapt the linear

36:05

layers. So it's like if your model's 10

36:07

billion parameters, maybe you train 10

36:09

million parameters that can like control

36:11

it. Um,
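
For reference, a minimal LoRA setup with the PEFT library looks roughly like this (model name and configuration values are illustrative, not a recommendation):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")   # placeholder base model
config = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # which linear layers get the low-rank adapters
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # typically well under 1% of the base parameters
```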

36:14

and if we look at them together, maybe

36:16

it's not super obvious which thing would

36:18

work best. Like ICL is just like putting

36:20

stuff in context. So we have in-context,

36:23

RAG, full fine-tuning. We could do the

36:25

memory layers in the MLP, cartridges, which is

36:28

prefix tuning, and we could do LoRA.

36:30

We could also add something to the

36:32

mixture of experts. I think to me it's

36:34

not like clear and I'm not positive that

36:36

it matters which one we do. Like I think

36:39

the main thing is like we have this

36:40

giant model and we're adding a tiny bit

36:43

to it to control it and training only

36:45

those parameters. That way we retain

36:47

most of the information in the model. I

36:49

think that's like the most important

36:51

part. But I think for the end of this

36:53

talk I'll just talk through like what I

36:56

think people are doing in this space up

36:57

to like the minute and then you can make

37:00

up your own mind what you think the

37:01

right way to do it is. So let's talk for

37:03

a second about what properties we want.

37:05

I think we want, um, our changes

37:08

to the model to be very small. Like say

37:10

you're serving a model to each person.

37:13

You actually can do it, but you have to

37:15

use one of these like parameter

37:16

efficient methods. If you're trying to

37:17

fine-tune a new Kimi for each person,

37:20

Kimi's like a terabyte. It's a

37:21

trillion parameters. It's just like not

37:23

even storable, let alone servable. Um

37:27

we want something that's resistant to

37:28

forgetting like we said. So it would be

37:30

nice to have an architectural change

37:32

that's both small and makes the minimal

37:34

impact on the model as it is now because

37:36

the model as it is now works really

37:38

well. Um and preferably high capacity I

37:42

think like changes that are really

37:44

expressive and can capture a lot of

37:46

facts and few parameters are the ones

37:48

that we prefer and we want to be able to

37:50

do inference quickly. As like a small

37:53

aside, you actually can do this quickly

37:55

with a lot of um a lot of these methods.

37:58

Like maybe some of you have seen Tinker,

38:00

this new training API from Thinking

38:01

Machines. It's basically all predicated

38:04

on this idea that you can serve

38:06

one model per person as long as you do

38:09

LoRA and batch the LoRAs. And there's

38:12

like it's actually most interesting from

38:13

systems perspective. There's like ways

38:15

you can train it and train each one

38:16

separately and there's ways you can do

38:18

inference and it basically has no cost.

38:20

um which is really interesting just

38:22

because like the base model doesn't

38:23

change and we all share the same base

38:25

model. So all the ideas I'm going to

38:27

talk about are kind of like in the same

38:29

direction as Tinker. Um

38:32

we can think about like whether certain

38:35

methods might learn more or forget more.

38:38

Um so this is comparing Lorra to full

38:41

fine-tuning. So LoRA makes a tiny change

38:43

to the model. Full fine-tuning updates

38:45

the entire model. And on two different

38:47

settings, they show, like, LoRA here is

38:50

like purplish or pink. The pink one's a

38:52

little bit smaller capacity. Um, it

38:55

basically doesn't do as well. At least

38:56

when you're doing SFT, uh, LoRA can

38:59

learn a little bit less, but also if we

39:02

look at how much it's degrading, it

39:04

forgets less. So this paper is called

39:06

LoRA learns less and forgets less. And it's

39:10

actually a very nice finding. So like if

39:12

you want to at least teach a model via

39:14

SFT and you use one of these low rank or

39:17

parameter efficient methods like all the

39:19

ones I described, they're going to make

39:20

a small change to the model in a way

39:22

that it's probably not going to be as

39:24

expressive as full fine tuning, but it

39:25

also doesn't destroy a lot of the

39:27

knowledge. Um here's something going the

39:30

exact opposite direction. This is the

39:31

results from Thinking Machines showing

39:33

that they think LoRA is about as good

39:35

as full fine tuning, which is

39:37

interesting because they're doing RL. So

39:40

it's like maybe dependent on the

39:42

training mechanism like if you do RL

39:44

maybe it makes small updates and um you

39:47

can do LoRA, you can do memory layers, but

39:50

for SFT it really has to store a lot of

39:52

information so you really have to do

39:53

full fine tuning. I think that's the

39:55

takeaway I have, and I actually have

39:57

a paper that's like kind of blocked for

39:59

legal reasons but coming out soon. Um

40:02

here's one result from my paper that's

40:04

relevant to this. So we have this like

40:06

tiny LoRA thing that's even smaller than

40:08

LoRA. Well, there's actually LoRA-XS,

40:11

which already exists and then we made

40:12

tiny LoRA, which is even smaller. And if

40:14

you're doing RL on GSM8K

40:17

math [clears throat] reasoning you can

40:18

train 14 parameters and get like 91%

40:23

accuracy which is pretty crazy. I think

40:26

um there's like a lot of reasons for

40:28

this. Like RL makes really tiny changes.

40:30

I think with this Qwen model, like,

40:32

something fishy is going on with the

40:34

training data.

40:35

>> You have a one-parameter experiment.

40:38

>> Oh yeah, one parameter. It actually

40:40

learns. It gets 5% better with one

40:43

parameter. [laughter]

40:45

>> Pretty cool.

40:45

>> It's amazing.

40:46

>> Yeah. Yeah. It's really nice.

40:48

I think um

40:50

>> literally the smallest

40:52

>> Yeah. Yeah. The smallest thing you could

40:53

possibly train. It's more like you

40:56

generate a lot of random projections and

40:58

then you control them all with one

40:59

number if that makes sense. Like the

41:02

model actually changes a lot but the

41:04

only thing you can actually train and

41:06

store is the one parameter.

41:08

Uh, I'll tell you more about it later. Um

41:11

but yeah, it's pretty cool. Um

41:14

this is another result that's like kind

41:16

of in the mix, but I'm not sure how to

41:18

place it. So if you do the KV cache

41:20

tuning or prefix tuning, this paper

41:22

thinks prefix tuning works much better

41:24

than LoRA. I met some people at Meta, um,

41:26

when I used to be affiliated there that

41:28

said that they think LoRA works much

41:30

better than prefix tuning. So I really

41:31

don't know, but I think like what it

41:34

really will come down to is like when

41:36

you do it at scale, what's like most

41:37

efficient? And I'm not exactly sure, but

41:40

I think prefix tuning is a pretty good

41:42

candidate because like KV caches are so

41:45

commonly used these days and like a lot

41:48

of the system stuff is built around KV

41:50

caches. I think a cool thing about

41:52

Thinking Machines is like they're

41:53

designing this entire organization

41:54

around, like, scaling LoRA, which is

41:56

awesome but it's not really possible in

41:58

open source right now. Like there's not

42:00

kernels for training many LoRAs at the

42:02

same time. It's like very complex and

42:04

you have to have a lot of people working

42:05

on that. Prefix tuning on the other hand

42:06

is like very well supported. Um and then

42:09

finally I'll quickly talk about memory

42:11

layers. This is another approach to

42:12

injecting data into models which I think

42:14

is good. This is like, uh, adding an expert

42:18

to the MLP but the expert is just like

42:20

this giant differentiable lookup table.
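
A toy sketch of that idea (my own simplification, not the cited paper's design): a large table of learned keys and values where each token's hidden state reads from only its top-k nearest slots.

```python
import torch
import torch.nn as nn

class TinyMemoryLayer(nn.Module):
    # Real memory layers use far more slots; 4096 keeps this toy light.
    def __init__(self, d_model=512, n_slots=4096, k=32):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.values = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.k = k

    def forward(self, h):                       # h: (batch, seq, d_model)
        scores = h @ self.keys.T                # (batch, seq, n_slots)
        top, idx = scores.topk(self.k, dim=-1)  # sparse: only k slots per token
        weights = top.softmax(dim=-1)           # (batch, seq, k)
        picked = self.values[idx]               # (batch, seq, k, d_model)
        return (weights.unsqueeze(-1) * picked).sum(dim=-2)

mem = TinyMemoryLayer()
out = mem(torch.randn(2, 16, 512))              # drop-in addition alongside an MLP block
print(out.shape)                                # torch.Size([2, 16, 512])
```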

42:23

So it's kind of not that important

42:26

exactly how it works, but it's

42:28

just a different way to inject

42:29

information into models. The cool thing

42:31

about memory layers is it's

42:32

controllable. So in this work uh by

42:35

Jessy Lin from this year, they specify

42:39

exactly which parts of the memory layer

42:41

get updated and keep it to like a very

42:43

small number. And so their result shows

42:46

that memory layers actually work the

42:48

best. So the axes here are

42:52

forgetting so down is bad and learning

42:54

right is good. So the memory layers

42:57

basically don't forget at all and they

42:59

learn close to as much. So I think if

43:02

you're trying to inject information into

43:05

models that you really care about them

43:07

not forgetting any of their base

43:08

information, maybe memory layers are the

43:10

way to go. I think honestly there's a

43:12

lot of conflicting evidence right now.

43:13

Like some people think LoRA is good,

43:15

some people think prefix tuning is good.

43:16

These people think memory layers are

43:18

good. I really am not sure, but I think

43:21

it's going to be one of them.

43:23

Okay, cool. That's the end of the

43:25

training stuff into weights part. Maybe

43:27

actually I'll stop and see if anyone has

43:29

any questions about the different

43:30

parameterizations. Yeah.

43:40

>> Oh, yeah. Yeah. From my yet-

43:42

unreleased research.

43:44

>> So, have you used SFT before?

43:47

>> Yeah. Yeah. I can show you the SFT

43:49

results later. But SFT uh

43:53

takes a lot more parameters, is the short

43:55

explanation. Like, many, many more, like a

43:57

thousand x or something.

>> And you

44:00

attribute that to the sparsity of the

44:01

reward.

44:02

>> Yeah. Yeah. I think it's something like

44:04

that. Like, the SFT learning signal is

44:06

like cross entropy on all of the tokens

44:09

with or without thinking tokens. And

44:11

that's a lot of bits essentially. And

44:13

then RL just gives you a one or a zero.

44:16

If you get it right and you already

44:17

knew, then it's no information. If you

44:20

get it wrong, you get like one bit. So I

44:22

think because RL is like so sparse and

44:25

uh information efficient, then you can

44:26

do it with way fewer parameters. That's

44:28

kind of the takeaway from our

44:30

paper actually.
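
A rough illustration of that point (my own numbers, not from the talk or the paper): an SFT target supervises every token, while a verifiable-reward RL episode resolves to a single pass/fail signal.

```python
# Back-of-envelope supervision density, with assumed values.
tokens_per_answer = 500          # assumed SFT target length
bits_per_token = 1.0             # assumed usable signal per supervised token
print("SFT example:", tokens_per_answer * bits_per_token, "bits of supervision (order of magnitude)")
print("RL episode: ", 1.0, "bit at most (one binary reward)")
```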

44:30

>> So you didn't do GRPO after doing SFT?

44:34

>> No, no SFT. We just either do GRPO or

44:37

SFT and then we see like kind of how

44:39

many parameters you need to train to get

44:42

to equivalent performance and SFT

44:44

requires many more parameters.

44:48

>> Uh so here you are comparing like uh

44:51

training versus RAG. Like, we

44:54

want to solve the problem that we are

44:56

facing with RAG. So does the volume of

44:58

the documents also matter? Like, do you have

45:00

any studies, like, uh, because if some

45:03

problem has a smaller number of documents,

45:06

uh, will RAG be better, or will the, uh, training

45:10

be better?

45:11

>> That's a really good point. Um maybe

45:13

that... Let's, uh, go to the last slide. So I

45:17

think the question is like okay you're

45:19

trying to train all of your data into a

45:21

model but something only happens once.

45:23

Yeah, I mean, when should I

45:26

focus on RAG and when should I focus on

45:28

like uh a training fix, because

45:31

every time, I mean, I have like a small set

45:33

of documents the training might not be

45:36

feasible.

45:37

>> Yes. Yes. Like maybe

45:41

something is so underrepresented in

45:43

your data that it probably wouldn't

45:45

>> data is frequently changing might be

45:47

>> your data is changing a lot. Yeah. Maybe

45:49

in the short term it's hard to train. Um

45:52

yeah. So, let me point out like okay, so

45:55

obviously we're always going to put

45:56

stuff into context and I think we'll

46:00

also probably always do rag. Like I

46:02

think um there's basically no scenario

46:06

that you can imagine for a long time

46:08

where you're just like always training

46:09

the model and never doing rag. I think

46:11

you'll do both. I think like maybe if

46:13

you have a ton of documents, I don't

46:15

know, maybe every day you do this big

46:17

training and then every time you serve

46:18

you also do rag. And so like what I

46:21

really imagine is like or maybe my my

46:24

point is that no one is doing this right

46:26

now and like people will start doing

46:28

that.

46:28

>> Do you have any projection, like,

46:30

after a certain amount of data,

46:32

training will be like a more [cough]

46:34

efficient and direct, like, yeah.

46:38

uh no like I think I think this kind of

46:40

thing is really new so there's a lot of

46:42

room for analysis like that. I would

46:43

definitely be interested to see both

46:46

analysis on how the frequency of

46:48

information affects like the trade-off

46:50

and how just like how much data you have

46:52

to have for training to become

46:54

economically feasible. That's a really

46:55

good question.

46:57

>> Yeah. Um, is your suggestion kind of in

47:02

uh diving more into like the weights

47:04

side of uh the presentation to use a

47:07

fine-tuned model for like completion

47:10

type tasks or also for embeddings?

47:14

>> Oh yeah, that's a good question. Um, no,

47:17

I think I think the fine-tuning I'm

47:20

talking about is all for like assistant

47:21

engine completion. Um, it's an

47:24

interesting question. You probably could

47:25

do like dynamic embedding model

47:27

training, but I guess like the way I

47:29

think about it is like the real like 10x

47:32

improvement here is going to come from

47:33

training to weights. You could maybe

47:35

make rag like 2x better if you really

47:38

really worked, but I think there's so

47:40

many fundamental problems with it that I

47:42

wouldn't spend that much time on making

47:44

it better.

47:46

What do you feel like the most

47:49

fundamental problem is where even if

47:51

like your retrieval was fantastic, you

47:53

still

47:53

>> kind of I think like chunking like um

47:55

yeah,

47:56

>> you just like kind of retrieve some of

47:57

the stuff you need and then you can't

47:59

really reason across all of it. And like

48:01

I think in the limit like there's some

48:04

types of data where like no matter how

48:06

you chunk, you'll never get like

48:07

everything you need if that makes sense.

48:09

>> Yeah, totally.

48:10

>> Cool. Yeah. Do you see any fundamental

48:13

limitations as you scale up the amount

48:15

of personalization you need? Let's say

48:17

you had a B2C product that had 100

48:19

million or 10 million users, with memory for

48:21

all of those.

48:22

>> Do you think that's just not feasible?

48:24

>> You say 10 million users.

48:25

>> Yeah. 10 million 100 billion is more

48:27

than that.

48:27

>> Yeah. Um no, no, I actually think it is

48:30

it is feasible. Like with LoRA, maybe you

48:33

train a few megabytes per user or

48:37

something. It's not that crazy, right?

48:38

Like YouTube probably has gigabytes per

48:41

user, multiple times over,

48:43

>> right? That's a good [clears throat]

48:44

point. Like the continual updates are

48:45

hard. Like probably in realistic short

48:47

term, it's more like you update once a

48:49

day or something like that. But I think

48:50

that's that's doable. But you make a

48:53

good point that the paradigm I'm

48:54

describing is much more expensive.
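
For a rough sense of the "few megabytes per user" figure, here is a quick back-of-the-envelope with assumed numbers (roughly an 8B-parameter model, rank-16 LoRA on four projection matrices per layer, fp16 storage); none of these values come from the talk.

```python
def lora_bytes(n_layers: int = 32, d_model: int = 4096, rank: int = 16,
               targets_per_layer: int = 4, bytes_per_param: int = 2) -> int:
    # Each adapted matrix adds two low-rank factors: (d_model x rank) and (rank x d_model).
    params_per_target = 2 * d_model * rank
    total_params = n_layers * targets_per_layer * params_per_target
    return total_params * bytes_per_param

print(lora_bytes() / 1e6, "MB")  # ~33.6 MB; smaller rank or fewer targets gets this to a few MB
```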

48:57

>> Also, do consider there's a lot more

48:58

that you can do in the other two

49:00

buckets. You compress the data context.

49:02

You compress it before you put it into RAG. You

49:04

break it up into other buckets. You

49:06

don't just have to use RAG; you can use SQL

49:08

and knowledge bases, tie all of them together in

49:11

different buckets and that solves a lot

49:12

of problems.

49:13

>> Yeah. Yeah, that's a good point. There's

49:14

kind of like three axes of optimization

49:16

here. And I guess like we're

49:20

getting pretty good at this. We're okay

49:22

at this and we're horrible at this. And

49:23

so like we'll continue improving upon

49:26

all three axes.

49:28

>> Yeah. What's your uh like I'm kind of

49:31

hearing that maybe it's not defined yet,

49:33

but what's your kind of like intuition

49:35

or guess in terms of like where the

49:37

decision boundary is in terms of

49:39

investing your effort in those

49:41

optimizations particularly in like let's

49:43

say a couple of years where you could do

49:45

something like a deep research but it

49:47

would be way cheaper and way faster. Um

49:49

when what are there

49:52

you were saying that there isn't like a

49:54

number of documents but what is the

49:56

boundary that you would think about

49:58

looking at is it the freshness of the

50:00

data how fast changing is the number of

50:02

documents there what's your

50:04

>> yeah I it's a really good question I I

50:07

think um I think the paradigm I'm

50:10

describing is especially effective when

50:11

you have like a large amount of data

50:13

that's not been indexed into the LLM at

50:15

all and it gives you a big benefit there

50:17

I think when you start seeing

50:19

like sparser updates to your data set or

50:22

like some new data comes in but it's not

50:24

that much and it's like fairly often

50:26

then you probably want to turn to

50:27

inference time approaches that are

50:28

closer to deep research.

50:31

Um yeah that guy had a question on

50:34

>> yeah can you elaborate a little bit more

50:36

about the synthetic data generation so

50:39

let's say that you need it to talk

50:45

similar language and terminology, like

50:48

proprietary data, right, like millions of

50:52

documents. Like, how is synthetic data

50:55

generation helpful in that context?

51:00

>> so your company has millions of

51:02

documents you said and you want the

51:03

model to

51:04

>> it's more like a scenario.

51:05

>> Yeah. Yeah. Okay.

51:06

>> Yeah. Yeah. Yeah. Um

51:07

>> because it wouldn't you said you

51:10

wouldn't just train the mix work, right?

51:13

>> Yeah.

51:15

Try out different such and I think one

51:17

of the you talk about synthetic data.

51:21

>> Yeah. Yeah. No, I think I think

51:23

synthetic data generation could work for

51:25

that problem. So I guess like um it

51:31

depends on how information dense your

51:32

data is. If you have millions of

51:34

documents from your company, I would

51:36

guess many of them share formatting and

51:38

only contribute maybe like a few bits of

51:41

kind of global information to the data

51:43

set. And so what you want to think about

51:45

is like does there exist a function that

51:47

could produce a good training data set

51:49

for an LLM that would teach it about my

51:51

data? And like there probably is. Like

51:53

you could probably design some strategy

51:54

that looks at the documents, kind of

51:56

like figures out what's new about each

51:58

document and creates like kind of

51:59

question answer pairs, but this is very

52:02

blue sky. Like I think a lot of people

52:03

are working on this right now, but I

52:05

don't have like a a global answer of how

52:09

to actually

52:09

>> right now my only solution that I can

52:11

think of is um you know getting it to

52:14

generate that Q&A,

52:17

>> right?

52:25

Yeah. Yeah. I think it also depends on

52:26

what types of questions you'll be asking

52:28

about the documents. Like what you

52:29

really want to model is like all

52:30

possible questions or something like

52:32

that, but I think Q&A gets you pretty

52:34

far.
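
A minimal sketch of the synthetic Q&A idea from this exchange: walk the documents, ask a model to propose question/answer pairs about what is new in each one, and keep the results as fine-tuning data. The `llm_complete` callable and the prompt wording are placeholders, not a specific API or recipe from the talk.

```python
import json

def make_qa_dataset(documents, llm_complete, pairs_per_doc=5):
    """llm_complete: any callable mapping a prompt string to a completion string."""
    dataset = []
    for doc in documents:
        prompt = (
            "Here is an internal document:\n\n"
            f"{doc}\n\n"
            f"Write {pairs_per_doc} question/answer pairs that could only be "
            "answered by someone who has read this document. "
            'Return JSON: [{"question": "...", "answer": "..."}]'
        )
        try:
            pairs = json.loads(llm_complete(prompt))
        except json.JSONDecodeError:
            continue  # skip malformed generations
        for p in pairs:
            dataset.append({"messages": [
                {"role": "user", "content": p.get("question", "")},
                {"role": "assistant", "content": p.get("answer", "")},
            ]})
    return dataset
```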

52:37

>> Cool.

52:37

>> Yeah. Um so with with this approach

52:40

right you you you mentioned this example

52:42

where you're um uh you would train your

52:46

model right on 3M uh quarterly earnings

52:50

right uh I think 10-K, 10-Q um documents

52:54

what would like

52:56

what would the prompt basically look

52:58

like right like is there is there

53:00

anything in within like the in context

53:03

learning that would still need to be

53:05

kind of specified to

53:09

bring your data into a context.

53:12

>> Yeah. Uh so I think the question was if

53:14

you start with the 3M example we had and

53:17

you train all that into a model using

53:19

some like magic synthetic data, what

53:20

does actually the prompt look like?

53:21

>> Yeah.

53:22

>> I think actually if you do it right, you

53:23

don't need a prompt at all like you can

53:25

just ask the model a question. No system

53:26

prompt, no

53:29

extra information and if nothing has

53:31

changed, it should know everything. like

53:33

and you even there's some scenarios

53:34

where there's only one document and the

53:36

model knows which document it is so you

53:38

don't have to specify that you're even

53:39

asking a question about the document

53:40

it's like implied you know so um it

53:43

depends on how you set it up but I think

53:45

in like the ideal case there's no prompt

53:47

at all

53:51

>> yeah

53:53

It's not obvious to me that

53:55

information is best stored in the model. Yeah,

53:58

why do you have that um it

54:01

feels implied

54:03

you have my

54:04

>> good question.

54:06

>> So he said it's not obvious that

54:08

information needs to be stored in

54:09

weights. Yeah. Yeah. This is this is a

54:12

good question. I think um I'm not saying

54:14

that it's best to store information in

54:17

weights. I guess I'm arguing that that

54:20

gets you a lot and we're not using it

54:22

right now.

54:23

>> And like once you get to the scale of

54:25

like a GitHub repo, you might have

54:28

millions of tokens and it's just like

54:29

very expensive. And so at least like

54:32

this is the cheapest way to do it. The

54:34

question of like can we generate

54:36

synthetic data to do better than in

54:38

context is like it's it's hard. I think

54:41

it's like that's research

54:44

that do you know what I mean when I say

54:45

it's cheaper though like if you have a

54:48

million token prompt you can just like

54:50

compress it into the weights and produce

54:52

a model that gives the same outputs with

54:54

no prompt and then the inference costs

54:57

less.
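
One concrete reading of "compress the prompt into the weights" is context distillation: train a student (for example, just a small adapter on the same base model) so that its outputs without the long prompt match the base model's outputs with the prompt. Below is a rough sketch using a Hugging Face-style model/tokenizer interface; the single-position KL loss and the helper arguments are simplifying assumptions, not a published recipe.

```python
import torch
import torch.nn.functional as F

def distill_context(teacher, student, tokenizer, long_context, questions,
                    optimizer, temperature: float = 1.0):
    """Make `student` answer `questions` as if `long_context` were in its prompt."""
    teacher.eval()
    for q in questions:
        with torch.no_grad():
            t_in = tokenizer(long_context + "\n\n" + q, return_tensors="pt")
            t_logits = teacher(**t_in).logits[:, -1, :]   # teacher sees the context

        s_in = tokenizer(q, return_tensors="pt")
        s_logits = student(**s_in).logits[:, -1, :]       # student does not

        # Match the student's next-token distribution to the teacher's.
        loss = F.kl_div(
            F.log_softmax(s_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

A real setup would distill over full sampled continuations rather than a single next-token position, and would typically restrict the trainable parameters to a LoRA-style adapter so the compressed context stays cheap to store and swap.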

55:00

I have one

55:04

after

55:10

that there is no adversarial data.

55:13

>> That's actually a really good question.

55:15

Never thought about it before. Um I

55:17

think it's probably pretty hard. Like I

55:18

guess if you're training on user data

55:20

and like you have some user that wants

55:22

to sabotage your system and you're

55:24

generating training data from their

55:26

inputs, there probably are a lot of like

55:28

security risks. And uh I guess in this

55:33

scenario, if you're serving the same

55:34

model to that user and it doesn't work

55:35

anymore, that's like not your problem.

55:37

But once you start aggregating

55:38

information across users, I bet it

55:40

becomes hard. I'm sure ChatGPT has the

55:42

same problem where some people always

55:43

click thumbs down instead of thumbs up

55:45

to try to like [laughter]

55:49

>> the research [snorts] uh they segmented

55:51

geographically across countries. So some

55:54

cultures are inclined

55:57

>> so it [laughter] files in data.

56:00

>> That's funny.

56:03

>> Yeah. Um, so thinking maybe

56:05

[clears throat] a little bit about

56:06

practical implementations of something

56:07

like this. Um, especially in terms of

56:09

like say version controlling, you

56:11

mentioned GitHub models that you keep

56:13

fine-tuning over time. Say you're a

56:15

company that just changed a policy and

56:16

it's just a one [snorts] line sentence.

56:18

Going from "we honor something" to "we do not honor it

56:20

anymore"

56:21

>> that keeps going back and forth. Do you

56:23

then you know start from the base model

56:25

again and then fine-tune that or go back to

56:27

the one that already has a good

56:28

representation of it, and just has to

56:31

change that one small thing and then you

56:33

know how that kind of is joined at the

56:34

hip with hallucinations which is kind of

56:37

why we were doing full context

56:40

to avoid that. Do you have any thoughts

56:41

on how that might work? Yeah, I think it

56:43

so so his question was about

56:46

what do you do once you start making

56:48

multiple updates to the model especially

56:50

when you have like conflicting

56:51

information and I think like the optimal

56:54

synthetic data strategy would somehow

56:55

figure this out during training and

56:57

maybe even like if there's some

56:58

documents from a few days ago that are

57:00

no longer relevant you can just like

57:01

delete them but I don't know how

57:04

>> as far as how we can give more attention

57:06

in the same like whatever uh let's say

57:10

uh information is conflicting with each

57:12

uh whatever pre-trained versus what up

57:15

front document we are giving for

57:16

training if it is a contra but I want

57:19

more preference from my document

57:22

by what we are doing in like asking the

57:25

question from the ground truth so how uh

57:29

it will replace that scenario

57:33

>> I'm [clears throat] not sure I

57:33

understand the question

57:35

>> sorry

57:35

>> I I don't know if I understand your

57:37

question

57:37

>> okay sorry what you

57:40

>> I didn't understand your question

57:41

>> so my question is like, we have the uh

57:44

whatever the training data we are giving

57:46

it which is contradicting with the

57:48

pre-training data, it is conflicting.

57:51

Now while asking the question, at

57:53

inference I want to give more preference

57:55

to my data. I don't need the pre-trained

57:58

information. That's why we are using RAG,

58:01

like I need output from my ground truth,

58:04

whatever the context I'm giving

58:06

>> So how can we achieve that in,

58:09

like, training?

58:12

I think that the

58:16

the paradigm I'm proposing has all the

58:18

same limitations of rag.

58:21

I'm not positive that answers your

58:22

question, but like for example, if

58:26

uh like maybe in the scenario he said

58:28

where he said something many times and

58:30

then turns out not to be true, both rag

58:32

would retrieve that and in the uh

58:35

dumbest setup that would also be present

58:37

alive in the training data. So I think

58:38

like the same problems have to be

58:40

solved.

58:42

>> Have you done any work with federated uh

58:44

tuning, fine-tuning partitions

58:49

of users?

58:50

>> Have you done any research in that spot?

58:52

>> No no no no uh not really but I think

58:54

it's an interesting uh opportunity. So

58:56

like back in the day a lot of people

58:57

were really excited about the idea that

58:59

you could share gradients and train the

59:01

same model across many machines. This is

59:03

federated learning. And I think like one

59:06

of the problems why it's hard is because

59:07

the models now are so big that the

59:09

network costs are way too high and

59:11

because like I'm arguing that you only

59:13

need to train a million parameters

59:14

instead of a trillion. It probably comes

59:16

back into play. So I think it's a very

59:18

good idea especially in the RL world

59:20

where you do a lot of work for a long

59:23

time and then do gradients like very

59:26

seldomly. So I think it probably will

59:29

come back and it's smart to think of it

59:31

but it hasn't quite yet.
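
A minimal sketch of the federated-averaging idea, under the assumption that each client only trains a small adapter (e.g., a LoRA delta) so the payload shipped each round is tiny; the names and the plain averaging scheme are illustrative, not a production protocol.

```python
import torch

def federated_average(client_adapters: list) -> dict:
    """Average per-client adapter tensors (same keys and shapes on every client)."""
    n = len(client_adapters)
    return {
        name: sum(adapter[name] for adapter in client_adapters) / n
        for name in client_adapters[0]
    }

# Each round: the server broadcasts the averaged adapter, clients fine-tune it
# locally on their own data, send back only the few-megabyte delta, and the
# server re-averages. The base model's full parameter set never moves over
# the network, which is what makes the federated idea plausible again.
```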

59:34

Um maybe I'll take like two more

59:36

questions. Yeah. Go.

59:37

>> Um so your argument here about training

59:40

in um information seems to be uh counter

59:45

to Karpathy's view of like a reasoning

59:48

engine like distilling just the pure

59:50

like you know intelligence aspect of of

59:53

a model down to like a two billion

59:55

parameter thing.

59:56

Um uh and like I think that there's a

59:59

bit of overlap there like um

60:03

like a lawyer is not doesn't have the

60:07

entire legal code memorized but they

60:08

know how to use the tools available to

60:10

them to find what they need to. And so I

60:13

I think part of it is kind of a

60:15

combination of those two things where

60:16

you're doing task specific training with

60:20

something like this on a relatively

60:22

small reasoning brain to get a sense of

60:26

where it needs to find the things that

60:28

uh might become stale or or you know am

60:32

I on the right track here or

60:34

>> Yeah. Yeah. So I think there may be a

60:37

comparison between some people who have

60:38

said, "Oh, the best model we could ever

60:40

have is like really small and knows

60:42

nothing but can use tools really well or

60:44

something like that." And I guess I I

60:47

was proposing some similar ideas. I said

60:50

models know way too much. I think

60:51

everyone agrees the model doesn't need

60:52

to know the capital of the smallest

60:54

province of Tajikistan for most use

60:56

cases at least in like my life.

60:59

>> It doesn't need to remember, you know,

61:00

encryption keys.

61:02

>> Yeah. But I think there's I I think this

61:04

is a very philosophical question, but uh

61:07

I think it's really hard to create a

61:08

model that doesn't know anything. And so

61:10

I'm more advocating for like specialized

61:13

models that are good at something you

61:15

care about but bad at other things

61:17

rather than advocating for a model

61:18

that's like bad at everything.

61:21

>> Okay, last question here.

61:22

>> Yeah. Have you ever done research yet in

61:24

the temporal elements of the

61:25

information? No, but I think that's like

61:28

one of the first things to think about

61:29

is like, okay, if you have information

61:30

from day one and day two and day three,

61:32

do you just sort of like concatenate

61:34

everything or do you train in order kind

61:36

of like you were asking or do you like

61:38

train multiple models and merge them or

61:40

I I actually don't know, but that's a

61:42

good segue. So now I'm uh I'm working on

61:46

problems related to this a lot,

61:48

thinking about this a lot. um started a

61:51

company with a few other people and um

61:54

this is like the kind of research we're

61:56

doing. If anyone knows someone who lives

61:58

in San Francisco and is a good engineer

62:00

and you think they're interested in

62:01

this, let me know or send me an email.

62:04

Or if you're interested in like using

62:05

this kind of thing, send me an email.

62:07

That would be great.

62:08

>> It's temporal stuff or

62:10

>> Not necessarily, I mean it's kind of all of

62:12

this I would say. Um trying to build

62:14

models that you can teach things to.

62:18

All

62:21

right. Thanks so much for having me.

62:23

This is great. [applause]

62:27

[music]

62:43

>> [music]

Interactive Summary

The speaker discusses the limitations of current Large Language Models (LLMs) in retaining and accessing specific knowledge, categorizing existing solutions into "full context" and "Retrieval Augmented Generation" (RAG). Full context methods, while simple, are expensive and slow due to the quadratic dependency of self-attention in transformers, and model performance degrades with increasing context. RAG, though widely adopted and easy to use, suffers from security vulnerabilities as embeddings can be reverse-engineered, lacks adaptivity to niche domains, and fundamentally struggles with complex reasoning over multiple documents. The speaker proposes a third, more effective approach: training specific knowledge directly into the model's weights. This method, while more expensive during training, can overcome issues like catastrophic forgetting by generating diverse synthetic data from small datasets and employing parameter-efficient techniques like LoRA, prefix tuning, or memory layers, ultimately leading to more specialized, intelligent, and cost-effective inference for specific tasks.
