DeepSeek OCR - More than OCR
Okay, so you've probably heard the phrase "a picture is worth a thousand words," and generally that means we can get more out of a picture than we would out of a thousand-word description of it.
Well, DeepSeek has taken this idea to a whole new level.
So they've just released a paper and a model for DeepSeek OCR, and I'm gonna tell
you that this is really not about the OCR.
Yes, their model can process large amounts of text from documents, but OCR is really not the point.
What they're getting at here is: what if you could store the equivalent of a thousand words in a single image, and the model could then read them back almost perfectly?
So this is exactly what the researchers at DeepSeek have achieved, and it
could change how we think about AI memory and long context processing.
So in this video, I want to dive into this DeepSeek OCR paper, look at what they're doing, and explain how it works compared to normal vision language models.
So what they're doing here is introducing something they call contexts optical compression.
And the whole idea behind this is that they want to basically use vision
as a compression algorithm for text.
Even though the paper is called DeepSeek OCR, and I thought, okay, I'll just look at this quickly, since there's a bunch of OCR models that have come out recently that I haven't had time to cover, very quickly I realized that, hey, this is not about the OCR; the implications go way beyond just reading documents.
So let's talk about why this matters.
As you probably know, one of the issues with large language models is how to effectively handle long contexts or really long documents, and we would like to be able to go out not just to a million tokens, but to 10 million tokens, perhaps even more.
The challenge is that we've got roughly one text token per word. Now, the core breakthrough here, around what they're calling contexts optical compression, is this: instead of just thinking about converting an image into tokens, what if you could store text in an image and get it back from far fewer vision tokens?
Basically, they can use a hundred vision tokens, which can then be decoded back into a thousand text tokens with about 97% accuracy. So that's roughly a 10x compression ratio with almost perfect fidelity. They even push it to around 20x compression: using just 50 vision tokens for a thousand text tokens, they can still maintain around 60% accuracy.
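Just to make that arithmetic concrete, here's a tiny back-of-envelope sketch in Python. The accuracy figures are the ones quoted above from the paper; nothing here is measured by me.

```python
# Rough arithmetic for the compression claims quoted above (figures assumed
# from the paper as I read it, not re-measured here).
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens

print(compression_ratio(1000, 100))  # 10.0x -> reported ~97% decoding accuracy
print(compression_ratio(1000, 50))   # 20.0x -> reported ~60% decoding accuracy
```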
So really this paper is not about OCR; it's about creating a new form of memory compression for large language models and perhaps even AI systems as a whole.
So imagine a model that could take your entire conversation history of millions of tokens, render it as images, and then bring those into the model with dramatically fewer tokens than you would need to represent it the normal way, with text tokens.
You could imagine keeping the recent conversation as high-resolution, perfect text tokens, and rendering anything beyond a certain point as images. Those could still go into the context window, where they could be used for in-context learning or for returning to ideas you'd mentioned in the past.
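Here's a minimal sketch of that tiered-memory idea, assuming a crude one-token-per-word estimate and the roughly 10x compression figure from earlier. The constants and helper are mine for illustration, not an API from the paper.

```python
# Hypothetical tiered context: keep recent turns as text tokens, assume older
# turns get rendered to images and encoded at ~10x fewer vision tokens.
RECENT_TURNS = 20   # how many turns to keep verbatim (assumption)
COMPRESSION = 10    # assumed vision-token compression ratio

def estimate_context_size(turns: list[str]) -> int:
    recent, older = turns[-RECENT_TURNS:], turns[:-RECENT_TURNS]
    text_tokens = sum(len(t.split()) for t in recent)   # ~1 token per word
    older_words = sum(len(t.split()) for t in older)
    vision_tokens = older_words // COMPRESSION           # the "rendered as images" tier
    return text_tokens + vision_tokens

history = [f"turn {i}: " + "word " * 100 for i in range(1000)]
print(estimate_context_size(history))   # roughly 12k tokens instead of ~100k text tokens
```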
Okay, so here's a quick recap if you don't know how images work in transformers. One of the biggest challenges when you've got an image and you want to put it into a transformer is that you need to tokenize it somehow. So how do you actually do that?
Well, you start off with an image like this and cut it up into little patches, and those patches then get converted individually into tokens. You can see here that this idea comes from the original Vision Transformer model: you've got these image patches, and each patch becomes a token.
There are various ways people have done this over time. In one of the original approaches, if you've got a patch of, say, 16 by 16 pixels, that gives us 256 pixels. Each of those has a red, green and blue channel, meaning we have three times 256, which gives us 768 values per patch. You can kind of think of that as an embedding, but it goes through a projection layer to become the actual embedding. So we end up with tokens where each token represents a patch from the image.
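Here's a minimal PyTorch sketch of that patch-embedding step, assuming a 224x224 RGB image, 16x16 patches (so 16 x 16 x 3 = 768 values per patch), and a made-up 1024-dimensional embedding size.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)          # (batch, RGB channels, height, width)

patch, dim = 16, 1024
# Cut the image into non-overlapping 16x16 patches and flatten each one.
patches = nn.functional.unfold(image, kernel_size=patch, stride=patch)  # (1, 768, 196)
patches = patches.transpose(1, 2)             # (1, 196, 768): 196 patch "tokens"

project = nn.Linear(patch * patch * 3, dim)   # the learned projection layer
vision_tokens = project(patches)              # (1, 196, 1024) patch embeddings
print(vision_tokens.shape)
```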
If you look at something like the PaliGemma model, you can see an example of how this comes together: you've got some kind of vision encoder, which chops up the image and, through a linear projection, turns it into a bunch of tokens. Those tokens can then be fed into the language model and combined with text tokens, so if we wanted to ask questions about the image, the model could generate text in response.
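And here's a rough, self-contained sketch of that combination step: the projected vision tokens simply get concatenated with the text-token embeddings before going into the language model. The shapes and vocabulary size are illustrative, not PaliGemma's actual code.

```python
import torch
import torch.nn as nn

dim = 1024
vision_tokens = torch.randn(1, 196, dim)          # stand-in for the projected patches above
text_ids = torch.randint(0, 32000, (1, 12))       # a short question: 12 text tokens
text_tokens = nn.Embedding(32000, dim)(text_ids)  # (1, 12, 1024) via the LM's embedding table

lm_input = torch.cat([vision_tokens, text_tokens], dim=1)  # (1, 208, 1024)
print(lm_input.shape)                             # the language model attends over both
```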
So the idea here is: if we've got text in the image, can we represent the equivalent of many text tokens accurately as a much smaller number of compressed vision tokens? That would let us get away with far fewer tokens overall.
So how do they actually do this? Well, the secret sauce here is their DeepEncoder. Most vision encoders would either use too many vision tokens to make this possible, require lots of memory, or be unable to handle high-resolution inputs.
Their DeepEncoder is a two-stage solution. The first stage uses a SAM model, which is only about 80 million parameters. The idea is that this stage works out what to pay attention to at really high resolution, picking up all the fine details.
Before they go to stage two, though, they compress those features with a CNN, which lets them shrink the token count down by 16 times. In the second stage, they pass the result into a CLIP model, and the idea is that this takes the compressed pieces of information and uses global attention to work out what relates to what. At that point they've got something like a really efficient summary of the image.
Now I'm massively simplifying it here just to sort of explain
it in a short amount of time.
But you can think of what they're doing as a multi-stage way of extracting the information, as opposed to trying to pay attention to everything at once and ending up with too many tokens.
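To make the shape of that pipeline concrete, here's a toy PyTorch sketch of the two-stage flow as I understand it: a local first stage, a small convolution that cuts the token count by 16x, then global attention over what's left. The modules and sizes here are stand-ins, not the released DeepEncoder code.

```python
import torch
import torch.nn as nn

class ToyDeepEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.local = nn.Conv2d(3, dim, kernel_size=16, stride=16)     # stand-in for the SAM stage
        self.compress = nn.Conv2d(dim, dim, kernel_size=4, stride=4)  # 16x fewer spatial tokens
        self.global_attn = nn.TransformerEncoder(                     # stand-in for the CLIP stage
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, image):
        x = self.local(image)                 # (1, dim, 64, 64) for a 1024x1024 input
        x = self.compress(x)                  # (1, dim, 16, 16): the 16x token reduction
        x = x.flatten(2).transpose(1, 2)      # (1, 256, dim) sequence of compressed tokens
        return self.global_attn(x)            # global attention relates everything to everything

tokens = ToyDeepEncoder()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)                           # torch.Size([1, 256, 256])
```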
The other thing they do is set it up so it can handle different zoom levels within the same system. So they have a version that outputs 64 tokens, which they call Tiny mode, a hundred tokens for Small mode, 256 for Base mode, right up to Gundam mode, which is around 1,800 tokens.
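As a quick reference, here's that mapping of modes to approximate vision-token budgets, plus a rough text-token equivalent using the ~10x compression figure from earlier; the exact numbers are approximate and the per-mode equivalence is my extrapolation, not a claim from the paper.

```python
# Approximate vision-token budgets per mode, as quoted above.
MODES = {"tiny": 64, "small": 100, "base": 256, "gundam": 1800}

for name, vision_tokens in MODES.items():
    # Rough text capacity if the ~10x compression point holds.
    print(f"{name:>7}: ~{vision_tokens} vision tokens ~= {vision_tokens * 10} text tokens at 10x")
```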
To give you a real-world comparison, the old way of doing this might need 6,000 tokens to represent a document going into your large language model, and that obviously uses a lot of computational power and memory. This DeepEncoder approach can do the same thing with under 800 vision tokens, and it can actually get better performance.
Okay.
Now, at this point I should point out that what they're actually demonstrating is on OCR tasks; they're proving this idea of compression in a somewhat theoretical way. So we don't know for sure how this is going to pan out, or whether we could actually use, say, 500,000 vision tokens to replace 5 million text tokens. At the moment it's still early research, but they have shown it with this OCR task, and it works pretty well there, so it can certainly work for OCR.
If we look at the OCR benchmarks they've got here, we can see that the model maintains over 95% accuracy as long as the compression stays at around 10x or below. So just to finish this up, I would say DeepSeek OCR is not really just a model for doing OCR.
We've got a whole new vision encoder out of this, and we've also got a really interesting decoder, which is a DeepSeek 3B MoE with only around 570 million active parameters.
But for me, this is so much more than just an OCR paper.
It's really about an interesting idea: how can we take text tokens and store them as vision tokens, and how can we extract more information out of vision tokens than we currently get out of text tokens?
So overall this is a pretty cool concept, and it really reinforces a lot of what DeepSeek has been doing over the past year and a half or so: trying out different ideas rather than following what everyone else is doing. That certainly gave them big wins with DeepSeek R1, and you've got to wonder where they go with this, whether we could actually build systems with the equivalent of 10 to 20 million text-token context windows, but through vision tokens.
Anyway, check out the paper and have a look at what they've actually released. The code is up on GitHub, and the model, I think, is up on Hugging Face as well, although it seems like Hugging Face has been having some problems serving models today, along with a lot of other people.
I may revisit this to check how well it actually works for real-world tasks, comparing it to some of the other OCR systems that have come out recently, like Nanonets OCR 2 and PaddleOCR-VL, which are both very interesting, very small models built just for doing OCR.
So as always, let me know in the comments if you've got any questions, and I will talk to you in the next video.
Bye for now.