DeepSeek OCR - More than OCR
Okay, so you've probably heard the phrase "a picture is worth a thousand words," and generally that means we can get more out of a picture than we would out of a thousand-word description of it.
Well, DeepSeek has taken this idea to a whole new level.
So they've just released a paper and a model for DeepSeek OCR, and I'm gonna tell
you that this is really not about the OCR.
Yes, their model can process large amounts of text from documents, but OCR is really not the point.
What they're getting at here is: what if you could store the equivalent of a thousand words in a single image, and the model could then read them back almost perfectly?
So this is exactly what the researchers at DeepSeek have achieved, and it
could change how we think about AI memory and long context processing.
So in this video, I want to dive into this DeepSeek OCR paper, look at what they're doing, and explain how it works compared to normal vision language models.
So what they're doing here is introducing something they call contexts optical compression.
And the whole idea behind this is that they want to basically use vision
as a compression algorithm for text.
Even though the paper is called DeepSeek OCR, and I thought, okay, I'll just look at this quickly, since there's a bunch of OCR models that have come out recently that I haven't had time to cover, very quickly I realized that, hey, this is not about the OCR; the implications go way beyond just reading documents.
So let's talk about why this matters.
As you probably know, one of the issues with large language models is how to effectively handle long contexts or really long documents, and we would like to be able to go out not just to a million tokens, but to 10 million tokens, perhaps even more.
The challenge is that we've got roughly one text token per word. Now, the core breakthrough here, around what they're calling contexts optical compression, is this: instead of just thinking about converting an image into tokens, what if you could store text in an image and get it back from far fewer vision tokens?
Basically, they can use a hundred vision tokens, which can then be decoded back into a thousand text tokens with about 97% accuracy. So that's roughly a 10x compression ratio with almost perfect fidelity. They even push it to around 20x compression: using just 50 vision tokens for a thousand text tokens, they can still maintain around 60% accuracy.
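Just to make that arithmetic concrete, here's a tiny back-of-envelope sketch in Python. The accuracy figures are the ones quoted above from the paper; nothing here is measured by me.

```python
# Rough arithmetic for the compression claims quoted above (figures assumed
# from the paper as I read it, not re-measured here).
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens

print(compression_ratio(1000, 100))  # 10.0x -> reported ~97% decoding accuracy
print(compression_ratio(1000, 50))   # 20.0x -> reported ~60% decoding accuracy
```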
So really this paper is not about OCR; it's about creating a new form of memory compression for large language models and perhaps even AI systems as a whole.
So imagine a model that could take your entire conversation history of millions of tokens, render it as images, and then bring those into the model with dramatically fewer tokens than you would need to represent it the normal way, with text tokens.
You could imagine keeping the recent conversation as high-resolution, perfect text tokens, and rendering anything beyond a certain point as images. Those could still go into the context window, where they could be used for in-context learning or for returning to ideas you'd mentioned in the past.
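Here's a minimal sketch of that tiered-memory idea, assuming a crude one-token-per-word estimate and the roughly 10x compression figure from earlier. The constants and helper are mine for illustration, not an API from the paper.

```python
# Hypothetical tiered context: keep recent turns as text tokens, assume older
# turns get rendered to images and encoded at ~10x fewer vision tokens.
RECENT_TURNS = 20   # how many turns to keep verbatim (assumption)
COMPRESSION = 10    # assumed vision-token compression ratio

def estimate_context_size(turns: list[str]) -> int:
    recent, older = turns[-RECENT_TURNS:], turns[:-RECENT_TURNS]
    text_tokens = sum(len(t.split()) for t in recent)   # ~1 token per word
    older_words = sum(len(t.split()) for t in older)
    vision_tokens = older_words // COMPRESSION           # the "rendered as images" tier
    return text_tokens + vision_tokens

history = [f"turn {i}: " + "word " * 100 for i in range(1000)]
print(estimate_context_size(history))   # roughly 12k tokens instead of ~100k text tokens
```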
Okay, so here's a quick recap if you don't know how images work in transformers. One of the biggest challenges when you've got an image and you want to put it into a transformer is that you need to tokenize it somehow. So how do you actually do that?
Well, you start off with an image like this and cut it up into little patches, and those patches then get converted individually into tokens. You can see here that this idea comes from the original Vision Transformer model: you've got these image patches, and each patch becomes a token.
There are various ways people have done this over time. In one of the original approaches, if you've got a patch of, say, 16 by 16 pixels, that gives us 256 pixels. Each of those has a red, green and blue channel, meaning we have three times 256, which gives us 768 values per patch. You can kind of think of that as an embedding, but it goes through a projection layer to become the actual embedding. So we end up with tokens where each token represents a patch from the image.
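Here's a minimal PyTorch sketch of that patch-embedding step, assuming a 224x224 RGB image, 16x16 patches (so 16 x 16 x 3 = 768 values per patch), and a made-up 1024-dimensional embedding size.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)          # (batch, RGB channels, height, width)

patch, dim = 16, 1024
# Cut the image into non-overlapping 16x16 patches and flatten each one.
patches = nn.functional.unfold(image, kernel_size=patch, stride=patch)  # (1, 768, 196)
patches = patches.transpose(1, 2)             # (1, 196, 768): 196 patch "tokens"

project = nn.Linear(patch * patch * 3, dim)   # the learned projection layer
vision_tokens = project(patches)              # (1, 196, 1024) patch embeddings
print(vision_tokens.shape)
```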
If you look at something like the PaliGemma model, you can see an example of how this comes together: you've got some kind of vision encoder, which chops up the image and, through a linear projection, turns it into a bunch of tokens. Those tokens can then be fed into the language model and combined with text tokens, so if we wanted to ask questions about the image, the model could generate text in response.
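And here's a rough, self-contained sketch of that combination step: the projected vision tokens simply get concatenated with the text-token embeddings before going into the language model. The shapes and vocabulary size are illustrative, not PaliGemma's actual code.

```python
import torch
import torch.nn as nn

dim = 1024
vision_tokens = torch.randn(1, 196, dim)          # stand-in for the projected patches above
text_ids = torch.randint(0, 32000, (1, 12))       # a short question: 12 text tokens
text_tokens = nn.Embedding(32000, dim)(text_ids)  # (1, 12, 1024) via the LM's embedding table

lm_input = torch.cat([vision_tokens, text_tokens], dim=1)  # (1, 208, 1024)
print(lm_input.shape)                             # the language model attends over both
```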
So the idea here is: if we've got text in the image, can we represent the equivalent of many text tokens accurately as a much smaller number of compressed vision tokens? That would let us get away with far fewer tokens overall.
So how do they actually do this? Well, the secret sauce here is their DeepEncoder. Most vision encoders would either use too many vision tokens to make this possible, require lots of memory, or be unable to handle high-resolution inputs.
Their DeepEncoder is a two-stage solution. The first stage uses a SAM model, which is only about 80 million parameters. The idea is that this stage works out what to pay attention to at really high resolution, picking up all the fine details.
Before they go to stage two, though, they compress those features with a CNN, which lets them shrink the token count down by 16 times. In the second stage, they pass the result into a CLIP model, and the idea is that this takes the compressed pieces of information and uses global attention to work out what relates to what. At that point they've got something like a really efficient summary of the image.
Now I'm massively simplifying it here just to sort of explain
it in a short amount of time.
But you can think of what they're doing as a multi-stage way of extracting the information, as opposed to trying to pay attention to everything at once and ending up with too many tokens.
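To make the shape of that pipeline concrete, here's a toy PyTorch sketch of the two-stage flow as I understand it: a local first stage, a small convolution that cuts the token count by 16x, then global attention over what's left. The modules and sizes here are stand-ins, not the released DeepEncoder code.

```python
import torch
import torch.nn as nn

class ToyDeepEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.local = nn.Conv2d(3, dim, kernel_size=16, stride=16)     # stand-in for the SAM stage
        self.compress = nn.Conv2d(dim, dim, kernel_size=4, stride=4)  # 16x fewer spatial tokens
        self.global_attn = nn.TransformerEncoder(                     # stand-in for the CLIP stage
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, image):
        x = self.local(image)                 # (1, dim, 64, 64) for a 1024x1024 input
        x = self.compress(x)                  # (1, dim, 16, 16): the 16x token reduction
        x = x.flatten(2).transpose(1, 2)      # (1, 256, dim) sequence of compressed tokens
        return self.global_attn(x)            # global attention relates everything to everything

tokens = ToyDeepEncoder()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)                           # torch.Size([1, 256, 256])
```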
The other thing they do is set it up so it can handle different zoom levels within the same system. So they have a version that outputs 64 tokens, which they call Tiny mode, a hundred tokens for Small mode, 256 for Base mode, right up to Gundam mode, which is around 1,800 tokens.
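As a quick reference, here's that mapping of modes to approximate vision-token budgets, plus a rough text-token equivalent using the ~10x compression figure from earlier; the exact numbers are approximate and the per-mode equivalence is my extrapolation, not a claim from the paper.

```python
# Approximate vision-token budgets per mode, as quoted above.
MODES = {"tiny": 64, "small": 100, "base": 256, "gundam": 1800}

for name, vision_tokens in MODES.items():
    # Rough text capacity if the ~10x compression point holds.
    print(f"{name:>7}: ~{vision_tokens} vision tokens ~= {vision_tokens * 10} text tokens at 10x")
```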
To give you a real-world comparison, the old way of doing this might need 6,000 tokens to represent a document going into your large language model, and that obviously uses a lot of computational power and memory. This DeepEncoder approach can do the same thing with under 800 vision tokens, and it can actually get better performance.
Okay.
Now, at this point I should point out that what they're actually demonstrating is on OCR tasks; they're proving this idea of compression in a somewhat theoretical way. So we don't know for sure how this is going to pan out, or whether we could actually use, say, 500,000 vision tokens to replace 5 million text tokens. At the moment it's still early research, but they have shown it with this OCR task, and it works pretty well there, so it can certainly work for OCR.
If we look at the OCR benchmarks they've got here, we can see that the model maintains over 95% accuracy as long as the compression stays at around 10x or below. So just to finish this up, I would say DeepSeek OCR is not really just a model for doing OCR.
We've got a whole new vision encoder out of this, and we've also got a really interesting decoder, which is a DeepSeek 3B MoE with only around 570 million active parameters.
But for me, this is so much more than just an OCR paper.
It's really about an interesting idea: how can we take text tokens and store them as vision tokens, and how can we extract more information out of vision tokens than we currently get out of text tokens?
So overall this is a pretty cool concept, and it really reinforces a lot of what DeepSeek has been doing over the past year and a half or so: trying out different ideas rather than following what everyone else is doing. That certainly gave them big wins with DeepSeek R1, and you've got to wonder where they go with this, whether we could actually build systems with the equivalent of 10 to 20 million text-token context windows, but through vision tokens.
Anyway, check out the paper and have a look at what they've actually released. The code is up on GitHub, and the model, I think, is up on Hugging Face as well, although it seems like Hugging Face has been having some problems serving models today, along with a lot of other people.
I may revisit this to check how well it actually works for real-world tasks, comparing it to some of the other OCR systems that have come out recently, like Nanonets OCR 2 and PaddleOCR-VL, which are both very interesting, very small models built just for doing OCR.
So as always, let me know in the comments if you've got any questions, and I will talk to you in the next video.
Bye for now.