DeepSeek finally released a new model and paper. Because this DeepSeek-OCR release is a bit different from what everyone expected, and because DeepSeek releases are generally a big deal, I wanted to write a brief explainer of what it is all about.
In short, they explore how vision encoders can improve the efficiency of LLMs at processing and compressing textual information. The takeaway is that rendering text as images and feeding those images to the model results in more efficient compression than working with the text directly.
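To make the core idea concrete, here is a minimal sketch (not DeepSeek-OCR's actual pipeline) of the first step: rendering a chunk of text as a page image, which a vision encoder can then compress into a comparatively small number of vision tokens. The font, page width, and wrapping below are arbitrary choices of mine.

```python
# Minimal sketch, not DeepSeek-OCR's pipeline: render text onto a white "page"
# image that a vision encoder could consume. Layout parameters are arbitrary.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_to_image(text: str, width: int = 1024, margin: int = 16,
                         line_height: int = 18) -> Image.Image:
    font = ImageFont.load_default()
    lines = textwrap.wrap(text, width=100)
    height = 2 * margin + line_height * len(lines)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return img

page = render_text_to_image("An image can say more than a thousand words. " * 50)
page.save("page.png")  # this rendered page, not the raw text, becomes the model input
```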
My first intuition was that this sounds very inefficient and shouldn't work as well as using text tokenizers (or alternatives like the Byte Latent Transformer) to prepare the input. It actually reminded me of a line of research I saw years ago, where researchers represented 3D molecules as 3D inputs or 2D images for ConvNets instead of using graph neural networks; intuitively, that shouldn't work well and should be prone to overfitting.
In the case of DeepSeek-OCR, why even try such an approach? I imagine it started as a curiosity, but then it may have turned into an interesting idea for long-context scaling in LLMs, i.e., how to make long contexts cheaper by using vision tokens and representations. (An image can say more than a thousand words, but who would have thought that an image of text can say 1000 words more efficiently!)
In any case, this DeepSeek-OCR approach turns out to be surprisingly efficient. In particular, they found that at a fixed decoding precision of 97% (i.e., measuring how well the model can compress information into a latent representation and then reconstruct it), the vision-based version needed 10 times fewer vision tokens than the corresponding number of text tokens. In other words, the rendered-image version can compress the information 10x better than the plain-text version.
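The 10x figure is simply the ratio of token counts at matched reconstruction quality. A back-of-the-envelope illustration, with made-up round numbers chosen only to match that ratio:

```python
# Illustrative arithmetic only; the token counts below are invented round numbers.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token effectively stands in for."""
    return text_tokens / vision_tokens

# e.g., a page that takes ~1,000 text tokens but can be reconstructed at ~97%
# precision from ~100 vision tokens corresponds to a 10x compression ratio.
print(compression_ratio(1000, 100))  # 10.0
```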
How does it differ from other VLLM architectures?
- They don't use a monolithic ViT as the encoder; instead, they fuse local and global vision features through a clever 16x convolutional compressor, which handles high-resolution inputs efficiently in terms of both memory and token counts (see the sketch after this list).
- They are (to the best of my knowledge) the first to use an MoE model as the decoder.
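To illustrate the first point, here is a rough sketch of what a 16x token compressor can look like. This is not DeepSeek-OCR's actual module; it assumes "16x" refers to the token count, i.e., each spatial side is reduced 4x by two stride-2 convolutions, so 4,096 patch features become 256 before they reach the global encoder and the decoder.

```python
# Rough sketch of a 16x token compressor, not DeepSeek-OCR's actual code.
# Two stride-2 convolutions halve each spatial side twice (4x per side), so the
# number of flattened tokens drops by 16x.
import torch
import torch.nn as nn

class ConvCompressor(nn.Module):
    def __init__(self, dim_in: int = 256, dim_out: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim_in, dim_out // 2, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim_out // 2, dim_out, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the local, high-resolution encoder
        return self.net(x)  # (B, dim_out, H/4, W/4)

feats = torch.randn(1, 256, 64, 64)   # 64 * 64 = 4096 patch "tokens"
compressed = ConvCompressor()(feats)
print(compressed.shape)               # torch.Size([1, 1024, 16, 16]) -> 256 tokens
```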
I think it's an interesting, refreshing approach, and the twist here is that it works surprisingly well.
However, I don't think that visual representations of text will solve the limitations of LLMs. And while it is popular to dislike text tokenizers like BPE, image representations are messy as well: one has to deal with aspect ratios, resolutions, cropping, color intensity variations, brightness levels, and so on. Still, it's an interesting idea. And if this approach is already this efficient with regular black-and-white text, I am curious to see the compression ratios once we add syntax highlighting to code.
Regarding code, this may be an interesting alternative for storing contextual information, as whitespace and subword tokenization remain challenges for traditional tokenizers, especially when working with code that uses many custom variable names that may not be well represented in the vocabulary and therefore have to be broken down into many individual subword tokens.
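As a quick illustration of that tokenizer pain point, here is how a standard BPE tokenizer splits an unusual identifier into many subword tokens. This uses OpenAI's open-source tiktoken with the cl100k_base vocabulary purely as an example; DeepSeek's own tokenizer will split things differently.

```python
# Illustration of subword fragmentation, using tiktoken's cl100k_base vocabulary
# as a stand-in; exact splits differ across tokenizers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
identifier = "my_custom_quaternion_interpolation_helper_v2"  # hypothetical variable name
token_ids = enc.encode(identifier)
print(len(token_ids))                        # one identifier -> many tokens
print([enc.decode([t]) for t in token_ids])  # the individual subword pieces
```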
Overall, encoding text in images is still such an esoteric concept that I am surprised it works this well (and maybe it only makes sense for very long documents or special domains like OCR or code, not general language modeling).
(PS: Personally, I expected the DeepSeek team to follow up with a V4 model using the sparse attention mechanism they recently tried in V3.2, but maybe that's still forthcoming. Now, after reading this paper, perhaps V4 is going to be a VLLM.)