
"US innovates, Chinese optimizes." This phrase describes the two distinct paths the two superpowers are taking in technological development, this time again in AI.
DeepSeek released DeepSeek-OCR yesterday, and I believe it opens up a crucial new direction for the continued optimization of AI.
AI's scaling laws have, in fact, already hit a wall. Especially after the release of GPT-5, people realized that compared to the magical leap from GPT-3 to GPT-4, GPT-5 did not repeat that same sense of wonder. This has forced everyone to acknowledge that the marginal returns of scaling are diminishing rapidly. As a result, competition among large AI models has recently shifted toward finer-grained optimization. As for whether Gemini 3 will bring any new surprises, we'll just have to wait for Google to reveal it.
When chatting with AI, everyone has surely noticed a phenomenon: the longer the conversation, the "weirder" the AI's responses become, and it even starts to forget things said earlier. Eventually it gets unbearable, you open a new chat, and you find that a fresh conversation is noticeably better.
But this isn't actually a bug; it's an unsolved problem: AI cannot process excessively long contexts. Simply put, it's like being asked to remember every single word in a book. With every new sentence, your brain has to re-process all the previous content, and you would "crash" very quickly. AI is the same: in a standard transformer, attention compares every token with every other token, so the computational load snowballs roughly with the square of the context length, and eventually the memory simply blows up.
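To make that snowball concrete, here is a rough back-of-the-envelope sketch (my own illustration, not from the paper) of how attention compute and memory grow with context length in a standard transformer; the layer count and model width below are made-up round numbers.

```python
# Rough illustration (not from the paper): why long contexts "snowball".
# Self-attention compares every token with every other token, so compute
# grows roughly with the square of context length, while the KV cache
# grows linearly per layer.

def attention_cost(context_len, n_layers=32, d_model=4096):
    # Pairwise attention scores: ~ L^2 * d per layer (constant factors omitted)
    flops = n_layers * context_len**2 * d_model
    # KV cache: 2 tensors (K and V) * L * d per layer, stored in fp16 (2 bytes)
    kv_bytes = n_layers * 2 * context_len * d_model * 2
    return flops, kv_bytes

for L in (1_000, 10_000, 100_000):
    flops, kv = attention_cost(L)
    print(f"{L:>7} tokens -> ~{flops:.1e} attention FLOPs, ~{kv/1e9:.1f} GB KV cache")

# Going from 1k to 100k tokens multiplies attention compute by ~10,000x,
# while the KV cache grows "only" 100x: the quadratic term dominates.
```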
Therefore, processing long contexts isn't impossible, but with current computation methods, you would get an AI that responds so slowly you wouldn't even want to use it, drastically reducing its practicality.
And DeepSeek has proposed a unique idea: "take pictures" of old conversations and store them.
Initially, this idea certainly raises skepticism. Converting text to an image, and then having the AI "describe the picture" to restore the content—wouldn't a lot of information be lost in the process? Besides, don't images take up more space than text?
But the results from the DeepSeek team were somewhat unexpected. They found that a page containing about 1,000 text tokens can be restored with over 97% accuracy using only 100 "visual building blocks" (technically called vision tokens). What does this mean? It's equivalent to compressing a 100,000-token conversation history into 10,000 "photo fragments," and the AI can recall what you talked about just by looking at these fragments.
Even more surprisingly, even when pushing the compression ratio to 20x (only 50 vision tokens for those 1,000 text tokens), accuracy still held at around 60%. That number might not seem high, but think about it: it's like recalling the details of a conversation from a month ago. Remembering 60% is already quite impressive.
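A quick sanity check of those figures, using nothing but the numbers quoted above:

```python
# Back-of-the-envelope check of the compression figures quoted above.
text_tokens = 1_000

ten_x = text_tokens / 100    # 100 vision tokens -> 10x compression, ~97% accuracy
twenty_x = text_tokens / 50  # 50 vision tokens  -> 20x compression, ~60% accuracy
print(ten_x, twenty_x)       # 10.0 20.0

# Scaled up: a 100,000-token history at 10x compression
history = 100_000
print(history // 10)         # 10,000 vision tokens ("photo fragments")
```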
However, it needs to be clarified that these tests were conducted in OCR scenarios—that is, restoring text from images. Whether this compression method can maintain the same effectiveness in real-world scenarios like multi-turn conversations, code discussions, or complex reasoning has not yet been fully validated. The paper also admits these are only preliminary results.
The Ingenuity Behind the Technology
To achieve this "photo memory" function, the team designed a compression engine called DeepEncoder. The thinking behind this design isn't complex; it's like a three-stage process in a factory.
The first stage is responsible for looking at details, like a quality inspector with a magnifying glass checking every corner of a product. But it's smart—it only looks at local regions and doesn't try to load the entire product into its brain at once, so it's fast and effortless. This part uses only 80M parameters, making it very lightweight.
The second stage is a compression expert, reducing the information volume to 1/16th of the original in one go. It sounds aggressive, but the actual information loss is minimal. It's like skillfully assembling 4,096 small puzzle pieces into 256 large pieces, retaining the main content of the picture.
The third stage is responsible for seeing the big picture and understanding the meaning of the entire image. Because the first two stages have already compressed the information so much, processing the global perspective at this point won't "blow up the memory." This part uses 300M parameters. The entire compression engine is about 380M.
The ingenuity of this design lies in using a lightweight method where fine-grained processing is needed, and by the time it needs to understand the big picture, the information is already compressed. This avoids the dilemma of traditional methods, which either can't see details clearly or see them too clearly and cause a memory explosion.
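To make the three-stage idea more tangible, here is a toy PyTorch sketch of the pipeline as described above: a lightweight local stage, a 16x token compressor, then global attention over the already-reduced tokens. It is a conceptual illustration only, not DeepSeek's actual architecture; the module choices and dimensions are placeholders.

```python
# Conceptual sketch of the three-stage idea described above -- NOT DeepSeek's
# real code. Module choices and dimensions are placeholders for illustration.
import torch
import torch.nn as nn

class ToyDeepEncoder(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # Stage 1: local, patch-level feature extraction (kept lightweight,
        # so it never attends over the whole image at once).
        self.local = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # 1024x1024 -> 64x64 patches
        # Stage 2: 16x token compression (here: 4x4 spatial pooling, 4096 -> 256 tokens).
        self.compress = nn.Conv2d(dim, dim, kernel_size=4, stride=4)
        # Stage 3: global attention over the already-compressed tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.global_attn = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, image):                 # image: (B, 3, 1024, 1024)
        x = self.local(image)                 # (B, dim, 64, 64)  -> 4096 local tokens
        x = self.compress(x)                  # (B, dim, 16, 16)  -> 256 tokens after 16x reduction
        x = x.flatten(2).transpose(1, 2)      # (B, 256, dim)
        return self.global_attn(x)            # global context over only 256 tokens

vision_tokens = ToyDeepEncoder()(torch.randn(1, 3, 1024, 1024))
print(vision_tokens.shape)                    # torch.Size([1, 256, 768])
```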
Paired with a 3B-parameter Mixture-of-Experts language model as the decoder (only about 570M parameters are activated at inference), the entire system runs efficiently on a single A100 GPU.
Therefore, from an engineering perspective, this model's throughput is truly impressive. A single A100 can process 200,000 pages per day, and a 20-node cluster can handle around 33 million pages a day. This efficiency boost is very valuable for scenarios requiring massive document processing, such as preparing training data for large models, building corporate knowledge bases, and so on.
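The arithmetic behind those throughput figures, assuming a typical 8-GPU A100 node (the per-node GPU count is my assumption, not stated above):

```python
# Throughput arithmetic for the figures above. The 8-GPUs-per-node figure is
# an assumption (a typical A100 server), used to check whether the numbers line up.
pages_per_gpu_per_day = 200_000
gpus_per_node = 8          # assumption
nodes = 20

total = pages_per_gpu_per_day * gpus_per_node * nodes
print(f"{total:,} pages/day")   # 32,000,000 -- consistent with the ~33M figure
```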
And the team has open-sourced the code and model weights, lowering the barrier to entry. However, it's important to note that the model has not undergone Supervised Fine-Tuning (SFT), so using it requires familiarity with specific prompt formats. For developers who want to integrate it directly into a product, some adjustment work is still needed.
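For developers who want to experiment, a minimal loading sketch with Hugging Face Transformers might look like the following. The repository id is real, but the exact prompt format and inference entry point should be taken from the official model card rather than from this sketch.

```python
# Minimal loading sketch (illustrative only). Because the model has no SFT,
# the prompt wording matters; copy the prompt format and inference example
# from the official deepseek-ai/DeepSeek-OCR model card.
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()  # move to GPU as needed
```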
But even so, this is already an astonishing optimization achievement.
One Model, Multiple Uses
DeepSeek-OCR has other clever touches, such as an architecture that isn't a rigid "one-size-fits-all" design. It offers multiple modes, just as a camera has different shooting modes.
For a simple slide, the "Tiny" mode is sufficient: 512×512 resolution, requiring only 64 vision tokens. But for a complex newspaper layout, it can switch to "Gundam" mode: using multiple local views plus a global view, it can handle it with about 800 tokens.
This flexibility is important. Think about organizing documents: some are simple notes, others are dense academic papers. Traditional methods would process everything with the same "high-spec" settings, which is wasteful. DeepSeek-OCR can instead match the compression strength to the complexity of the content, saving resources when it can and spending more when needed.
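As a toy illustration of how such mode selection could work: only the Tiny and Gundam numbers come from the text above; the token-ratio heuristic is invented purely for illustration.

```python
# Toy illustration of mode selection. Only the Tiny and "Gundam" figures come
# from the article; the heuristic below is invented for illustration.
MODES = {
    "tiny":   {"resolution": (512, 512), "vision_tokens": 64},    # simple slides
    "gundam": {"resolution": "multi-view", "vision_tokens": 800}, # dense layouts
}

def pick_mode(estimated_text_tokens: int) -> str:
    # Hypothetical rule of thumb: keep roughly a 10x text-to-vision ratio.
    return "tiny" if estimated_text_tokens <= 640 else "gundam"

print(pick_mode(300))    # tiny   -- a sparse slide
print(pick_mode(5000))   # gundam -- a dense newspaper page
```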
The experimental results actually reveal a rule: the limit of compression depends on the complexity of the content. Simple content can be compressed aggressively; complex content needs more space. Isn't this exactly how human memory works?
The Most Interesting Idea: AI Should Also Learn to "Forget"
And this is the most inspiring concept in the paper: "Let AI forget like a human."
First, think about how your own brain works. You can definitely repeat what was just said, word-for-word. You remember the gist of a conversation from an hour ago. For yesterday's events, you might only remember key fragments. Last week's discussion has become hazy. Many details from last month's conversations are completely forgotten.
DeepSeek proposes applying the same method to AI's memory: recent conversations are kept as original text, unchanged. Content from an hour ago is converted to a high-definition "photo," stored with 800 tokens. This morning's conversation is downgraded to standard-definition, 256 tokens. Yesterday's becomes low-resolution, 100 tokens. Last week's is even blurrier, 64 tokens. Anything older is either extremely compressed or simply discarded.
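The paper proposes this only as an idea, but a decay schedule like the one described could be sketched roughly as follows; the tier boundaries and token budgets mirror the paragraph above, while the data structure itself is my own invention.

```python
# Sketch of the "fading memory" schedule described above. Tier boundaries and
# token budgets mirror the paragraph; the data structure itself is invented.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemoryTier:
    max_age_hours: float
    vision_tokens: Optional[int]   # None = keep as raw text, 0 = discard

TIERS = [
    MemoryTier(0.25, None),           # the last few minutes: raw text, uncompressed
    MemoryTier(1,    800),            # about an hour ago: high-definition "photo"
    MemoryTier(12,   256),            # earlier today: standard definition
    MemoryTier(24,   100),            # yesterday: low resolution
    MemoryTier(168,  64),             # last week: blurry thumbnail
    MemoryTier(float("inf"), 0),      # older: extremely compressed or dropped
]

def budget_for(age_hours: float) -> Optional[int]:
    for tier in TIERS:
        if age_hours <= tier.max_age_hours:
            return tier.vision_tokens
    return 0

print(budget_for(0.1), budget_for(3), budget_for(30), budget_for(400))
# None 256 64 0
```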
This design is much like the human brain's operation. And it creates a possibility: AI could handle a theoretically infinite-length conversation, because old memories automatically "fade," making room for new ones.
Of course, this mechanism would encounter problems in practice. For instance, how to determine which information is "important" and should be retained in high resolution? What if, in the 50th turn of the conversation, the user suddenly mentions a specific detail from the 5th turn, which has already been compressed into a blur? Perhaps it would require some kind of "memory importance scoring" mechanism, or allowing users to manually bookmark key information.
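One possible (entirely hypothetical) mitigation is to let important segments be pinned, whether by an importance score or by the user, so they are exempt from downgrading:

```python
# Hypothetical mitigation for the "compressed-away detail" problem:
# pinned segments are exempt from downgrading and stay at full resolution.
class MemorySegment:
    def __init__(self, text: str, pinned: bool = False):
        self.text = text
        self.pinned = pinned        # set by an importance score or by the user
        self.vision_tokens = None   # None = still raw text

    def downgrade(self, new_budget: int):
        if self.pinned:
            return                  # important details keep full resolution
        self.vision_tokens = new_budget

turn_5 = MemorySegment("the project deadline the user mentioned", pinned=True)
turn_6 = MemorySegment("small talk about the weather")
for seg in (turn_5, turn_6):
    seg.downgrade(64)
print(turn_5.vision_tokens, turn_6.vision_tokens)   # None 64
```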
With that memory freed up, the context length the model can handle could also be extended substantially.
China's AI Advantage in Cost Optimization
This research, then, showcases a defining characteristic of Chinese AI companies: an extreme capacity for cost optimization.
DeepSeek's previous V3 model achieved near-GPT-4 performance using 2.788 million H800 GPU hours (a training cost of about $5.57 million), which shocked the entire industry. This OCR model reflects the same line of thinking: finding ways to achieve the best effect with the fewest tokens.
Compared to the strategy of American AI companies, which tend to "achieve results by piling on resources," Chinese teams are more adept at deep optimization under resource constraints. This may be related to two factors: first, the innovation forced by limited access to computing power (GPU export controls), and second, an engineering culture that places greater emphasis on efficiency and cost control. OpenAI can burn money to train models; DeepSeek has to find a way to do it with fewer resources.
This difference is reshaping the global AI competition. While US companies are still competing over whose model is bigger and costs more to train, Chinese companies are already exploring how to achieve 90% of the effect at 1/10th of the cost. In the long run, this engineering optimization capability may be more competitive than sheer resource investment. This is especially true for commercial applications that require large-scale deployment, where cost control is often more important than extreme performance.
The Possibilities for DeepSeek-R2
If DeepSeek integrates this type of innovative technology into its next-generation reasoning model, R2, it could very well bring about some substantial changes.
R1 already proved that Chinese teams can reach a level close to the US in reasoning models, but its long-context processing is still limited by traditional architecture. If R2 integrates visual compression, MoE optimizations, and other yet-to-be-disclosed technologies, it could drastically reduce the computational cost of long context while maintaining reasoning ability. That wouldn't just be a performance increase, but an expansion of use cases.
Imagine a model that can remember dozens of conversation turns and process ultra-long contexts, all while keeping reasoning costs within an acceptable range. This would be a fundamental change for applications requiring long-term interaction, such as education, medical consultation, legal analysis, and more. And if the cost is low enough, it could democratize these capabilities, moving them from "big-company exclusive" to "affordable for small and medium-sized developers."
Judging from DeepSeek's past technical roadmap, they are indeed moving in a "more efficient, more practical" direction, rather than purely chasing benchmark numbers. V3 was like this, the OCR model is like this, and R2 will likely continue this line of thought. Of course, this is just speculation based on available information; we'll have to wait for the official release to know the actual results. But at least this direction is clear and supported by a technical foundation.
Questions That Still Need Answers
This research opens up a new direction, but it also leaves many questions unanswered.
First is generalizability. The OCR task is relatively simple: the input is an image, the output is text, and there's no complex logical reasoning in between. But in a real conversation, the AI needs to understand context, perform reasoning, and maintain conversational coherence. How visual compression performs in these scenarios still requires much more validation.
Second is the boundary of the compression ratio. The paper shows the best results at 10x compression, after which accuracy drops quickly. Does this mean 10x is a hard bottleneck? Or can this limit be broken by improving the encoder architecture? Or do different types of content have different optimal compression ratios?
Scenarios that require precision, like code discussions and mathematical derivations, are probably not suitable for high compression ratios. But casual chat and general discussions can tolerate more information loss. The future may require dynamically adjusting the compression strategy based on the conversation type.
There is also the practical question of latency. The entire process of converting text to an image, encoding, compressing, and decoding—what is the time cost? In scenarios requiring real-time responses, is this overhead acceptable? The paper did not discuss data on this aspect in detail.
Where Is the Theoretical Limit of Compression?
After seeing the experimental results, a natural question arises: Why is 10x compression the sweet spot? Is there a theoretical basis for this?
In information theory, there is a relevant result from Claude Shannon, the field's founder, published in 1948: the source coding theorem. Simply put, it tells us that any source of information has a theoretical minimum representation, determined by its entropy. You can't compress below that limit unless you are willing to accept information loss.
For example, if every word in a book is random and completely unpredictable, the book is essentially incompressible. But if the book is written in Chinese or English, it is full of patterns and repetition, so it can be compressed significantly; because a word like "the" appears so often, you can give it a shorter encoding.
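A quick way to see this in practice with an ordinary lossless compressor: patterned text shrinks dramatically, while random bytes barely shrink at all.

```python
# Quick demonstration of the point above with an ordinary lossless compressor:
# patterned text compresses well, random bytes barely compress at all.
import os
import zlib

english = b"the quick brown fox jumps over the lazy dog. " * 100
random_bytes = os.urandom(len(english))

print(len(zlib.compress(english)) / len(english))            # a few percent of the original
print(len(zlib.compress(random_bytes)) / len(random_bytes))  # about 1.0: no real gain
```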
Compression is divided into two types: lossless and lossy.
Lossless compression is like neatly folding clothes to fit in a suitcase. The space occupied is smaller, but when you take them out, every piece of clothing is exactly as it was. ZIP files are this type of compression; after decompressing, the file is identical to the original. But lossless compression has a limit; you can't compress it indefinitely.
Lossy compression is like reducing a photo from high-definition to low-definition. The file gets smaller, but some details are lost. JPEGs are like this; if you look closely, you'll find some areas are blurry, but most of the time you don't even care. Lossy compression can compress much more, but at the cost of information loss.
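The contrast can be shown in a few lines: a lossless round trip reproduces the input exactly, while a crude lossy stand-in (downsampling a signal) cannot.

```python
# Lossless vs lossy in miniature: zlib round-trips exactly, while downsampling
# (a crude stand-in for JPEG-style loss) does not.
import zlib
import numpy as np

text = b"compression keeps the gist, lossless keeps every bit " * 20
assert zlib.decompress(zlib.compress(text)) == text      # lossless: bit-for-bit identical

signal = np.sin(np.linspace(0, 20, 1000))                # stand-in for one row of an image
downsampled = signal[::10]                               # keep 1 sample in 10 (10x "compression")
restored = np.repeat(downsampled, 10)                    # naive reconstruction
print(float(np.abs(signal - restored).mean()))           # nonzero: some detail is gone for good
```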
DeepSeek-OCR is doing lossy compression. When it compresses 1,000 text tokens into 100 vision tokens, it inevitably discards some information. The question is, what information is being discarded?
Looking at the experimental results, at 10x compression, what's lost is mainly "redundant information": details that are not important for understanding the content. For example, the exact font, line spacing, and margins—these layout details are not critical for understanding the text. Just as when you read an article, you can understand the content whether it's in a serif or sans-serif font.
But when the compression ratio is pushed to 20x, it's no longer just redundant information being lost; some key content may also start to blur. Imagine compressing a photo so much that the text becomes difficult to recognize; that's when the accuracy drops to 60%.
There's an interesting observation: different types of content have different "compressibility."
A simple slide might only have a few large-font lines and some bullet points. The information density is low, and redundancy is high, so 64 tokens are enough. But a page of a dense newspaper has extremely high information density, and almost every word is useful. To compress it, you have to pay a higher price in precision, so it requires 800 tokens.
This is like compressing a solid blue image versus a landscape photo full of details. The former can be described in just a few words ("fill the entire frame with blue"), while for the latter, you have to record the color, texture, and light of every region.
From this perspective, the 10x "sweet spot" DeepSeek-OCR found is actually an empirical balance between "retaining enough information" and "saving enough resources." It's not a theoretical limit, but a practical choice. For most documents, 10x compression can retain 97% recoverability, a number high enough for practical application.
But this also means that for content with particularly high information density—like mathematical formulas, code, or legal statutes—10x might not be enough, requiring a more conservative compression ratio. For low-density content, like casual chat, it might be possible to compress it even more aggressively.
In the future, if this technology is to be used for long-term conversation memory, it might require designing a "content complexity assessor" to dynamically decide the compression ratio based on the information density of each conversation segment. Simple greetings and small talk could be compressed 20x; important technical discussions kept at 5x; and critical code snippets not compressed at all.
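A sketch of what such a "content complexity assessor" might look like: the compression ratios mirror the paragraph above, while the keyword-based classifier is purely illustrative.

```python
# Sketch of the "content complexity assessor" idea. The ratios mirror the
# paragraph above; the keyword-based classifier is purely illustrative.
from typing import Optional

def compression_ratio(segment: str) -> Optional[float]:
    text = segment.lower()
    if "```" in segment or "def " in segment:      # critical code: don't compress
        return None
    if any(w in text for w in ("theorem", "formula", "contract", "api")):
        return 5.0                                  # important technical discussion
    return 20.0                                     # greetings and small talk

print(compression_ratio("hi, how was your weekend?"))          # 20.0
print(compression_ratio("here's the API design we agreed on")) # 5.0
print(compression_ratio("def fade(memory): ..."))              # None
```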
This dynamic compression strategy is perhaps the truly practical solution for the future. Just like human memory, we automatically assess which information is important and which is not, and then store it at different "resolutions."
Rethinking the Meaning of "Memory"
Human memory has never operated like a traditional computer, recording everything. We remember impressions, key information, and emotional connections, not verbatim transcripts. We forget details but retain what's important. We re-encode memories, storing them in more efficient ways.
DeepSeek-OCR offers a viable path: when processing long contexts, we don't have to stick to a pure-text method; a visual representation might be a more efficient choice. Converting memory into a visual representation is like humans converting experiences into mental images. This is not only more efficient but also seems closer to how biological intelligence operates.
However, whether this idea holds up in broader scenarios still needs time to be validated. But it proves one thing: even with limited resources, it is still possible to build competitive systems through deep thinking about the essence of the problem, clever architectural design, and meticulous optimization of every stage. This is perhaps a microcosm of China's AI development: winning not by piling on resources, but by focusing on engineering optimization.
The next time you're chatting with an AI and it "forgets" your previous conversation, perhaps a future AI will reply: "I didn't forget. I just took a picture of our conversation and stored it deep in my memory. I can pull it out and look at it anytime you need."
At that time, the conversation between AI and humans might become far more natural and persistent.