What Did DeepSeek Really Open-Source? Where's the Truth?

Is DeepSeek Truly Open Source? Debunking the Debate

I see people starting to debate whether DeepSeek is truly open source, shouting that the company has only released the model weights and explained its training methods in papers, and claiming this isn't real open source! Where's the training code? Where's the training dataset?

Understanding the Core of Open-Source AI: The Power of Open Weights

Let's break this down.
The most important part of an AI model is its weights. Once the weights are released, you can do almost everything else with the model. Since some engineers without AI backgrounds have questioned my technical knowledge, let me list some of the things you can do with open weights.

Unlocking Potential: How Open Weights Empower Innovation

When an AI model's weights are released, developers and researchers can modify and optimize the model. These modifications and optimizations focus on several aspects that "real AI practitioners" care about most: performance optimization, architectural improvements, model analysis, and deployment integration.

The core value of open weights for performance optimization is that they let us develop more precise optimization strategies by analyzing the actual weight data.

Deep Dive into Weights: Optimizing Quantization and KV Cache

Once we have the model's actual weight data, we can start by analyzing the value ranges and distribution characteristics of the weights in each network layer. This information directly shapes the quantization plan: for example, we can decide which layers are suited to FP16 acceleration and which need to stay in FP32 because their values are sensitive to precision loss. Likewise, when implementing quantization, we can choose a per-layer quantization scheme based on the actual weight distributions, reducing precision while preserving model performance.
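To make this concrete, here is a minimal sketch of that kind of per-layer weight inspection, assuming PyTorch and the Hugging Face transformers library. The checkpoint name and the outlier threshold are illustrative choices of mine, not anything prescribed by DeepSeek.

```python
# Sketch: per-layer weight statistics from open weights (assumes PyTorch + transformers).
# The model id below is an assumed example; any open checkpoint works the same way.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # assumed example checkpoint
    torch_dtype=torch.float32,
)

for name, param in model.named_parameters():
    w = param.detach().float()
    w_min, w_max, w_std = w.min().item(), w.max().item(), w.std().item()
    # Crude heuristic: layers with heavy outliers may need higher precision
    # or per-channel quantization; well-behaved layers are FP16/INT8 candidates.
    outlier_ratio = (w.abs() > 6 * w.std()).float().mean().item()
    sensitive = outlier_ratio > 1e-4
    print(f"{name:60s} range=({w_min:+.3f}, {w_max:+.3f}) "
          f"std={w_std:.4f} sensitive={sensitive}")
```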

Another optimization frequently discussed recently is the KV cache, and optimizing it relies heavily on understanding weight characteristics. By analyzing the weight structure, we can estimate cache requirements more accurately and tune cache strategies around those characteristics. A recent paper (less than a month old) worth referencing is "PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving". That leads to the topic of distributed LLM deployment and using parallel computing to improve the overall performance of a large-model service.
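As a back-of-the-envelope illustration, the KV cache footprint can be estimated from a few architectural numbers that ship with the released weights. The sketch below uses placeholder values, not DeepSeek's actual configuration.

```python
# Sketch: estimating KV-cache memory per request from model architecture.
# All numbers are illustrative placeholders, not DeepSeek's actual configuration.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Keys + values, for every layer, every KV head, every position."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len

# Example: a hypothetical 32-layer model with 8 KV heads of dimension 128,
# serving a 32k-token context in FP16 (2 bytes per value).
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                       seq_len=32_768, bytes_per_value=2)
print(f"KV cache per request: {total / 2**30:.2f} GiB")
```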

Parallel Computing and Model Compression: Scaling and Shrinking AI

When designing an LLM parallel computing strategy (basically splitting the brain into small pieces across different GPUs), the results of weight analysis are crucial. By analyzing the weight size and computational complexity of different layers, we can find optimal split points that balance the load across GPUs. A split based on actual weight characteristics is more effective than simply dividing by layer count, as the sketch below illustrates. Without open weights, we can only design optimization strategies from paper descriptions or external observation, which makes real optimization difficult.
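Here is a simplified sketch of that idea: group layers into pipeline stages so each GPU holds roughly the same number of parameters, rather than the same number of layers. Real splits also weigh activation memory and compute, and the helper function and numbers here are my own illustration.

```python
# Sketch: greedy pipeline split that balances parameter count per GPU,
# instead of naively giving each GPU the same number of layers.
# `layer_param_counts` would come from inspecting the open weights.
from typing import List

def balanced_split(layer_param_counts: List[int], num_gpus: int) -> List[List[int]]:
    """Greedily assign consecutive layers to stages of roughly equal size."""
    total = sum(layer_param_counts)
    target = total / num_gpus
    stages, current, current_size = [], [], 0
    for idx, count in enumerate(layer_param_counts):
        current.append(idx)
        current_size += count
        remaining_stages = num_gpus - len(stages)
        # Close the stage once it reaches its share, keeping enough layers
        # in reserve so every remaining GPU still gets at least one layer.
        if (current_size >= target and remaining_stages > 1
                and len(layer_param_counts) - idx - 1 >= remaining_stages - 1):
            stages.append(current)
            current, current_size = [], 0
    stages.append(current)
    return stages

# Hypothetical per-layer parameter counts, in millions.
counts = [250, 120, 120, 120, 180, 180, 180, 180, 120, 120, 120, 310]
print(balanced_split(counts, num_gpus=4))
```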

Another example is model compression, which aims to reduce model size without significantly impacting intelligence. For instance, INT8 or INT4 quantization techniques can convert original FP32 weights to low-precision integers, significantly reducing model size.
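For intuition, the core of INT8 quantization fits in a few lines. This is a per-tensor symmetric scheme for illustration only; production quantizers are usually per-channel and handle outliers more carefully.

```python
# Sketch: symmetric per-tensor INT8 quantization of an FP32 weight tensor.
# Real quantizers are typically per-channel and also calibrate against activations.
import torch

def quantize_int8(weight: torch.Tensor):
    scale = weight.abs().max() / 127.0            # map the largest magnitude to 127
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                       # stand-in for a real weight matrix
q, scale = quantize_int8(w)
err = (dequantize(q, scale) - w).abs().mean()
print(f"INT8 storage is 4x smaller than FP32; mean abs reconstruction error: {err:.5f}")
```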

Finally, knowledge distillation involves training a smaller model to mimic the behavior of a large model, greatly reducing computational requirements while maintaining core functionality. Distillation isn't plagiarism – it's a well-established method in the AI industry that all AI practitioners use and will continue to use.
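The heart of distillation is similarly compact: train the student to match the teacher's softened output distribution. Below is a minimal sketch of the standard soft-label (KL-divergence) loss; the logits are random placeholders rather than outputs of any real teacher or student.

```python
# Sketch: the standard soft-label distillation loss (KL on temperature-scaled logits).
# `teacher_logits` and `student_logits` would come from running both models on the
# same batch; everything here is a placeholder for illustration.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy example: a batch of 4 positions over a 32k-token vocabulary.
student_logits = torch.randn(4, 32_000, requires_grad=True)
teacher_logits = torch.randn(4, 32_000)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(f"distillation loss: {loss.item():.4f}")
```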

How would you easily achieve any of this without open weights?

The Real Question: What's the Controversy Around R1?

So what are people questioning about R1?
The main concern is: "Is the training cost really as low as claimed?" First, I believe the training cost isn't that low, and the claimed figure likely understates it. But that's a completely separate issue from open model weights, so don't conflate the two.

Verifying R1's Training Cost: The Hugging Face Initiative

To verify how much R1 actually cost to train, Hugging Face launched the "Open R1" project last week. The goal is to reproduce the training method described in the paper and establish a plausible range for the training cost. Training cost matters because it bears on the question "Do we really need this many high-end Nvidia chips to train AI models?" If it turns out that training R1 still requires a large number of high-end chips, we owe Nvidia an apology and should stock up on its shares; if it doesn't, then the market's sell-off of Nvidia makes sense. But Jensen Huang is a resilient entrepreneur, and I still believe Nvidia is bearish short-term and bullish long-term. The only uncertainty is whether Trump will squeeze Huang's neck, but I won't elaborate on that.

The Future of AI Training Costs: A Continuing Trend

Although Hugging Face's results aren't out yet, I expect the outcome to land somewhere in the middle: we still need high-end chips, just perhaps not as many as before. Looking at DeepSeek's paper, they did propose many training optimizations, and OpenAI researchers have acknowledged that the DeepSeek team independently discovered some of the same techniques used to optimize o1. Hugging Face's reproduction project has only finished its first week. Until the reproduction is complete and preliminary conclusions are reached, any cost estimate… is just unverified speculation.

Moreover, as more optimization tricks emerge, model training costs will keep falling. That trend will continue regardless, which is why the exact amount DeepSeek spent on training matters mainly for the reasons described above.

Reconstructing the Dataset: The Open-Source Approach to Reverse Engineering

Lastly, what about those still shouting "Where's the dataset?" I must emphasize again that the most important part of a model is its weights. Now that training is complete, people are mainly focused on how to apply R1's weights, and many studies are already underway. If you really care about the dataset, Hugging Face is taking a constructive approach there too: they're reverse engineering R1 to synthesize the training data used for reasoning. The significance of this work is that the reconstructed reasoning dataset can then be used by any AI practitioner, benefiting the open-source community once again.
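As a very rough sketch of the general idea (not Hugging Face's actual Open R1 pipeline), one can prompt a released R1-style checkpoint with problems and collect its reasoning traces into a synthetic dataset. The model id, prompt, and output format below are all assumptions for illustration.

```python
# Very rough sketch: prompt a released R1-style model and collect its reasoning
# traces as synthetic training data. This is NOT Hugging Face's actual Open R1
# pipeline; the model id, prompt, and file format are placeholders.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"   # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

problems = ["If a train travels 120 km in 1.5 hours, what is its average speed?"]

records = []
for problem in problems:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": problem}],
        tokenize=False, add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    records.append({"problem": problem, "reasoning_trace": completion})

# Store the collected traces as a simple JSONL dataset.
with open("synthetic_reasoning.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```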

(This article is translated from a Facebook post by Sega Cheng, co-founder and CEO of iKala.)