Deepseek

Observations

Deepseek R1 (and V3) is impressive. They have two major (public) innovations (it’s possible/likely the first exists in private at Google and OpenAI):

First. R1-Zero: Their paper outlines how they train a “reasoning model” (which just produces a lot of “chain of thought” reasoning before giving a final answer, and that tends to increase accuracy of responses at the cost of compute power). The key innovation is that they don’t need tons of sample reasoning texts, they can just say “think step by step” and reward correct answers and penalize incorrect answers and the model learns to produce useful reasoning. This matters because gathering training data is expensive and training against it can be brittle. That said, you still need all the existing training data to get the model to a point where it can learn from the “think step by step” feedback.

Second. They didn’t have access to the most powerful Nvidia GPUs and they did some impressive low level coding work to work around their compute limitations. If they had had the compute, they would have innovated somewhere more useful. It’s impressive, but the reason Silicon Valley is not innovating there is that you can throw money at that problem and it goes away so they can spend time elsewhere. Also, having a million H100s means that they can do tons of interesting experiments to try to improve performance, and they can afford to have a whole lot more failed training runs. When you’re trying to beat the best model, you don’t always manage on the first try. Do those experiments and extra training runs pay off? Clearly not as frequently as would be ideal, but it’s definitely still an edge (and the leaders are still the leaders).

Reflections on the Discourse

I think a lot of the discussion is missing the fact that Deepseek has not caught up to OpenAI by beating them at their own game. They are like a generic drug company that has copied the formula from the org that did the R&D. Reasoning models are not that interesting (we already knew that “think step by step” improved performance, this is just encoding that step into the model’s automatic response). That they did it cheaply is (1) misleading: they don’t mention the cost of training data, test runs, failed runs… and a lot of the cost of R1 is borne by V3, the cost of which is not mentioned. (2) it’s also almost certainly downstream of OpenAI and/or Google’s reasoning models (i.e., trained on the outputs from their models). You can cry me a river for OpenAI having their model “stolen” or Terms of Service violated or whatever, but I assume the complaint that it happened likely has some truth (I would guess actually Gemini’s reasoning model is a more likely target, since their “thinking” tokens are not obfuscated). That doesn’t mean Deepseek doesn’t have a good model (they do!), but it does mean they are still followers not leaders in the space.

Another detail that’s getting missed is that there are a bunch of different models that are all getting called “R1”. The actual R1 is a behemoth that you need a bunch of high-end GPUs to run. There are also smaller “distillations” based on models from Meta and Alibaba. These models don’t have the same performance as the “big daddy” R1. They’re impressive in their own right, but the ones that can be hosted on consumer hardware are not going to break any records. So the fact that Deepseek managed to train on lower power GPUs doesn’t mean you can run their model on a consumer GPU.

My Takeaways

I like that higher quality models will be available more cheaply. I think that this will democratize access even more. But I have used the Deepseek models a bit at this point (both the smaller distillations and the main R1), and I am still getting better results from Google’s latest reasoning model experiment (which was last updated a couple of weeks ago), so I’ll probably stick with that for now.