DeepSeek-R1-Zero_Apache License 2.0-飞桨AI Studio星河社区

---desc: |- deepseek-ai/DeepSeek-R1-Zero 深度求索推出的首款纯强化学习（RL）训练模型，无需监督微调（SFT）即可自主进化。通过GRPO算法激发多步验证与反思等推理行为，AIME 2024数学竞赛Pass@1准确率从15.6%跃升至71.0%。 support_training: 0 license: Apache License 2.0 dev_type:- notebook---# DeepSeek-R1## 1. 引言我们介绍了第一代推理模型DeepSeek-R1-Zero和DeepSeek-R1。DeepSeek-R1-Zero是一个通过大规模强化学习（RL）训练，无需监督微调（SFT）作为初步步骤的模型，在推理任务上表现出色。通过RL，DeepSeek-R1-Zero自然涌现出许多强大且有趣的推理行为。然而，DeepSeek-R1-Zero面临无限重复、可读性差和语言混合等挑战。为了解决这些问题并进一步提升推理性能，我们引入了DeepSeek-R1，它在RL之前加入了冷启动数据。DeepSeek-R1在数学、代码和推理任务上的性能与OpenAI-o1相当。为了支持研究社区，我们开源了DeepSeek-R1-Zero、DeepSeek-R1以及基于Llama和Qwen从DeepSeek-R1蒸馏出的六个密集模型。DeepSeek-R1-Distill-Qwen-32B在多个基准测试中超越了OpenAI-o1-mini，为密集模型树立了新的最先进结果。**注意：在运行DeepSeek-R1系列模型之前，我们建议您查阅[使用建议](#使用建议)部分。**## 2. 模型概述---**后训练：在基础模型上进行大规模强化学习**- 我们直接在基础模型上应用强化学习（RL），无需依赖监督微调（SFT）作为初步步骤。这种方法使模型能够探索链式思维（CoT）来解决复杂问题，从而开发出DeepSeek-R1-Zero。DeepSeek-R1-Zero展示了自我验证、反思和生成长CoT等能力，为研究社区树立了重要里程碑。值得注意的是，这是首次公开研究验证，仅通过RL（无需SFT）就可以激励大型语言模型（LLMs）的推理能力。这一突破为未来的研究铺平了道路。- 我们介绍了开发DeepSeek-R1的管道。该管道包含两个RL阶段，旨在发现改进的推理模式并与人类偏好对齐，以及两个SFT阶段，作为模型推理和非推理能力的种子。我们相信该管道将通过创建更好的模型惠及行业。---**蒸馏：小型模型同样强大**- 我们证明了大型模型的推理模式可以被蒸馏到小型模型中，与在小模型上通过RL发现的推理模式相比，性能更好。开源的DeepSeek-R1及其API将有助于研究社区在未来蒸馏出更好的小型模型。- 使用DeepSeek-R1生成的推理数据，我们对研究社区中广泛使用的几个密集模型进行了微调。评估结果表明，蒸馏出的小型密集模型在基准测试上表现出色。我们基于Qwen2.5和Llama3系列向社区开源了1.5B、7B、8B、14B、32B和70B的蒸馏检查点。## 3. 模型下载### DeepSeek-R1 模型<div align="center">| **模型** | **总参数数** | **激活参数数** | **上下文长度** | **下载** || :------------: | :------------: | :------------: | :------------: | :------------: || DeepSeek-R1-Zero | 671B | 37B | 128K | [🚀 PP飞桨](https://aistudio.baidu.com/modelsdetail/26221?modelId=26221) || DeepSeek-R1 | 671B | 37B | 128K | [🚀 PP飞桨](https://aistudio.baidu.com/modelsdetail/26219?modelId=26219) |</div>DeepSeek-R1-Zero 和 DeepSeek-R1 基于 DeepSeek-V3-Base 进行训练。有关模型架构的更多详细信息，请参考 [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) 存储库。### DeepSeek-R1-Distill 模型<div align="center">| **模型** | **基础模型** | **下载** || :------------: | :------------: | :------------: || DeepSeek-R1-Distill-Qwen-1.5B | [Qwen2.5-Math-1.5B](https://aistudio.baidu.com/modelsdetail/26305?modelId=26305) | [🚀 PP飞桨](https://aistudio.baidu.com/modelsdetail/26234?modelId=26234) || DeepSeek-R1-Distill-Qwen-7B | [Qwen2.5-Math-7B](https://aistudio.baidu.com/modelsdetail/26307?modelId=26307) | [🚀 PP飞桨](https://aistudio.baidu.com/modelsdetail/26236?modelId=26236) || DeepSeek-R1-Distill-Llama-8B | [Llama-3.1-8B](https://aistudio.baidu.com/modelsdetail/25956?modelId=25956) | [🚀 PP飞桨](https://aistudio.baidu.com/modelsdetail/26232?modelId=26232) || DeepSeek-R1-Distill-Qwen-14B | [Qwen2.5-14B](https://aistudio.baidu.com/modelsdetail/26294?modelId=26294) | [🚀 PP飞桨](https://aistudio.baidu.com/modelsdetail/26233?modelId=26233) || DeepSeek-R1-Distill-Qwen-32B | [Qwen2.5-32B](https://aistudio.baidu.com/modelsdetail/26297?modelId=26297) | [🚀 PP飞桨](https://aistudio.baidu.com/modelsdetail/26235?modelId=26235) || DeepSeek-R1-Distill-Llama-70B | [Llama-3.3-70B-Instruct](https://aistudio.baidu.com/modelsdetail/26124?modelId=26124) | [🚀PP飞桨](https://aistudio.baidu.com/modelsdetail/26231?modelId=26231) |</div>DeepSeek-R1-Distill模型基于开源模型并使用DeepSeek-R1生成的样本进行微调。我们稍微修改了它们的配置和分词器。请使用我们的设置来运行这些模型。## 4. 评估结果### DeepSeek-R1 评估对于所有模型，最大生成长度设置为32,768个令牌。对于需要采样的基准测试，我们使用温度0.6、top-p值0.95，并为每个查询生成64个响应以估计pass@1。<div align="center">| 类别 | 基准测试（指标） | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 ||----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------|| | 架构 | - | - | MoE | - | - | MoE || | 激活参数数 | - | - | 37B | - | - | 37B || | 总参数数 | - | - | 671B | - | - | 671B || 英语 | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | **91.8** | 90.8 || | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | **92.9** || | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | **84.0** || | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | **92.2** || | IF-Eval (Prompt Strict) | **86.5** | 84.3 | 86.1 | 84.8 | - | 83.3 || | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | **75.7** | 71.5 || | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | **47.0** | 30.1 || | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | **82.5** || | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | **87.6** || | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | **92.3** || 代码 | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | **65.9** || | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | **96.6** | 96.3 || | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | **2061** | 2029 || | SWE Verified (Resolved) | **50.8** | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 || | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | **61.7** | 53.3 || 数学 | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | **79.8** || | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | **97.3** || | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | **78.8** || 中文 | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | **92.8** || | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** || | C-SimpleQA (Correct) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |</div>### 蒸馏模型评估<div align="center">| 模型 | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating ||------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------|| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 || Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 || o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | **1820** || QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 || DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 || DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 || DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 || DeepSeek-R1-Distill-Qwen-32B | **72.6** | 83.3 | 94.3 | 62.1 | 57.2 | 1691 || DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 || DeepSeek-R1-Distill-Llama-70B | 70.0 | **86.7** | **94.5** | **65.2** | **57.5** | 1633 |</div>## 5. 聊天网站与API平台您可以在DeepSeek官方网站上与DeepSeek-R1聊天：[chat.deepseek.com](https://chat.deepseek.com)，并打开"DeepThink"按钮我们还在DeepSeek平台提供OpenAI兼容API：[platform.deepseek.com](https://platform.deepseek.com/)## 6. 本地运行方法### DeepSeek-R1模型请访问[DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3)仓库获取有关本地运行DeepSeek-R1的更多信息。**注意：Hugging Face的Transformers尚未直接支持。**### DeepSeek-R1-Distill模型DeepSeek-R1-Distill模型可以与Qwen或Llama模型相同的方式使用。例如，您可以使用[vLLM](https://github.com/vllm-project/vllm)轻松启动服务：```shellvllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager```您也可以使用[SGLang](https://github.com/sgl-project/sglang)轻松启动服务：```bashpython3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code --tp 2```### 使用建议**我们建议在使用DeepSeek-R1系列模型时（包括基准测试）遵循以下配置，以达到预期性能：**1. 将温度参数设置在0.5-0.7范围内（建议使用0.6）以防止无尽重复或不连贯的输出。2. **避免添加系统提示；所有指令应包含在用户提示中。**3. 对于数学问题，建议在提示中包含类似指令："请一步步推理，并将最终答案放在\\boxed{}中。"4. 评估模型性能时，建议进行多次测试并平均结果。此外，我们观察到DeepSeek-R1系列模型在响应某些查询时倾向于绕过思考模式（即输出"\<think\>\n\n\</think\>"），这可能对模型性能产生不利影响。**为确保模型进行充分推理，我们建议强制模型在每次输出的开头以"\<think\>\n"开始其响应。**## 7. 许可证此代码仓库和模型权重采用[MIT许可证](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/LICENSE)授权。DeepSeek-R1系列支持商业使用，允许任何修改和衍生作品，包括但不限于蒸馏用于训练其他LLMs。请注意：- DeepSeek-R1-Distill-Qwen-1.5B、DeepSeek-R1-Distill-Qwen-7B、DeepSeek-R1-Distill-Qwen-14B和DeepSeek-R1-Distill-Qwen-32B源自[Qwen-2.5系列](https://github.com/QwenLM/Qwen2.5)，原始许可证为[Apache 2.0许可证](https://huggingface.co/Qwen/Qwen2.5-1.5B/blob/main/LICENSE)，现在使用DeepSeek-R1精选的80万个样本进行微调。- DeepSeek-R1-Distill-Llama-8B源自Llama3.1-8B-Base，原始许可证为[llama3.1许可证](https://huggingface.co/meta-llama/Llama-3.1-8B/blob/main/LICENSE)。- DeepSeek-R1-Distill-Llama-70B源自Llama3.3-70B-Instruct，原始许可证为[llama3.3许可证](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/LICENSE)。## 8. 引用```@misc{deepseekai2025deepseekr1incentivizingreasoningcapability, title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning}, author={DeepSeek-AI}, year={2025}, eprint={2501.12948}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2501.12948}, }```## 9. 联系方式如果您有任何问题，请提出issue或通过[service@deepseek.com](service@deepseek.com)联系我们。