---
license: Apache License 2.0
tasks:
- Large Language Models
- Text Generation
- Question Answering
- Translation
- Summarization
- Text2Text Generation
training_framework: paddlenlp
---

# Qwen2.5-7B-Instruct-1M

<a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
    <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>

## Introduction

Qwen2.5-1M is the long-context version of the Qwen2.5 series models, supporting a context length of up to **1 million tokens (approximately 700k Chinese characters)**.
Compared to the Qwen2.5 128K version, Qwen2.5-1M demonstrates significantly improved performance in handling long-context tasks while maintaining its capability in short tasks.

The model has the following features:

- **Type**: Causal Language Models
- **Training Stage**: Pretraining & Post-training
- **Architecture**: Transformer-based with Rotary Position Embedding (RoPE), SwiGLU activation, RMSNorm, and Attention QKV bias
- **Parameters**:
  - **Total Parameters**: 7.61 billion (7.61B)
  - **Non-Embedding Parameters**: 6.53 billion (6.53B)
- **Structural Details**:
  - **Number of Layers**: 28
  - **Attention Heads (GQA)**: 28 query heads (Q) and 4 key-value heads (KV)
- **Context Length**:
  - **Full Support**: 1,010,000 tokens
  - **Generation Limit**: Maximum 8,192 tokens per generation
- **Deployment Recommendations**:
  - **Custom vLLM Framework**: Ensures efficiency and accuracy for long-context tasks through sparse attention and length extrapolation methods (see [this section](#processing-ultra-long-texts) for guidance)
  - **General Framework Limitations**: Accuracy degradation may occur for sequences exceeding 262,144 tokens in other frameworks that support Qwen2.5

## Requirements

The code of Qwen2.5 has been integrated into the latest PaddleNLP, and we advise you to install `paddlenlp>=2.7.0`; otherwise, you might encounter the following error:

```python
KeyError: 'qwen2_5'
```

## Quickstart

The following code snippet shows how to load the tokenizer and model and how to generate content:

```python
from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct-1M")
# If running on CPU, change float16 to float32.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct-1M", dtype="float16")

input_features = tokenizer("你好!请自我介绍一下。", 
return_tensors="pd")
outputs = model.generate(**input_features, max_new_tokens=128)
print(tokenizer.batch_decode(outputs[0], skip_special_tokens=True))
# ['我是一个AI语言模型,我可以回答各种问题,包括但不限于:天气、新闻、历史、文化、科学、教育、娱乐等。请问您有什么需要了解的吗?']
```

### Processing Ultra Long Texts

To enhance processing accuracy and efficiency for long sequences, we have developed an advanced inference framework based on vLLM, incorporating sparse attention and length extrapolation. This approach significantly improves model generation performance for sequences exceeding 256K tokens and achieves a 3 to 7 times speedup for sequences up to 1M tokens.

Here we provide step-by-step instructions for deploying the Qwen2.5-1M models with our framework.

#### 1. System Preparation

To achieve the best performance, we recommend using GPUs with the Ampere or Hopper architecture, which support optimized kernels.

Ensure your system meets the following requirements:

- **CUDA Version**: 12.1 or 12.3
- **Python Version**: >=3.9 and <=3.12

**VRAM Requirements**:

- For processing 1 million-token sequences:
  - **Qwen2.5-7B-Instruct-1M**: At least 120GB VRAM (total across GPUs)
  - **Qwen2.5-14B-Instruct-1M**: At least 320GB VRAM (total across GPUs)

If your GPUs do not have sufficient VRAM, you can still use Qwen2.5-1M for shorter tasks.

#### 2. Install Dependencies

For now, you need to clone the vLLM repository from our custom branch and install it manually. 
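Before building from source, you can optionally confirm that your interpreter satisfies the Python version window listed under System Preparation (a minimal sketch; only the version bounds come from this document):

```python
import sys

# The custom vLLM branch supports Python >= 3.9 and <= 3.12 (see System Preparation).
major_minor = sys.version_info[:2]
if (3, 9) <= major_minor <= (3, 12):
    print("Python version OK:", sys.version.split()[0])
else:
    print("Unsupported Python version for this setup:", sys.version.split()[0])
```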
We are working on getting our branch merged into the main vLLM project.

```bash
git clone -b dev/dual-chunk-attn git@github.com:QwenLM/vllm.git
cd vllm
pip install -e . -v
```

#### 3. Launch vLLM

vLLM supports offline inference as well as launching an OpenAI-compatible server.

**Example of Offline Inference**

```python
from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM
import paddle

# Initialize the parallel environment (4-GPU parallelism).
paddle.distributed.init_parallel_env()
paddle.set_device('gpu:0')

# Load the model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct-1M",
    from_aistudio=True,
    tensor_parallel_degree=4,  # 4-way tensor parallelism
    max_memory_alloc="16GB",   # per-GPU memory limit
    dtype='float16'            # FP16 mixed precision
)
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct-1M",
    from_aistudio=True
)

# Generation settings (aligned with the original sampling parameters).
generation_config = {
    'temperature': 0.7,
    'top_p': 0.8,
    'repetition_penalty': 1.05,
    'max_length': 512,
    'use_faster': True,        # enable the PaddlePaddle fast generation engine
    'use_fp16_decoding': True  # FP16 decoding optimization
}

# Build the chat input.
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Manually build the chat template (Paddle-compatible format).
formatted_prompt = "<|im_start|>system\n{}<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n".format(
    messages[0]['content'], messages[1]['content']
)

# Distributed generation.
with paddle.amp.auto_cast(enable=True):
    inputs = tokenizer([formatted_prompt], return_tensors="pd", padding=True)
    outputs = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        **generation_config
    )

# Decode and print the result.
generated_text = tokenizer.batch_decode(outputs[0], skip_special_tokens=True)[0]
print("Generated text:", generated_text.strip())
```

**Example of an OpenAI-like Server**

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
  --tensor-parallel-size 4 \
  --max-model-len 1010000 \
  --enable-chunked-prefill --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1

# --quantization fp8  # Enabling FP8 quantization for model weights can reduce memory usage.
```

You can then interact with the deployed model using curl or Python.

### Parameter Explanations

- **`--tensor-parallel-size`**
  - Set to the number of GPUs you are using. The 7B model supports up to 4 GPUs, and the 14B model up to 8 GPUs.
- **`--max-model-len`**
  - Defines the maximum input sequence length. Reduce this value if you encounter out-of-memory errors.
- **`--max-num-batched-tokens`**
  - Sets the chunk size for chunked prefill. A smaller value reduces activation memory usage but may slow down inference.
  - A value of 131072 is recommended for optimal performance.
- **`--max-num-seqs`**
  - Limits the number of sequences processed concurrently.

#### Troubleshooting

1. Error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."

   The VRAM reserved for the KV cache is insufficient. Consider reducing `max_model_len` or increasing `tensor_parallel_size`. Alternatively, you can reduce `max_num_batched_tokens`, although this may significantly slow down inference.

2. Error: "torch.OutOfMemoryError: CUDA out of memory."

   The VRAM reserved for activation weights is insufficient. Try setting `gpu_memory_utilization` to 0.85 or lower, but note that this may leave less VRAM available for the KV cache.

3. Error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
   The input sequence is too long. Consider using a shorter sequence or increasing `max_model_len`.

## Citation

If you find our work helpful, feel free to cite us.

```
@misc{qwen2.5-1m,
    title = {Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens},
    url = {https://qwenlm.github.io/blog/qwen2.5-1m/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{qwen2.5,
    title={Qwen2.5-1M Technical Report},
    author={An Yang and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoyan Huang and Jiandong Jiang and Jianhong Tu and Jianwei Zhang and Jingren Zhou and Junyang Lin and Kai Dang and Kexin Yang and Le Yu and Mei Li and Minmin Sun and Qin Zhu and Rui Men and Tao He and Weijia Xu and Wenbiao Yin and Wenyuan Yu and Xiafei Qiu and Xingzhang Ren and Xinlong Yang and Yong Li and Zhiying Xu and Zipeng Zhang},
    journal={arXiv preprint arXiv:2501.15383},
    year={2025}
}
```