
---
license: Apache License 2.0
tasks:
- Large Language Models
- Text Generation
- Question Answering
- Translation
- Summarization
- Text2Text Generation
---

# OPT: Open Pre-trained Transformer Language Models

OPT was first introduced in [Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068).

**Disclaimer**: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from **this** model card has been written by the Hugging Face team.

## Intro

To quote the first two paragraphs of the official paper:

> Large language models trained on massive text collections have shown surprising emergent capabilities to generate text and perform zero- and few-shot learning. While in some cases the public can interact with these models through paid APIs, full model access is currently limited to only a few highly resourced labs. This restricted access has limited researchers’ ability to study how and why these large language models work, hindering progress on improving known challenges in areas such as robustness, bias, and toxicity.

> We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the collective research community as a whole, which is only possible when models are available for study.

## Model description

OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective.

OPT belongs to the same family of decoder-only models as GPT-3. As such, it was pretrained using the self-supervised causal language modeling objective.

For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper.
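Concretely, the CLM objective trains the model to predict each token from the tokens that precede it. As a minimal sketch (the notation here is ours, not taken from the paper), the per-sequence loss can be written as

$$
\mathcal{L}_{\mathrm{CLM}}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(x_{t} \mid x_{<t}\right)
$$

where $x_1, \dots, x_T$ is a training sequence of $T$ tokens ($T = 2048$ during pretraining, see the Preprocessing section below).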
### How to use

You can use this model directly for text generation:

```python
import paddle
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opt-2.7b"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=paddle.float32,
    from_aistudio=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Hello my friend"
inputs = tokenizer(input_text, return_tensors="pd", padding=True, truncation=True)

# Greedy (deterministic) decoding is the default strategy.
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs.get("attention_mask", None),
    max_length=50,
)

generated_ids = outputs[0].numpy().flatten()
generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print([{"generated_text": generated_text}])
```

By default, generation is deterministic (greedy search). In order to use top-p sampling, please set `do_sample` to `True` and select the sampling decode strategy:

```python
import paddle
import numpy as np
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

# Fix the random seeds so that the sampled output is reproducible.
paddle.seed(32)
np.random.seed(32)

model_name = "opt-2.7b"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=paddle.float32,
    from_aistudio=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Hello my friend"
inputs = tokenizer(input_text, return_tensors="pd", padding=True, truncation=True)

outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs.get("attention_mask", None),
    max_length=50,
    decode_strategy="sampling",  # sample from the distribution instead of greedy search
    do_sample=True,
    top_p=0.9,
)

generated_ids = outputs[0].numpy().flatten()
generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print([{"generated_text": generated_text}])
```

### Limitations and bias

As mentioned in Meta AI's model card, the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, so the model is strongly biased:

> Like other large language models for which the diversity (or lack thereof) of training data induces downstream impact on the quality of our model, OPT-175B has limitations in terms of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern large language models.

Here's an example of how the model can have biased predictions:

```python
import paddle
import numpy as np
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

paddle.seed(32)
np.random.seed(32)

model_name = "opt-2.7b"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=paddle.float32,
    from_aistudio=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "The woman worked as a"
inputs = tokenizer(input_text, return_tensors="pd", padding=True, truncation=True)

num_return_sequences = 5
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs.get("attention_mask", None),
    max_length=50,
    decode_strategy="sampling",  # explicitly select sampling as the decoding strategy
    do_sample=True,
    top_p=0.9,
    num_return_sequences=num_return_sequences,
)

results = []
for output in outputs[0]:
    generated_ids = output.numpy().flatten()
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    results.append({"generated_text": generated_text})
print(results)
```

compared to:

```python
import paddle
import numpy as np
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

paddle.seed(32)
np.random.seed(32)

model_name = "opt-2.7b"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=paddle.float32,
    from_aistudio=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "The man worked as a"
inputs = tokenizer(input_text, return_tensors="pd", padding=True, truncation=True)

num_return_sequences = 5
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs.get("attention_mask", None),
    max_length=50,
    decode_strategy="sampling",  # explicitly select sampling as the decoding strategy
    do_sample=True,
    top_p=0.9,
    num_return_sequences=num_return_sequences,
)

results = []
for output in outputs[0]:
    generated_ids = output.numpy().flatten()
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    results.append({"generated_text": generated_text})
print(results)
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The Meta AI team wanted to train this model on a corpus as large as possible.
It is composed of the union of the following 5 filtered datasets of textual documents:

- BookCorpus, which consists of more than 10K unpublished books,
- CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas,
- The Pile, from which *Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews* were included,
- the Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and processed in Roller et al. (2021),
- CCNewsV2, containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b).

The final training data contains 180B tokens, corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset’s size in the pretraining corpus.

The dataset might contain offensive content, as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety.

### Collection process

The dataset was collected from the internet and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like *Chapter One* or *This ebook by Project Gutenberg.*

## Training procedure

### Preprocessing

The texts are tokenized using the **GPT2** byte-level version of Byte Pair Encoding (BPE) (for unicode characters) with a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens.

The 175B model was trained on 992 *80GB A100 GPUs*. The training duration was roughly ~33 days of continuous training.
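As a minimal sketch of what this preprocessing means in practice (assuming the same `opt-2.7b` checkpoint and the HF-style `max_length`/`truncation` tokenizer arguments that the usage snippets above already rely on), the tokenizer can be used to inspect the vocabulary and cap inputs at the 2048-token context length:

```python
from paddlenlp.transformers import AutoTokenizer

# Assumption: same checkpoint name as in the usage examples above.
tokenizer = AutoTokenizer.from_pretrained("opt-2.7b")

# Byte-level BPE vocabulary; the card reports a vocabulary size of 50272
# (the tokenizer's own count may differ slightly once special tokens are added).
print(tokenizer.vocab_size)

# Deliberately build a text that is longer than the 2048-token context window.
text = "Hello my friend " * 1000
encoded = tokenizer(
    text,
    max_length=2048,   # OPT was pretrained on sequences of 2048 consecutive tokens
    truncation=True,   # drop everything beyond the context length
    return_tensors="pd",
)
print(encoded["input_ids"].shape)  # at most [1, 2048]
```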
### BibTeX entry and citation info

```bibtex
@misc{zhang2022opt,
  title={OPT: Open Pre-trained Transformer Language Models},
  author={Susan Zhang and Stephen Roller and Naman Goyal and Mikel Artetxe and Moya Chen and Shuohui Chen and Christopher Dewan and Mona Diab and Xian Li and Xi Victoria Lin and Todor Mihaylov and Myle Ott and Sam Shleifer and Kurt Shuster and Daniel Simig and Punit Singh Koura and Anjali Sridhar and Tianlu Wang and Luke Zettlemoyer},
  year={2022},
  eprint={2205.01068},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```