
---
license: Apache License 2.0
tasks:
- Large Language Models
- Text Generation
- Feature Extraction
- Sentence Similarity
---

## bge-large-zh-v1.5

bge-large-zh-v1.5 is a large Chinese embedding model developed by BAAI. Version 1.5 improves the similarity score distribution, supports input sequences of up to 512 tokens, and produces 1024-dimensional vectors. It scores an average of 64.23 on the MTEB benchmark and is well suited to text retrieval, clustering, classification, and similar tasks, generating semantic vectors efficiently for demanding semantic understanding and retrieval workloads.

## How to Use?

You can give it a quick try with the PaddlePaddle ecosystem libraries!

```python
# Install the required libraries first:
# pip install paddlenlp paddlepaddle
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.transformers import AutoModel, AutoTokenizer

# Load the BGE-Large-zh-v1.5 model
model_name = "BAAI/bge-large-zh-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Switch to evaluation mode
model.eval()

# Instruction prefix recommended for BGE v1.5 retrieval tasks.
# For the Chinese models the official instruction is the one below;
# it should be prepended to queries only, not to passages.
QUERY_INSTRUCTION = "为这个句子生成表示以用于检索相关文章:"

# Define a function that returns text embeddings
def get_embeddings(texts, max_length=512, add_instruction=False):
    """
    Compute embedding vectors for the given texts.

    Args:
        texts: a string or a list of strings
        max_length: maximum sequence length
        add_instruction: prepend the retrieval instruction (queries only)

    Returns:
        L2-normalized text embedding vectors
    """
    if isinstance(texts, str):
        texts = [texts]

    # The instruction must be added BEFORE tokenization to have any effect
    if add_instruction:
        texts = [QUERY_INSTRUCTION + text for text in texts]

    # Encode the texts
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pd"
    )

    # Compute the embeddings
    with paddle.no_grad():
        outputs = model(**inputs)

    # Use the [CLS] vector as the sentence representation
    embeddings = outputs[0][:, 0]

    # L2-normalize so that dot products equal cosine similarities
    embeddings = F.normalize(embeddings, p=2, axis=1)
    return embeddings

# Compute similarity between texts
def calculate_similarity(embeddings1, embeddings2):
    """
    Compute cosine similarity between (already normalized) embeddings.
    """
    return paddle.matmul(embeddings1, embeddings2.t()).numpy()

# Chinese test texts
query = "什么是人工智能?"
passages = [
    "人工智能是计算机科学的一个分支,它尝试理解智能的本质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。",
    "机器学习是人工智能的一个子领域,专注于让计算机系统通过经验自动改进。",
    "深度学习是机器学习的一种方法,使用多层神经网络学习数据表示。",
    "自然语言处理是人工智能的一个分支,专注于让计算机理解和生成人类语言。",
    "图片识别是计算机视觉领域的一项任务,目标是识别数字图像中的物体、人物、场景等。"
]

# Compute the embeddings (instruction on the query only)
query_embedding = get_embeddings(query, add_instruction=True)
passage_embeddings = get_embeddings(passages)

# Compute similarities
similarity_scores = calculate_similarity(query_embedding, passage_embeddings)

# Show the ranked results
print(f"Query: {query}\n")
print("Similarity ranking:")
sorted_indices = np.argsort(-similarity_scores[0])
for i, idx in enumerate(sorted_indices):
    print(f"{i+1}. score: {similarity_scores[0][idx]:.4f} - {passages[idx]}")

# English test texts
english_query = "What is artificial intelligence?"
english_passages = [
    "Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new type of intelligent machine that responds in a manner similar to human intelligence.",
    "Machine learning is a subfield of AI focused on getting computer systems to automatically improve through experience.",
    "Deep learning is a method of machine learning that uses multi-layer neural networks to learn data representations.",
    "Natural language processing is a branch of AI focused on enabling computers to understand and generate human language.",
    "Image recognition is a task in computer vision aimed at recognizing objects, people, scenes, etc. in digital images."
]

# Compute the English embeddings
english_query_embedding = get_embeddings(english_query, add_instruction=True)
english_passage_embeddings = get_embeddings(english_passages)

# Compute the English similarities
english_similarity_scores = calculate_similarity(english_query_embedding, english_passage_embeddings)

# Show the ranked English results
print(f"\nEnglish query: {english_query}\n")
print("English similarity ranking:")
english_sorted_indices = np.argsort(-english_similarity_scores[0])
for i, idx in enumerate(english_sorted_indices):
    print(f"{i+1}. score: {english_similarity_scores[0][idx]:.4f} - {english_passages[idx]}")
```
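
If you prefer the PyTorch ecosystem, the upstream FlagEmbedding library wraps the same checkpoint and handles the query instruction for you. The sketch below follows the usage shown in the FlagEmbedding README; treat the exact keyword arguments (`query_instruction_for_retrieval`, `use_fp16`) as assumptions that may vary across library versions.

```python
# A minimal alternative sketch using the FlagEmbedding library
# (pip install -U FlagEmbedding). API per the FlagEmbedding README;
# adjust if your installed version differs.
from FlagEmbedding import FlagModel

model = FlagModel(
    "BAAI/bge-large-zh-v1.5",
    # Instruction recommended for retrieval queries with the zh models
    query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
    use_fp16=True,  # faster encoding at a slight precision trade-off
)

queries = ["什么是人工智能?"]
passages = [
    "人工智能是计算机科学的一个分支。",
    "深度学习是机器学习的一种方法,使用多层神经网络学习数据表示。",
]

# encode_queries() prepends the instruction; encode() embeds passages as-is
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)

# Embeddings are normalized, so the inner product is the cosine similarity
scores = q_embeddings @ p_embeddings.T
print(scores)
```

Here `encode_queries()` applies the retrieval instruction to each query while `encode()` does not, which matches the recommended asymmetric setup for short-query-to-long-passage retrieval.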
## Citation

If you find this repository useful, please consider giving it a star ⭐ and citing:

```
@misc{bge_embedding,
  title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
  author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
  year={2023},
  eprint={2309.07597},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## License

FlagEmbedding is released under the MIT License, and the released models are free for commercial use.