megemini/PaddleOCR-VL-Receipt - ERNIE Large Models - Multimodal Models - Image-to-Text - PaddlePaddle AI Studio Galaxy Community

---
license: Apache License 2.0
language:
- Multilingual
tasks:
- ERNIE Large Models
- Multimodal Models
- Image-to-Text
training_framework: ERNIEKit
inference_framework: Safetensors
base_model:
- PaddlePaddle/PaddleOCR-VL
model_lineage: finetune
---

## PaddleOCR-VL-Receipt

This model is fine-tuned from PaddleOCR-VL on the [WildReceipt](https://download.openmmlab.com/mmocr/data/wildreceipt.tar) dataset so that its output changes with the prompt, enabling it to extract information from receipts and documents.

### Model Inference

Run inference with the fine-tuned model using [PaddleOCR-VL-REC](https://github.com/megemini/PaddleOCR-VL-REC), and change the `prompt` parameter to control the model's output.

| Input Image | Complete Information Extraction | Specific Information Extraction |
|---------|---------|---------|
| ![Receipt](https://ai-studio-static-online.cdn.bcebos.com/5c913163651044d6baecc3aeaca82b72a9c576f9cc864473971a66a95f608903) | ![Full Recognition](https://ai-studio-static-online.cdn.bcebos.com/f22ff349f7db461398320c84988f7f4c80819f715499499ca5375cc1ef5fbb5a) | ![Partial Recognition](https://ai-studio-static-online.cdn.bcebos.com/f741bc0f07e045a0b8892258eb9956a18a6077696e564146b345cdb550f609e7) |

## Model Description

### Basic Model Information

PaddleOCR-VL is a Vision-Language Model (VLM) designed specifically for document understanding. Through prompts it can perform a range of document understanding tasks, including text recognition (OCR), table recognition, formula recognition, and chart recognition.
This project extends its application scenarios to structured information extraction tasks.

### Model Architecture and Training

- **Base Model**: PaddleOCR-VL-0.9B
- **Fine-tuning Method**: Supervised Fine-Tuning (SFT) using the ERNIE framework
- **Core Innovation**: Extends model functionality through custom prompt templates, so that the model outputs structured JSON data according to the fields requested

### Applicable Scenarios

- Complete document information extraction
- Partial document information extraction

> This model is optimized specifically for [WildReceipt](https://download.openmmlab.com/mmocr/data/wildreceipt.tar). Its generalization to other types of documents requires further verification.

## Expected Model Usage and Applicable Scope

### Core Features

The fine-tuned PaddleOCR-VL model supports two main usage methods:

1. **Complete Information Extraction**: Given the prompt `OCR:{}`, the model outputs all structured information in the document (JSON format).
2. **Specific Field Extraction**: Given a prompt that names specific fields, the model outputs the values of those fields (JSON format).

### Applicable Scope

- String fields: `OCR:{"field_name":""}`
- Dictionary/object fields: `OCR:{"field_name":{}}`
- List fields: `OCR:{"field_name":[]}`

### Fine-tuning Usage

To fine-tune the model on your own data, refer to:

- [PaddleOCR-VL-0.9B SFT](https://gitee.com/PaddlePaddle/ERNIE/blob/release/v1.4/docs/paddleocr_vl_sft_zh.md)
- [Fine-tuning PaddleOCR-VL in a New Way -- Prompt and Information Extraction](https://aistudio.baidu.com/projectdetail/9857242)

## Model Limitations and Potential Biases

### Known Limitations

1. **Data Volume Limitations**: The current fine-tuning uses a relatively small amount of data, so there is still room to improve model performance.
2. **Document Type Limitations**: The model is primarily optimized for receipts; its generalization to other document types requires verification.
3. **Information Completeness**: The model may perform poorly on complex tables or multi-column layouts.

### Potential Biases

1. **Recognition Accuracy**: Recognition results depend on the quality of the input image.
2. **Field Format**: The model may fail to parse field values that are in a non-standard format.
3. **Multi-page Documents**: The model is currently optimized primarily for single-page documents.

### Usage Recommendations

- Ensure input images are clear and complete.
- Keep prompt templates consistent with the format of the training data.
- For new document types, targeted fine-tuning is recommended.
- Validate and test comprehensively before production use.

## Related Resources

### References

- [PaddleOCR-VL-0.9B SFT](https://gitee.com/PaddlePaddle/ERNIE/blob/release/v1.4/docs/paddleocr_vl_sft_zh.md)
- [Fine-tuning PaddleOCR-VL in a New Way -- Prompt and Information Extraction](https://aistudio.baidu.com/projectdetail/9857242)

### Related Tools

- ERNIE Framework: https://github.com/PaddlePaddle/ERNIE
- PaddleOCR-VL Model: https://github.com/PaddlePaddle/PaddleOCR
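
### Example: Building Prompts

The prompt formats listed under Applicable Scope can be generated programmatically rather than typed by hand. The following is a minimal sketch, not part of the model's own tooling: `build_prompt` and `parse_output` are hypothetical helper names, the field names `store_name` and `items` are illustrative only, and the assumption that the prompt body is compact JSON is inferred solely from the documented examples. Sending the prompt and image to the model is deliberately left out; see PaddleOCR-VL-REC for that step.

```python
import json


def build_prompt(fields=None):
    """Build an `OCR:{...}` extraction prompt (hypothetical helper).

    fields maps each field name to an empty placeholder of the expected
    type: "" for a string, {} for an object, [] for a list.
    Passing None (or an empty dict) requests complete extraction.
    """
    template = fields or {}
    # Compact separators match the spacing of the documented examples,
    # e.g. OCR:{"field_name":""}. ensure_ascii=False keeps non-ASCII
    # field names readable instead of escaping them.
    body = json.dumps(template, separators=(",", ":"), ensure_ascii=False)
    return "OCR:" + body


def parse_output(text):
    """Parse a model reply as JSON; return None if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Complete information extraction.
print(build_prompt())  # OCR:{}

# Specific field extraction with illustrative (assumed) field names.
print(build_prompt({"store_name": "", "items": []}))
# OCR:{"store_name":"","items":[]}
```

Returning `None` from `parse_output` instead of raising keeps the caller in control of what to do with a malformed reply, which matters because the model is not guaranteed to emit valid JSON for every input.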