2024
pdf
bib
abs
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu
|
Haiyang Xu
|
Jiabo Ye
|
Ming Yan
|
Liang Zhang
|
Bo Zhang
|
Ji Zhang
|
Qin Jin
|
Fei Huang
|
Jingren Zhou
Findings of the Association for Computational Linguistics: EMNLP 2024
Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose Unified Structure Learning to boost the performance of MLLMs. Based on publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks. All codes, models, and datasets are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5.
pdf
bib
abs
MNER-MI: A Multi-image Dataset for Multimodal Named Entity Recognition in Social Media
Shizhou Huang
|
Bo Xu
|
Changqun Li
|
Jiabo Ye
|
Xin Lin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Recently, multimodal named entity recognition (MNER) has emerged as a vital research area within named entity recognition. However, current MNER datasets and methods are predominantly based on text and a single accompanying image, leaving a significant research gap in MNER scenarios involving multiple images. To address the critical research gap and enhance the scope of MNER for real-world applications, we propose a novel human-annotated MNER dataset with multiple images called MNER-MI. Additionally, we construct a dataset named MNER-MI-Plus, derived from MNER-MI, to ensure its generality and applicability. Based on these datasets, we establish a comprehensive set of strong and representative baselines and we further propose a simple temporal prompt model with multiple images to address the new challenges in multi-image scenarios. We have conducted extensive experiments to demonstrate that considering multiple images provides a significant improvement over a single image and can offer substantial benefits for MNER. Furthermore, our proposed method achieves state-of-the-art results on both MNER-MI and MNER-MI-Plus, demonstrating its effectiveness. The datasets and source code can be found at https://github.com/JinFish/MNER-MI.
2023
pdf
bib
abs
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model
Jiabo Ye
|
Anwen Hu
|
Haiyang Xu
|
Qinghao Ye
|
Ming Yan
|
Guohai Xu
|
Chenliang Li
|
Junfeng Tian
|
Qi Qian
|
Ji Zhang
|
Qin Jin
|
Liang He
|
Xin Lin
|
Fei Huang
Findings of the Association for Computational Linguistics: EMNLP 2023
Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we only finetuned 1.2% parameters and the training cost is much lower than previous work following domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of Visually-situated Language Understanding tasks via a unified instruction format. To enhance the visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key points generation tasks. We design a shape-adaptive cropping module before the encoder-decoder architecture of MLLM to leverage the frozen low-resolution vision encoder for processing high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art ocr-free performance in 8 out of 10 visually-situated language understanding tasks, across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Codes and instruction-tuning datasets will be released.
2022
pdf
bib
abs
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li
|
Haiyang Xu
|
Junfeng Tian
|
Wei Wang
|
Ming Yan
|
Bin Bi
|
Jiabo Ye
|
He Chen
|
Guohai Xu
|
Zheng Cao
|
Ji Zhang
|
Songfang Huang
|
Fei Huang
|
Jingren Zhou
|
Luo Si
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Large-scale pre-trained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from inefficiency and linguistic signal overwhelmed by long visual sequences in cross-modal alignment. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections.mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability on vision-language and video-language tasks. The code and pre-trained models are available at https://github.com/alibaba/AliceMind