ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding
Qiming Peng | Yinxu Pan | Wenjin Wang | Bin Luo | Zhenyu Zhang | Zhengjie Huang | Yuhui Cao | Weichong Yin | Yongfeng Chen | Yin Zhang | Shikun Feng | Yu Sun | Hao Tian | Hua Wu | Haifeng Wang
Findings of the Association for Computational Linguistics: EMNLP 2022
Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. The code and models are publicly available at PaddleNLP.
Alpha at SemEval-2021 Task 6: Transformer Based Propaganda Classification
Zhida Feng | Jiji Tang | Jiaxiang Liu | Weichong Yin | Shikun Feng | Yu Sun | Li Chen
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
This paper describes our system participated in Task 6 of SemEval-2021: the task focuses on multimodal propaganda technique classification and it aims to classify given image and text into 22 classes. In this paper, we propose to use transformer based architecture to fuse the clues from both image and text. We explore two branches of techniques including fine-tuning the text pretrained transformer with extended visual features, and fine-tuning the multimodal pretrained transformers. For the visual features, we have tested both grid features based on ResNet and salient region features from pretrained object detector. Among the pretrained multimodal transformers, we choose ERNIE-ViL, a two-steam cross-attended transformers pretrained on large scale image-caption aligned data. Fine-tuing ERNIE-ViL for our task produce a better performance due to general joint multimodal representation for text and image learned by ERNIE-ViL. Besides, as the distribution of the classification labels is very unbalanced, we also make a further attempt on the loss function and the experiment result shows that focal loss would perform better than cross entropy loss. Last we have won first for subtask C in the final competition.
- Shikun Feng 2
- Yu Sun 2
- Qiming Peng 1
- Yinxu Pan 1
- Wenjin Wang 1
- show all...