Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning

Hao Wang, Xiahua Chen, Rui Wang, Chenhui Chu


Abstract
Extracting meaningful entities belonging to predefined categories from Visually-rich Form-like Documents (VFDs) is a challenging task. Visual and layout features such as font, background, color, and bounding box location and size provide important cues for identifying entities of the same type. However, existing models commonly train a visual encoder with weak cross-modal supervision signals, resulting in a limited capacity to capture these non-textual features and suboptimal performance. In this paper, we propose a novel Visually-Asymmetric coNsistenCy Learning (VANCL) approach that addresses the above limitation by enhancing the model’s ability to capture fine-grained visual and layout features through the incorporation of color priors. Experimental results on benchmark datasets show that our approach substantially outperforms the strong LayoutLM series baseline, demonstrating the effectiveness of our approach. Additionally, we investigate the effects of different color schemes on our approach, providing insights for optimizing model performance. We believe our work will inspire future research on multimodal information extraction.
Anthology ID:
2023.emnlp-main.973
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
15718–15731
URL:
https://aclanthology.org/2023.emnlp-main.973
DOI:
10.18653/v1/2023.emnlp-main.973
Cite (ACL):
Hao Wang, Xiahua Chen, Rui Wang, and Chenhui Chu. 2023. Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15718–15731, Singapore. Association for Computational Linguistics.
Cite (Informal):
Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning (Wang et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.973.pdf
Video:
https://aclanthology.org/2023.emnlp-main.973.mp4