Shangbang Long
2023
FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction
Chen-Yu Lee
|
Chun-Liang Li
|
Hao Zhang
|
Timothy Dozat
|
Vincent Perot
|
Guolong Su
|
Xiang Zhang
|
Kihyuk Sohn
|
Nikolay Glushnev
|
Renshen Wang
|
Joshua Ainslie
|
Shangbang Long
|
Siyang Qin
|
Yasuhisa Fujii
|
Nan Hua
|
Tomas Pfister
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend the mask language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE and Payment benchmarks with a more compact model size.
Search
Fix data
Co-authors
- Joshua Ainslie 1
- Timothy Dozat 1
- Yasuhisa Fujii 1
- Nikolay Glushnev 1
- Nan Hua 1
- show all...
Venues
- acl1