3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding

Yihao Ding; Lorenzo Vaiani; Caren Han; Jean Lee; Paolo Garza; Josiah Poon; Luca Cagliero

doi:10.18653/v1/2024.findings-acl.903

Abstract

This paper presents a multimodal, multi-task, multi-teacher joint-grained knowledge distillation model for visually-rich form document understanding. The model is designed to leverage insights from both fine-grained and coarse-grained levels by facilitating a nuanced correlation between token and entity representations, addressing the complexities inherent in form documents. Additionally, we introduce new inter-grained and cross-grained loss functions to further refine the diverse multi-teacher knowledge distillation transfer process, addressing distribution gaps and promoting a harmonised understanding of form documents. Through a comprehensive evaluation across publicly available form document understanding datasets, our proposed model consistently outperforms existing baselines, showcasing its efficacy in handling the intricate structures and content of visually complex form documents.

Anthology ID: 2024.findings-acl.903
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 15233–15244
URL: https://aclanthology.org/2024.findings-acl.903/
DOI: 10.18653/v1/2024.findings-acl.903
Cite (ACL): Yihao Ding, Lorenzo Vaiani, Caren Han, Jean Lee, Paolo Garza, Josiah Poon, and Luca Cagliero. 2024. 3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding. In Findings of the Association for Computational Linguistics: ACL 2024, pages 15233–15244, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): 3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding (Ding et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.903.pdf
