LayoutDIT: Layout-Aware End-to-End Document Image Translation with Multi-Step Conductive Decoder

Zhiyang Zhang, Yaping Zhang, Yupu Liang, Lu Xiang, Yang Zhao, Yu Zhou, Chengqing Zong


Abstract
Document image translation (DIT) aims to translate text embedded in images from one language to another. It is a challenging task that requires understanding visual layout and text semantics simultaneously. However, existing methods struggle to capture the crucial visual layout in complex real-world document images. In this work, we make the first attempt to incorporate layout knowledge into DIT in an end-to-end way. Specifically, we propose LayoutDIT, a novel layout-aware end-to-end document image translation model with a multi-step conductive decoder. A layout-aware encoder is first introduced to model visual layout relations with raw OCR results. Then a novel multi-step conductive decoder unifies three step-decoders through hidden-state conduction to translate the document step by step. Benefiting from layout-aware end-to-end joint training, LayoutDIT outperforms state-of-the-art methods with better parameter efficiency. In addition, we create a new multi-domain document image translation dataset to validate the model’s generalization. Extensive experiments show that LayoutDIT generalizes well to diverse and complex layout scenes.
Anthology ID:
2023.findings-emnlp.673
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10043–10053
URL:
https://aclanthology.org/2023.findings-emnlp.673
DOI:
10.18653/v1/2023.findings-emnlp.673
Cite (ACL):
Zhiyang Zhang, Yaping Zhang, Yupu Liang, Lu Xiang, Yang Zhao, Yu Zhou, and Chengqing Zong. 2023. LayoutDIT: Layout-Aware End-to-End Document Image Translation with Multi-Step Conductive Decoder. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10043–10053, Singapore. Association for Computational Linguistics.
Cite (Informal):
LayoutDIT: Layout-Aware End-to-End Document Image Translation with Multi-Step Conductive Decoder (Zhang et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.673.pdf