PEIT: Bridging the Modality Gap with Pre-trained Models for End-to-End Image Translation

Shaolin Zhu, Shangjie Li, Yikun Lei, Deyi Xiong


Abstract
Image translation is a task that translates an image containing text in the source language into the target language. One major challenge with image translation is the modality gap between visual text inputs and the textual inputs/outputs of machine translation (MT). In this paper, we propose PEIT, an end-to-end image translation framework that bridges the modality gap with pre-trained models. It is composed of four essential components: a visual encoder, a shared encoder-decoder backbone network, a vision-text representation aligner equipped with the shared encoder, and a cross-modal regularizer stacked over the shared decoder. Both the aligner and the regularizer aim to reduce the modality gap. To train PEIT, we employ a two-stage pre-training strategy with an auxiliary MT task: (1) pre-training the MT model on the MT training data to initialize the shared encoder-decoder backbone network; and (2) pre-training PEIT with the aligner and regularizer on a synthesized dataset of rendered images containing text from the MT training data. In order to facilitate the evaluation of PEIT and promote research on image translation, we create ECOIT, a large-scale image translation corpus containing 480K image-translation pairs, via crowd-sourcing and manual post-editing from real-world images in the e-commerce domain. Experiments on the curated ECOIT benchmark dataset demonstrate that PEIT substantially outperforms both cascaded image translation systems (OCR+MT) and previous strong end-to-end image translation models, with fewer parameters and faster decoding speed.
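Since the abstract describes the four components and the two modality-gap losses only in prose, the following minimal PyTorch sketch shows one way they could fit together. The simple CNN visual encoder, the MSE alignment loss, the KL-based regularizer, and all dimensions here are illustrative assumptions, not the paper's exact design (see the PDF linked below for that).

```python
# Illustrative sketch of the PEIT architecture from the abstract.
# Assumptions (not from the paper): a small CNN visual encoder,
# MSE for the vision-text aligner, KL divergence for the
# cross-modal regularizer, and all hyperparameters shown.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PEITSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, nhead=8, nlayers=6):
        super().__init__()
        # Visual encoder: maps an image to a sequence of d_model vectors.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=4, stride=4),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        # Shared encoder-decoder backbone; in stage 1 of the paper's
        # strategy this would be initialized from a pre-trained MT model.
        self.backbone = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=nlayers, num_decoder_layers=nlayers,
            batch_first=True,
        )
        self.proj = nn.Linear(d_model, vocab_size)

    def encode_image(self, image):
        feats = self.visual_encoder(image)       # (B, d_model, H', W')
        return feats.flatten(2).transpose(1, 2)  # (B, H'*W', d_model)

    def forward(self, image, src_tokens, tgt_tokens):
        # The shared encoder consumes both visual and textual sequences.
        vis_enc = self.backbone.encoder(self.encode_image(image))
        txt_enc = self.backbone.encoder(self.embed(src_tokens))

        # Vision-text representation aligner: pull pooled visual and
        # textual encoder states together (MSE chosen for brevity).
        align_loss = F.mse_loss(vis_enc.mean(1), txt_enc.mean(1))

        # Decode the target from each modality with a causal mask.
        tgt_emb = self.embed(tgt_tokens)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1))
        vis_logits = self.proj(
            self.backbone.decoder(tgt_emb, vis_enc, tgt_mask=tgt_mask))
        txt_logits = self.proj(
            self.backbone.decoder(tgt_emb, txt_enc, tgt_mask=tgt_mask))

        # Cross-modal regularizer: match the image-conditioned output
        # distribution to the text-conditioned one (text branch treated
        # as the teacher here -- another assumption of this sketch).
        reg_loss = F.kl_div(
            F.log_softmax(vis_logits, dim=-1),
            F.softmax(txt_logits, dim=-1).detach(),
            reduction="batchmean",
        )
        return vis_logits, align_loss, reg_loss
```

In this sketch, stage 2 training would combine the translation cross-entropy on `vis_logits` with `align_loss` and `reg_loss`; the relative weighting of the three terms is left open, as the abstract does not specify it.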
Anthology ID:
2023.acl-long.751
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
13433–13447
URL:
https://aclanthology.org/2023.acl-long.751
DOI:
10.18653/v1/2023.acl-long.751
Bibkey:
Cite (ACL):
Shaolin Zhu, Shangjie Li, Yikun Lei, and Deyi Xiong. 2023. PEIT: Bridging the Modality Gap with Pre-trained Models for End-to-End Image Translation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13433–13447, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
PEIT: Bridging the Modality Gap with Pre-trained Models for End-to-End Image Translation (Zhu et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.751.pdf
Video:
https://aclanthology.org/2023.acl-long.751.mp4