Video-Helpful Multimodal Machine Translation

Yihang Li, Shuichiro Shimizu, Chenhui Chu, Sadao Kurohashi, Wei Li


Abstract
Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles, which rarely contain linguistic ambiguity, making visual information ineffective for generating appropriate translations. Recent work constructed an ambiguous-subtitles dataset to alleviate this problem, but it is still limited in that its videos do not necessarily contribute to disambiguation. We introduce EVA (Extensive training set and Video-helpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English parallel subtitle pairs, 520k Chinese-English parallel subtitle pairs, and corresponding video clips collected from movies and TV episodes. In addition to the extensive training set, EVA contains a video-helpful evaluation set in which the subtitles are ambiguous and the videos are guaranteed to be helpful for disambiguation. Furthermore, we propose SAFA, an MMT model based on the Selective Attention model with two novel methods, frame attention loss and ambiguity augmentation, which aim to fully exploit the videos in EVA for disambiguation. Experiments on EVA show that visual information and the proposed methods boost translation performance, and that our model performs significantly better than existing MMT models.
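To make the abstract's description of SAFA more concrete, below is a minimal sketch of how a selective-attention-style fusion of subtitle text and video frame features, plus a frame-level attention loss, might look. This is an illustration under assumptions, not the paper's actual SAFA implementation: the class and function names (SelectiveAttentionFusion, frame_attention_loss), the gating formulation, and the helpful_frame_mask supervision signal are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectiveAttentionFusion(nn.Module):
    """Illustrative text-video fusion: subtitle tokens attend over video frame
    features, and a learned gate decides how much visual context to mix in.
    A generic sketch, not the exact SAFA architecture from the paper."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, text_feats, frame_feats):
        # text_feats:  (batch, src_len, d_model)  encoder states of the subtitle
        # frame_feats: (batch, n_frames, d_model) e.g. pooled frame embeddings
        visual_ctx, attn_weights = self.cross_attn(
            query=text_feats, key=frame_feats, value=frame_feats
        )
        # Gated fusion: lam in (0, 1) controls the visual contribution per token.
        lam = torch.sigmoid(self.gate(torch.cat([text_feats, visual_ctx], dim=-1)))
        fused = (1 - lam) * text_feats + lam * visual_ctx
        return fused, attn_weights  # attn_weights: (batch, src_len, n_frames)


def frame_attention_loss(attn_weights, helpful_frame_mask):
    """Hypothetical auxiliary loss: push the attention mass that each source
    token assigns to video frames toward frames marked as helpful.
    helpful_frame_mask: (batch, n_frames) binary mask (an assumed signal)."""
    target = helpful_frame_mask / helpful_frame_mask.sum(dim=-1, keepdim=True).clamp(min=1)
    # KL divergence between each token's frame-attention distribution
    # and the (uniform-over-helpful-frames) target distribution.
    return F.kl_div(
        attn_weights.clamp(min=1e-9).log(),
        target.unsqueeze(1).expand_as(attn_weights),
        reduction="batchmean",
    )
```

In a full MMT pipeline, the fused representations would feed the translation decoder, and the auxiliary loss would be added to the cross-entropy objective with a weighting hyperparameter; those details are left out here.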
Anthology ID:
2023.emnlp-main.260
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4281–4299
URL:
https://aclanthology.org/2023.emnlp-main.260
DOI:
10.18653/v1/2023.emnlp-main.260
Cite (ACL):
Yihang Li, Shuichiro Shimizu, Chenhui Chu, Sadao Kurohashi, and Wei Li. 2023. Video-Helpful Multimodal Machine Translation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4281–4299, Singapore. Association for Computational Linguistics.
Cite (Informal):
Video-Helpful Multimodal Machine Translation (Li et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.260.pdf
Video:
https://aclanthology.org/2023.emnlp-main.260.mp4