English-to-Japanese Multimodal Machine Translation Based on Image-Text Matching of Lecture Videos

Ayu Teramen; Takumi Ohtsuka; Risa Kondo; Tomoyuki Kajiwara; Takashi Ninomiya

doi:10.18653/v1/2024.alvr-1.7

Abstract

We work on multimodal machine translation of the audio in English lecture videos to generate Japanese subtitles. Image-guided multimodal machine translation is promising for correcting speech recognition errors and for disambiguating text. Lecture videos provide a variety of images. Images of presentation materials can supply information that is not available from the audio and may help improve translation quality, whereas images of speakers or audiences would not directly affect translation quality. We construct a multimodal parallel corpus by adding automatic speech recognition text and multiple images to a transcribed parallel corpus of lecture videos, and we propose a method that selects, from the multiple images, those most relevant to the speech text in order to improve image-guided multimodal machine translation. Experimental results on translating automatic speech recognition text or transcribed English text into Japanese show the effectiveness of our method for selecting a relevant image.
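The abstract does not say how relevance between the speech text and each candidate image is scored. As a rough illustration only, the sketch below assumes that the text and every image have already been mapped to vectors in a shared embedding space by some image-text matching model (not specified here), and it picks the image with the highest cosine similarity. The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def select_most_relevant_image(
    text_embedding: Sequence[float],
    image_embeddings: Sequence[Sequence[float]],
) -> int:
    """Return the index of the image whose embedding best matches the text."""
    if not image_embeddings:
        raise ValueError("at least one candidate image is required")
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up embeddings for illustration; a real system would obtain these
# from an image-text matching model (an assumption, not the paper's setup).
text_vec = [0.9, 0.1, 0.0]
candidates = [
    [0.0, 0.2, 0.9],  # e.g. a frame showing the speaker
    [0.8, 0.2, 0.1],  # e.g. a frame showing a slide
    [0.1, 0.9, 0.1],  # e.g. a frame showing the audience
]
best_index = select_most_relevant_image(text_vec, candidates)
```

In this toy setup the selected frame would then be passed, together with the source sentence, to the image-guided translation model; the other frames are ignored.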

Anthology ID: 2024.alvr-1.7
Volume: Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Jing Gu, Tsu-Jui (Ray) Fu, Drew Hudson, Asli Celikyilmaz, William Wang
Venues: ALVR | WS
Publisher: Association for Computational Linguistics
Pages: 86–91
URL: https://aclanthology.org/2024.alvr-1.7
DOI: 10.18653/v1/2024.alvr-1.7
Cite (ACL): Ayu Teramen, Takumi Ohtsuka, Risa Kondo, Tomoyuki Kajiwara, and Takashi Ninomiya. 2024. English-to-Japanese Multimodal Machine Translation Based on Image-Text Matching of Lecture Videos. In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 86–91, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): English-to-Japanese Multimodal Machine Translation Based on Image-Text Matching of Lecture Videos (Teramen et al., ALVR-WS 2024)
PDF: https://aclanthology.org/2024.alvr-1.7.pdf
