Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair

Yusuke Sakai, Mana Makinae, Hidetaka Kamigaito, Taro Watanabe


Abstract
In Simultaneous Machine Translation (SiMT), training with a simultaneous interpretation (SI) corpus is an effective way to achieve high quality at low latency. However, constructing such a corpus is challenging due to its high cost and the limits of annotator capacity; as a result, existing SI corpora are scarce. We therefore propose a method that uses Large Language Models to convert existing speech translation (ST) corpora into interpretation-style corpora that follow the source word order while preserving the entire source content (LLM-SI-Corpus). We demonstrate that fine-tuning SiMT models on the LLM-SI-Corpus reduces latency while achieving better quality than models fine-tuned on other corpora, in both speech-to-text and text-to-text settings. The LLM-SI-Corpus is available at https://github.com/yusuke1997/LLM-SI-Corpus.
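To make the conversion idea concrete, below is a minimal sketch of how one might prompt an LLM to rewrite one ST sentence pair into SI style, i.e., a target rendering that tracks the source word order while keeping all source content. The prompt wording, model name, and function names here are illustrative assumptions, not the authors' exact pipeline; see the paper and repository for the actual setup.

```python
# Sketch: convert one speech-translation (ST) pair into a simultaneous-
# interpretation (SI) style pair with an LLM. Assumes the OpenAI Python SDK
# and an OPENAI_API_KEY in the environment; model and prompt are hypothetical.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a professional English-Japanese simultaneous interpreter. "
    "Rewrite the given translation so that it follows the English source "
    "word order as closely as possible (monotonic translation) while "
    "preserving all of the source content."
)

def to_si_style(source_en: str, target_ja: str, model: str = "gpt-4o") -> str:
    """Ask the LLM to rewrite an offline translation in SI style."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Source (English): {source_en}\n"
                    f"Translation (Japanese): {target_ja}\n"
                    "SI-style translation (Japanese):"
                ),
            },
        ],
        temperature=0.0,  # deterministic output for corpus construction
    )
    return response.choices[0].message.content.strip()
```

Running such a function over every pair in an ST corpus would yield an interpretation-style parallel corpus suitable for fine-tuning SiMT models, as described in the abstract.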
Anthology ID:
2024.emnlp-main.1248
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
22375–22398
URL:
https://aclanthology.org/2024.emnlp-main.1248
Cite (ACL):
Yusuke Sakai, Mana Makinae, Hidetaka Kamigaito, and Taro Watanabe. 2024. Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22375–22398, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair (Sakai et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1248.pdf