NTTSU at WMT2024 General Translation Task

Minato Kondo, Ryo Fukuda, Xiaotian Wang, Katsuki Chousa, Masato Nishimura, Kosei Buma, Takatomo Kano, Takehito Utsuro


Abstract
The NTTSU team’s submission leverages several large language models developed through a training procedure that includes continual pre-training and supervised fine-tuning. For paragraph-level translation, we generated synthetic paragraph-aligned data and used it for training. For the Japanese-to-Chinese translation task, we focused in particular on speech-domain translation. Specifically, we built Whisper models for Japanese automatic speech recognition (ASR), training them on the YODAS dataset. Since this dataset contains many noisy pairs, we combined the Whisper outputs using ROVER to polish the transcriptions. Furthermore, to make the translation model more robust to transcription errors, we performed data augmentation by forward translation from audio, using both the ASR and base translation models. To select the best translation from the models’ multiple hypotheses, we applied Minimum Bayes Risk (MBR) decoding plus reranking, incorporating scores such as COMET-QE, COMET, and LaBSE cosine similarity.
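
The following is a minimal, hypothetical sketch of MBR decoding followed by reranking, not the authors' implementation: the utility function is a simple token-overlap stand-in for the COMET utility the paper actually uses, and the `qe_scores` values stand in for external scores such as COMET-QE or LaBSE cosine similarity. All function names and the weighting scheme are illustrative assumptions.

```python
def utility(hyp: str, pseudo_ref: str) -> float:
    """Placeholder utility: token-overlap F1, standing in for COMET."""
    h, r = set(hyp.split()), set(pseudo_ref.split())
    if not h or not r:
        return 0.0
    overlap = len(h & r)
    prec, rec = overlap / len(h), overlap / len(r)
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0


def mbr_scores(hypotheses):
    """Expected utility of each hypothesis, using the others as pseudo-references."""
    scores = []
    for h in hypotheses:
        refs = [r for r in hypotheses if r is not h]
        scores.append(sum(utility(h, r) for r in refs) / max(len(refs), 1))
    return scores


def mbr_rerank(hypotheses, qe_scores, weight=0.5):
    """Combine MBR expected utility with an external score (e.g. a QE score)
    and return the highest-scoring hypothesis."""
    mbr = mbr_scores(hypotheses)
    combined = [weight * m + (1 - weight) * q for m, q in zip(mbr, qe_scores)]
    return hypotheses[max(range(len(hypotheses)), key=lambda i: combined[i])]


if __name__ == "__main__":
    hyps = [
        "the cat sat on the mat",
        "a cat sat on a mat",
        "the cat is sitting on the mat",
    ]
    qe = [0.82, 0.78, 0.85]  # hypothetical reference-free quality scores
    print(mbr_rerank(hyps, qe))
```

In this sketch, each hypothesis is scored by its average utility against the other hypotheses (the MBR term) and that score is linearly combined with a reference-free quality score before selecting the argmax; in the paper this combination additionally draws on COMET and LaBSE similarity.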
Anthology ID:
2024.wmt-1.20
Volume:
Proceedings of the Ninth Conference on Machine Translation
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
270–279
URL:
https://aclanthology.org/2024.wmt-1.20
Cite (ACL):
Minato Kondo, Ryo Fukuda, Xiaotian Wang, Katsuki Chousa, Masato Nishimura, Kosei Buma, Takatomo Kano, and Takehito Utsuro. 2024. NTTSU at WMT2024 General Translation Task. In Proceedings of the Ninth Conference on Machine Translation, pages 270–279, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
NTTSU at WMT2024 General Translation Task (Kondo et al., WMT 2024)
PDF:
https://aclanthology.org/2024.wmt-1.20.pdf