2024
pdf
bib
abs
NAIST Simultaneous Speech Translation System for IWSLT 2024
Yuka Ko
|
Ryo Fukuda
|
Yuta Nishikawa
|
Yasumasa Kano
|
Tomoya Yanagita
|
Kosuke Doi
|
Mana Makinae
|
Haotian Tan
|
Makoto Sakai
|
Sakriani Sakti
|
Katsuhito Sudoh
|
Satoshi Nakamura
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
This paper describes NAIST’s submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-German, Japanese, Chinese speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.
pdf
bib
abs
NAIST-SIC-Aligned: An Aligned English-Japanese Simultaneous Interpretation Corpus
Jinming Zhao
|
Katsuhito Sudoh
|
Satoshi Nakamura
|
Yuka Ko
|
Kosuke Doi
|
Ryo Fukuda
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
It remains a question that how simultaneous interpretation (SI) data affects simultaneous machine translation (SiMT). Research has been limited due to the lack of a large-scale training corpus. In this work, we aim to fill in the gap by introducing NAIST-SIC-Aligned, which is an automatically-aligned parallel English-Japanese SI dataset. Starting with a non-aligned corpus NAIST-SIC, we propose a two-stage alignment approach to make the corpus parallel and thus suitable for model training. The first stage is coarse alignment where we perform a many-to-many mapping between source and target sentences, and the second stage is fine-grained alignment where we perform intra- and inter-sentence filtering to improve the quality of aligned pairs. To ensure the quality of the corpus, each step has been validated either quantitatively or qualitatively. This is the first open-sourced large-scale parallel SI dataset in the literature. We also manually curated a small test set for evaluation purposes. Our results show that models trained with SI data lead to significant improvement in translation quality and latency over baselines. We hope our work advances research on SI corpora construction and SiMT. Our data will be released upon the paper’s acceptance.
2023
pdf
bib
abs
NAIST Simultaneous Speech-to-speech Translation System for IWSLT 2023
Ryo Fukuda
|
Yuta Nishikawa
|
Yasumasa Kano
|
Yuka Ko
|
Tomoya Yanagita
|
Kosuke Doi
|
Mana Makinae
|
Sakriani Sakti
|
Katsuhito Sudoh
|
Satoshi Nakamura
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This paper describes NAIST’s submission to the IWSLT 2023 Simultaneous Speech Translation task: English-to-German, Japanese, Chinese speech-to-text translation and English-to-Japanese speech-to-speech translation. Our speech-to-text system uses an end-to-end multilingual speech translation model based on large-scale pre-trained speech and text models. We add Inter-connections into the model to incorporate the outputs from intermediate layers of the pre-trained speech model and augment prefix-to-prefix text data using Bilingual Prefix Alignment to enhance the simultaneity of the offline speech translation model. Our speech-to-speech system employs an incremental text-to-speech module that consists of a Japanese pronunciation estimation model, an acoustic model, and a neural vocoder.
pdf
bib
abs
Tagged End-to-End Simultaneous Speech Translation Training Using Simultaneous Interpretation Data
Yuka Ko
|
Ryo Fukuda
|
Yuta Nishikawa
|
Yasumasa Kano
|
Katsuhito Sudoh
|
Satoshi Nakamura
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
Simultaneous speech translation (SimulST) translates partial speech inputs incrementally. Although the monotonic correspondence between input and output is preferable for smaller latency, it is not the case for distant language pairs such as English and Japanese. A prospective approach to this problem is to mimic simultaneous interpretation (SI) using SI data to train a SimulST model. However, the size of such SI data is limited, so the SI data should be used together with ordinary bilingual data whose translations are given in offline. In this paper, we propose an effective way to train a SimulST model using mixed data of SI and offline. The proposed method trains a single model using the mixed data with style tags that tell the model to generate SI- or offline-style outputs. Experiment results show improvements of BLEURT in different latency ranges, and our analyses revealed the proposed model generates SI-style outputs more than the baseline.
2022
pdf
bib
abs
NAIST Simultaneous Speech-to-Text Translation System for IWSLT 2022
Ryo Fukuda
|
Yuka Ko
|
Yasumasa Kano
|
Kosuke Doi
|
Hirotaka Tokuyama
|
Sakriani Sakti
|
Katsuhito Sudoh
|
Satoshi Nakamura
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)
This paper describes NAIST’s simultaneous speech translation systems developed for IWSLT 2022 Evaluation Campaign. We participated the speech-to-speech track for English-to-German and English-to-Japanese. Our primary submissions were end-to-end systems using adaptive segmentation policies based on Prefix Alignment.
2021
pdf
bib
abs
NAIST English-to-Japanese Simultaneous Translation System for IWSLT 2021 Simultaneous Text-to-text Task
Ryo Fukuda
|
Yui Oka
|
Yasumasa Kano
|
Yuki Yano
|
Yuka Ko
|
Hirotaka Tokuyama
|
Kosuke Doi
|
Sakriani Sakti
|
Katsuhito Sudoh
|
Satoshi Nakamura
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
This paper describes NAIST’s system for the English-to-Japanese Simultaneous Text-to-text Translation Task in IWSLT 2021 Evaluation Campaign. Our primary submission is based on wait-k neural machine translation with sequence-level knowledge distillation to encourage literal translation.
pdf
bib
abs
On Knowledge Distillation for Translating Erroneous Speech Transcriptions
Ryo Fukuda
|
Katsuhito Sudoh
|
Satoshi Nakamura
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
Recent studies argue that knowledge distillation is promising for speech translation (ST) using end-to-end models. In this work, we investigate the effect of knowledge distillation with a cascade ST using automatic speech recognition (ASR) and machine translation (MT) models. We distill knowledge from a teacher model based on human transcripts to a student model based on erroneous transcriptions. Our experimental results demonstrated that knowledge distillation is beneficial for a cascade ST. Further investigation that combined knowledge distillation and fine-tuning revealed that the combination consistently improved two language pairs: English-Italian and Spanish-English.
2020
pdf
bib
abs
NAIST’s Machine Translation Systems for IWSLT 2020 Conversational Speech Translation Task
Ryo Fukuda
|
Katsuhito Sudoh
|
Satoshi Nakamura
Proceedings of the 17th International Conference on Spoken Language Translation
This paper describes NAIST’s NMT system submitted to the IWSLT 2020 conversational speech translation task. We focus on the translation disfluent speech transcripts that include ASR errors and non-grammatical utterances. We tried a domain adaptation method by transferring the styles of out-of-domain data (United Nations Parallel Corpus) to be like in-domain data (Fisher transcripts). Our system results showed that the NMT model with domain adaptation outperformed a baseline. In addition, slight improvement by the style transfer was observed.