Chutong Meng
2026
Speech Translation and Metrics in 2026: Findings of the IWSLT Campaign
David Ifeoluwa Adelani | Victor Agostinelli | Antonios Anastasopoulos | Luisa Bentivogli | Ondřej Bojar | Sébastien Bratières | Marine Carpuat | Fabrício Carraro | Roldano Cattoni | Mauro Cettolo | Lizhong Chen | Marcello Federico | Marco Gaido | Mahendra Gupta | HyoJung Han | Ali Hatami | Lewis C. Howe | Dávid Javorský | Yejin Jeon | Marek Kasztelnik | Antoine Laurent | Danni Liu | Nam Luu | Min Ma | Dominik Macháček | Marie Maltais | Evgeny Matusov | John McCrae | Chutong Meng | Chandresh Kumar Maurya | Mohammad Mohammadamini | Yasmin Moslem | Kenton Murray | Satoshi Nakamura | Matteo Negri | Jan Niehues | Atul Kr. Ojha | John E. Ortega | Siqi Ouyang | Sara Papi | Peter Polák | Fabian Retkowski | Stephanny Sánchez | Beatrice Savoldi | Claytone Sikasote | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Marie Tahon | Marco Turchi | Alexander Waibel | Patrick Wilken | Rodolfo Joel Zevallos | Vilem Zouhar | Maike Züfle
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
This paper reports on the outcomes of the shared tasks organized as part of the 23rd International Workshop on Spoken Language Translation (IWSLT). The workshop covered ten major challenges in spoken language translation, including speech-to-text translation for both high-resource and low-resource language pairs, customized speech translation, speech generation, instruction-following speech processing, and the evaluation of speech translation systems. The shared tasks received strong participation, with more than 30 teams submitting runs. This year’s edition broadened the range of tasks, placing particular emphasis on speech generation and evaluation metrics.
2025
GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task
Chutong Meng | Antonios Anastasopoulos
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
This paper describes the GMU systems for the IWSLT 2025 low-resource speech translation shared task. We trained systems for all language pairs, except for Levantine Arabic. We fine-tuned SeamlessM4T-v2 for automatic speech recognition (ASR), machine translation (MT), and end-to-end speech translation (E2E ST). The ASR and MT models are also used to form cascaded ST systems. Additionally, we explored various training paradigms for E2E ST fine-tuning, including direct E2E fine-tuning, multi-task training, and parameter initialization using components from fine-tuned ASR and/or MT models. Our results show that (1) direct E2E fine-tuning yields strong results; (2) initializing with a fine-tuned ASR encoder improves ST performance on languages SeamlessM4T-v2 has not been trained on; (3) multi-task training can be slightly helpful.
Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents
Chutong Meng | Philipp Koehn
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.
2024
RepCodec: A Speech Representation Codec for Speech Tokenization
Zhichao Huang | Chutong Meng | Tom Ko
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing overall performance. To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization. In contrast to audio codecs which reconstruct the raw audio, RepCodec learns a vector quantization codebook through reconstructing speech representations from speech encoders like HuBERT or data2vec. Together, the speech encoder, the codec encoder and the vector quantization codebook form a pipeline for converting speech waveforms into semantic tokens. The extensive experiments illustrate that RepCodec, by virtue of its enhanced information retention capacity, significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. Furthermore, this superiority extends across various speech encoders and languages, affirming the robustness of RepCodec. We believe our method can facilitate large language modeling research on speech processing.
Co-authors
- Antonios Anastasopoulos 2
- David Ifeoluwa Adelani 1
- Victor Agostinelli 1
- Luisa Bentivogli 1
- Ondřej Bojar 1
- Sébastien Bratières 1
- Marine Carpuat 1
- Fabrício Carraro 1
- Roldano Cattoni 1
- Mauro Cettolo 1
- Lizhong Chen 1
- Marcello Federico 1
- Marco Gaido 1
- Mahendra Gupta 1
- HyoJung Han 1
- Ali Hatami 1
- Lewis C. Howe 1
- Zhichao Huang 1
- Dávid Javorský 1
- Yejin Jeon 1
- Marek Kasztelnik 1
- Tom Ko 1
- Philipp Koehn 1
- Antoine Laurent 1
- Danni Liu 1
- Nam Luu 1
- Min Ma 1
- Dominik Macháček 1
- Marie Maltais 1
- Evgeny Matusov 1
- Chandresh Kumar Maurya 1
- John Philip McCrae 1
- Mohammad Mohammadamini 1
- Yasmin Moslem 1
- Kenton Murray 1
- Satoshi Nakamura 1
- Matteo Negri 1
- Jan Niehues 1
- Atul Kr. Ojha 1
- John E. Ortega 1
- Siqi Ouyang 1
- Sara Papi 1
- Peter Polák 1
- Fabian Retkowski 1
- Beatrice Savoldi 1
- Claytone Sikasote 1
- Matthias Sperber 1
- Sebastian Stüker 1
- Katsuhito Sudoh 1
- Stephanny Sánchez 1
- Marie Tahon 1
- Marco Turchi 1
- Alexander Waibel 1
- Patrick Wilken 1
- Rodolfo Zevallos 1
- Vilém Zouhar 1
- Maike Züfle 1