Tom Ko


pdf bib
DUB: Discrete Unit Back-translation for Speech Translation
Dong Zhang | Rong Ye | Tom Ko | Mingxuan Wang | Yaqian Zhou
Findings of the Association for Computational Linguistics: ACL 2023

How can speech-to-text translation (ST) perform as well as machine translation (MT)? The key point is to bridge the modality gap between speech and text so that useful MT techniques can be applied to ST.Recently, the approach of representing speech with unsupervised discrete units yields a new way to ease the modality problem. This motivates us to propose Discrete Unit Back-translation(DUB) to answer two questions (1) Is it better to represent speech with discrete units than with continuous features in direct ST? (2) How much benefit can useful MT techniques bring to ST? With DUB, the back-translation technique can successfully be applied on direct ST and obtains an average boost of 5.5 BLEU on MuST-C En-De/Fr/Es. In the low-resource language scenario, our method achieves comparable performance to existing methods that rely on large-scale external data. Code and models are available at

pdf bib
Milind Agarwal | Sweta Agrawal | Antonios Anastasopoulos | Luisa Bentivogli | Ondřej Bojar | Claudia Borg | Marine Carpuat | Roldano Cattoni | Mauro Cettolo | Mingda Chen | William Chen | Khalid Choukri | Alexandra Chronopoulou | Anna Currey | Thierry Declerck | Qianqian Dong | Kevin Duh | Yannick Estève | Marcello Federico | Souhir Gahbiche | Barry Haddow | Benjamin Hsu | Phu Mon Htut | Hirofumi Inaguma | Dávid Javorský | John Judge | Yasumasa Kano | Tom Ko | Rishu Kumar | Pengwei Li | Xutai Ma | Prashant Mathur | Evgeny Matusov | Paul McNamee | John P. McCrae | Kenton Murray | Maria Nadejde | Satoshi Nakamura | Matteo Negri | Ha Nguyen | Jan Niehues | Xing Niu | Atul Kr. Ojha | John E. Ortega | Proyag Pal | Juan Pino | Lonneke van der Plas | Peter Polák | Elijah Rippeth | Elizabeth Salesky | Jiatong Shi | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Yun Tang | Brian Thompson | Kevin Tran | Marco Turchi | Alex Waibel | Mingxuan Wang | Shinji Watanabe | Rodolfo Zevallos
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper reports on the shared tasks organized by the 20th IWSLT Conference. The shared tasks address 9 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, multilingual, dialect and low-resource speech translation, and formality control. The shared tasks attracted a total of 38 submissions by 31 teams. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

pdf bib
Learning Retrieval Augmentation for Personalized Dialogue Generation
Qiushi Huang | Shuai Fu | Xubo Liu | Wenwu Wang | Tom Ko | Yu Zhang | Lilian Tang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles, a prevalent setting in current personalized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the persona about the agent, posing a challenge to generate truly personalized dialogues. To handle this problem, we propose Learning Retrieval Augmentation for Personalized DialOgue Generation (LAPDOG), which studies the potential of leveraging external knowledge for persona dialogue generation. Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator. The story retriever uses a given persona profile as queries to retrieve relevant information from the story document, which serves as a supplementary context to augment the persona profile. The dialogue generator utilizes both the dialogue history and the augmented persona profile to generate personalized responses. For optimization, we adopt a joint training framework that collaboratively learns the story retriever and dialogue generator, where the story retriever is optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for the dialogue generator to generate personalized responses. Experiments conducted on the CONVAI2 dataset with ROCStory as a supplementary data source show that the proposed LAPDOG method substantially outperforms the baselines, indicating the effectiveness of the proposed method. The LAPDOG model code is publicly available for further exploration.

pdf bib
CTC-based Non-autoregressive Speech Translation
Chen Xu | Xiaoqian Liu | Xiaowen Liu | Qingxuan Sun | Yuhao Zhang | Murun Yang | Qianqian Dong | Tom Ko | Mingxuan Wang | Tong Xiao | Anxiang Ma | Jingbo Zhu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Combining end-to-end speech translation (ST) and non-autoregressive (NAR) generation is promising in language and speech processing for their advantages of less error propagation and low latency. In this paper, we investigate the potential of connectionist temporal classification (CTC) for non-autoregressive speech translation (NAST).In particular, we develop a model consisting of two encoders that are guided by CTC to predict the source and target texts, respectively. Introducing CTC into NAST on both language sides has obvious challenges: 1) the conditional independent generation somewhat breaks the interdependency among tokens, and 2) the monotonic alignment assumption in standard CTC does not hold in translation tasks. In response, we develop a prediction-aware encoding approach and a cross-layer attention approach to address these issues. We also use curriculum learning to improve convergence of training. Experiments on the MuST-C ST benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×, which is comparable to the autoregressive counterpart and even outperforms the previous best result of 0.9 BLEU points.

pdf bib
MOSPC: MOS Prediction Based on Pairwise Comparison
Kexin Wang | Yunlong Zhao | Qianqian Dong | Tom Ko | Mingxuan Wang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

As a subjective metric to evaluate the quality of synthesized speech, Mean opinion score(MOS) usually requires multiple annotators to score the same speech. Such an annotation approach requires a lot of manpower and is also time-consuming. MOS prediction model for automatic evaluation can significantly reduce labor cost. In previous works, it is difficult to accurately rank the quality of speech when the MOS scores are close. However, in practical applications, it is more important to correctly rank the quality of synthesis systems or sentences than simply predicting MOS scores. Meanwhile, as each annotator scores multiple audios during annotation, the score is probably a relative value based on the first or the first few speech scores given by the annotator. Motivated by the above two points, we propose a general framework for MOS prediction based on pair comparison (MOSPC), and we utilize C-Mixup algorithm to enhance the generalization performance of MOSPC.The experiments on BVCC and VCC2018 show that our framework outperforms the baselines on most of the correlation coefficient metrics, especially on the metric KTAU related to quality ranking. And our framework also surpasses the strong baseline in ranking accuracy on each fine-grained segment. These results indicate that our framework contributes to improving the ranking accuracy of speech quality.


pdf bib
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Junyi Ao | Rui Wang | Long Zhou | Chengyi Wang | Shuo Ren | Yu Wu | Shujie Liu | Tom Ko | Qing Li | Yu Zhang | Zhihua Wei | Yao Qian | Jinyu Li | Furu Wei
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.