Cihan Xiao


2024

pdf bib
JHU IWSLT 2024 Dialectal and Low-resource System Description
Nathaniel Romney Robinson | Kaiser Sun | Cihan Xiao | Niyati Bafna | Weiting Tan | Haoran Xu | Henry Li Xinyuan | Ankur Kejriwal | Sanjeev Khudanpur | Kenton Murray | Paul McNamee
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)

Johns Hopkins University (JHU) submitted systems for all eight language pairs in the 2024 Low-Resource Language Track. The main effort of this work revolves around fine-tuning large and publicly available models in three proposed systems: i) end-to-end speech translation (ST) fine-tuning of Seamless4MT v2; ii) ST fine-tuning of Whisper; iii) a cascaded system involving automatic speech recognition with fine-tuned Whisper and machine translation with NLLB. On top of systems above, we conduct a comparative analysis on different training paradigms, such as intra-distillation for NLLB as well as joint training and curriculum learning for SeamlessM4T v2. Our results show that the best-performing approach differs by language pairs, but that i) fine-tuned SeamlessM4T v2 tends to perform best for source languages on which it was pre-trained, ii) multi-task training helps Whisper fine-tuning, iii) cascaded systems with Whisper and NLLB tend to outperform Whisper alone, and iv) intra-distillation helps NLLB fine-tuning.

2023

pdf bib
JHU IWSLT 2023 Dialect Speech Translation System Description
Amir Hussein | Cihan Xiao | Neha Verma | Thomas Thebaud | Matthew Wiesner | Sanjeev Khudanpur
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper presents JHU’s submissions to the IWSLT 2023 dialectal and low-resource track of Tunisian Arabic to English speech translation. The Tunisian dialect lacks formal orthography and abundant training data, making it challenging to develop effective speech translation (ST) systems. To address these challenges, we explore the integration of large pre-trained machine translation (MT) models, such as mBART and NLLB-200 in both end-to-end (E2E) and cascaded speech translation (ST) systems. We also improve the performance of automatic speech recognition (ASR) through the use of pseudo-labeling data augmentation and channel matching on telephone data. Finally, we combine our E2E and cascaded ST systems with Minimum Bayes-Risk decoding. Our combined system achieves a BLEU score of 21.6 and 19.1 on test2 and test3, respectively.