2024
Can Synthetic Speech Improve End-to-End Conversational Speech Translation?
Bismarck Bamfo Odoom | Nathaniel Robinson | Elijah Rippeth | Luis Tavarez-Arce | Kenton Murray | Matthew Wiesner | Paul McNamee | Philipp Koehn
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Conversational speech translation is an important technology that fosters communication among people of different language backgrounds. Three-way parallel data in the form of source speech, source transcript, and target translation is usually required to train end-to-end systems. However, such datasets are not readily available and are expensive to create as this involves multiple annotation stages. In this paper, we investigate the use of synthetic data from generative models, namely machine translation and text-to-speech synthesis, for training conversational speech translation systems. We show that adding synthetic data to the training recipe increasingly improves end-to-end training performance, especially when limited real data is available. However, when no real data is available, no amount of synthetic data helps.
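The data-generation recipe the abstract describes can be illustrated with a short sketch. This is a hypothetical stand-in for the paper's pipeline: the `mt` and `tts` callables below are toy placeholders, not the actual generative models used in the work.

```python
def synthesize_triples(transcripts, mt, tts):
    """Sketch of the synthetic-data recipe: for each source-language
    transcript, generate a target translation with an MT model and
    synthetic source speech with a TTS model, yielding the three-way
    parallel (speech, transcript, translation) triples that end-to-end
    training requires."""
    triples = []
    for text in transcripts:
        translation = mt(text)   # source text -> target-language text
        speech = tts(text)       # source text -> synthetic waveform
        triples.append((speech, text, translation))
    return triples

# Toy stand-ins for the real models, just to show the data flow.
toy_mt = lambda s: s.upper()          # pretend "translation"
toy_tts = lambda s: [0.0] * len(s)    # pretend waveform samples
data = synthesize_triples(["hola amigo"], toy_mt, toy_tts)
print(data[0][2])  # HOLA AMIGO
```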
Speech Data from Radio Broadcasts for Low Resource Languages
Bismarck Bamfo Odoom | Paola Leibny Garcia | Prangthip Hansanti | Loïc Barrault | Christophe Ropers | Matthew Wiesner | Kenton Murray | Alex Mourachko | Philipp Koehn
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
We created a collection of speech data for 48 low resource languages. The corpus is extracted from radio broadcasts and processed with novel speech detection and language identification models based on a manually vetted subset of the audio for 10 languages. The data is made publicly available.
Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages
Nathaniel R. Robinson | Raj Dabre | Ammon Shurtz | Rasul Dent | Onenamiyi Onesi | Claire Bizon Monroc | Loïc Grobol | Hasan Muhammad | Ashi Garg | Naome A. Etori | Vijay Murari Tiyyala | Olanrewaju Samuel | Matthew Dean Stutzman | Bismarck Bamfo Odoom | Sanjeev Khudanpur | Stephen D. Richardson | Kenton Murray
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
A majority of language technologies are tailored for a small number of high-resource languages, while many low-resource languages are neglected. One such group, Creole languages, has long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations (11.6M of which we release publicly), and the largest bitexts gathered to date for 41 languages (the first ever for 21). In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity than ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 23 of 34 translation directions.
Where are you from? Geolocating Speech and Applications to Language Identification
Patrick Foley | Matthew Wiesner | Bismarck Bamfo Odoom | Leibny Paola Garcia | Kenton Murray | Philipp Koehn
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
We train models to answer the question, "Where are you from?" and show how such models can be repurposed for language identification (LID). To our knowledge, this paper is the first to introduce data sources, methods and models to tackle the task of geolocation of speech at a global scale, and the first to explore using geolocation as a proxy-task for LID. Specifically, we explore whether radio broadcasts with known origin can be used to train regression and classification-based models for geolocating speech. We build models on top of self-supervised pretrained models, using attention pooling to qualitatively verify that the model geolocates the speech itself, and not other channel artifacts. The best geolocation models localize speaker origin to around 650km. We confirm the value of speech geolocation as a proxy task by using speech geolocation models for zero-shot LID. Finally, we show that fine-tuning geolocation models for LID outperforms fine-tuning pretrained Wav2Vec2.0 models, and achieves state-of-the-art performance on the FLEURS benchmark.
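The attention pooling the abstract mentions (collapsing frame-level self-supervised features into one utterance embedding) can be sketched as below. This is a minimal numpy illustration assuming a single learned scoring vector `w`; the authors' actual pooling layer and feature extractor are not specified in the abstract.

```python
import numpy as np

def attention_pool(frames, w):
    """Collapse a (T, D) sequence of frame embeddings into one (D,)
    utterance embedding. `w` is a (D,) scoring vector standing in for
    a learned attention layer: each frame gets a relevance score, the
    scores are softmax-normalized over time, and the output is the
    resulting weighted average of the frames."""
    scores = frames @ w                             # (T,) score per frame
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over time
    return alpha @ frames                           # weighted mean, (D,)

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))   # e.g. 50 frames of 8-dim SSL features
w = rng.normal(size=8)
utt = attention_pool(frames, w)
print(utt.shape)  # (8,)
```

Because the softmax weights are non-negative and sum to one, the pooled vector always lies inside the per-dimension range of the input frames, and the weights themselves indicate which frames the model attends to, which is what enables the qualitative verification described in the abstract.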
2023
JHU IWSLT 2023 Multilingual Speech Translation System Description
Henry Li Xinyuan | Neha Verma | Bismarck Bamfo Odoom | Ujvala Pradeep | Matthew Wiesner | Sanjeev Khudanpur
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
We describe the Johns Hopkins ACL 60-60 Speech Translation systems submitted to the IWSLT 2023 Multilingual track, where we were tasked to translate ACL presentations from English into 10 languages. We developed cascaded speech translation systems for both the constrained and unconstrained subtracks. Our systems make use of pre-trained models as well as domain-specific corpora for this highly technical evaluation-only task. We find that the specific technical domain which ACL presentations fall into presents a unique challenge for both ASR and MT, and we present an error analysis and an ACL-specific corpus we produced to enable further work in this area.