Ishank Saxena
2024
Connecting Language Technologies with Rich, Diverse Data Sources Covering Thousands of Languages
Daan van Esch | Sandy Ritchie | Sebastian Ruder | Julia Kreutzer | Clara Rivera | Ishank Saxena | Isaac Caswell
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Contrary to common belief, there are rich and diverse data sources available for many thousands of languages, which can be used to develop technologies for these languages. In this paper, we provide an overview of some of the major online data sources, the types of data that they provide access to, potential applications of this data, and the number of languages that they cover. Even this covers only a small fraction of the data that exists; for example, printed books are published in many languages but few online aggregators exist.
2023
GATITOS: Using a New Multilingual Lexicon for Low-resource Machine Translation
Alexander Jones | Isaac Caswell | Orhan Firat | Ishank Saxena
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Modern machine translation models and language models are able to translate without having been trained on parallel data, greatly expanding the set of languages that they can serve. However, these models still struggle in a variety of predictable ways, a problem that cannot be overcome without at least some trusted bilingual data. This work expands on a cheap and abundant resource to combat this problem: bilingual lexica. We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements, and can be combined for even greater improvements; (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones, especially with larger models; and (4) we compare the efficacy of multilingual lexicon data versus human-translated parallel data. Based on results from (3), we develop and open-source GATITOS, a high-quality, curated dataset in 168 tail languages, one of the first human-translated resources to cover many of these languages.
Co-authors
- Isaac Caswell 2
- Orhan Firat 1
- Alexander Jones 1
- Julia Kreutzer 1
- Sandy Ritchie 1