Ahsan Wahab
2022
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer | Isaac Caswell | Lisa Wang | Ahsan Wahab | Daan van Esch | Nasanbayar Ulzii-Orshikh | Allahsera Tapo | Nishant Subramani | Artem Sokolov | Claytone Sikasote | Monang Setyawan | Supheakmungkol Sarin | Sokhar Samb | Benoît Sagot | Clara Rivera | Annette Rios | Isabel Papadimitriou | Salomey Osei | Pedro Ortiz Suarez | Iroro Orife | Kelechi Ogueji | Andre Niyongabo Rubungo | Toan Q. Nguyen | Mathias Müller | André Müller | Shamsuddeen Hassan Muhammad | Nanda Muhammad | Ayanda Mnyakeni | Jamshidbek Mirzakhalov | Tapiwanashe Matangira | Colin Leong | Nze Lawson | Sneha Kudugunta | Yacine Jernite | Mathias Jenny | Orhan Firat | Bonaventure F. P. Dossou | Sakhile Dlamini | Nisansa de Silva | Sakine Çabuk Ballı | Stella Biderman | Alessia Battisti | Ahmed Baruwa | Ankur Bapna | Pallavi Baljekar | Israel Abebe Azime | Ayodele Awokoya | Duygu Ataman | Orevaoghene Ahia | Oghenefego Ahia | Sweta Agrawal | Mofetoluwa Adeyemi
Transactions of the Association for Computational Linguistics, Volume 10
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
2021
A Large-Scale Study of Machine Translation in Turkic Languages
Jamshidbek Mirzakhalov | Anoop Babu | Duygu Ataman | Sherzod Kariev | Francis Tyers | Otabek Abduraufov | Mammad Hajili | Sardana Ivanova | Abror Khaytbaev | Antonio Laverghetta Jr. | Bekhzodbek Moydinboyev | Esra Onal | Shaxnoza Pulatova | Ahsan Wahab | Orhan Firat | Sriram Chellappan
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including: i) a large parallel corpus covering 22 Turkic languages, consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains, and iv) human evaluation scores. All models, scripts, and data will be released to the public.
Evaluating Multiway Multilingual NMT in the Turkic Languages
Jamshidbek Mirzakhalov | Anoop Babu | Aigiz Kunafin | Ahsan Wahab | Bekhzodbek Moydinboyev | Sardana Ivanova | Mokhiyakhon Uzokova | Shaxnoza Pulatova | Duygu Ataman | Julia Kreutzer | Francis Tyers | Orhan Firat | John Licato | Sriram Chellappan
Proceedings of the Sixth Conference on Machine Translation
Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people that speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and the evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines in the out-of-domain test sets, and that finetuning the model on a downstream task of a single pair also results in a huge performance boost in both low- and high-resource scenarios. Our attentive analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets, and models to the public.
Co-authors
- Duygu Ataman 3
- Orhan Firat 3
- Jamshidbek Mirzakhalov 3
- Anoop Babu 2
- Sriram Chellappan 2
- Sardana Ivanova 2
- Julia Kreutzer 2
- Bekhzodbek Moydinboyev 2
- Shaxnoza Pulatova 2
- Francis Tyers 2
- Otabek Abduraufov 1
- Mofetoluwa Adeyemi 1
- Sweta Agrawal 1
- Orevaoghene Ahia 1
- Oghenefego Ahia 1
- Ayodele Awokoya 1
- Israel Abebe Azime 1
- Pallavi Baljekar 1
- Ankur Bapna 1
- Ahmed Baruwa 1
- Alessia Battisti 1
- Stella Biderman 1
- Isaac Caswell 1
- Nisansa De Silva 1
- Sakhile Dlamini 1
- Bonaventure F. P. Dossou 1
- Mammad Hajili 1
- Mathias Jenny 1
- Yacine Jernite 1
- Sherzod Kariev 1
- Abror Khaytbaev 1
- Sneha Kudugunta 1
- Aigiz Kunafin 1
- Antonio Laverghetta Jr. 1
- Nze Lawson 1
- Colin Leong 1
- John Licato 1
- Tapiwanashe Matangira 1
- Ayanda Mnyakeni 1
- Shamsuddeen Hassan Muhammad 1
- Nanda Muhammad 1
- Mathias Müller 1
- André Müller 1
- Toan Q. Nguyen 1
- Kelechi Ogueji 1
- Esra Onal 1
- Iroro Orife 1
- Pedro Ortiz Suarez 1
- Salomey Osei 1
- Isabel Papadimitriou 1
- Annette Rios Gonzales 1
- Clara Rivera 1
- Andre Niyongabo Rubungo 1
- Benoît Sagot 1
- Sokhar Samb 1
- Supheakmungkol Sarin 1
- Monang Setyawan 1
- Claytone Sikasote 1
- Artem Sokolov 1
- Nishant Subramani 1
- Allahsera Tapo 1
- Nasanbayar Ulzii-Orshikh 1
- Mokhiyakhon Uzokova 1
- Lisa Wang 1
- Daan van Esch 1
- Sakine Çabuk Ballı 1