Claytone Sikasote
2025
Findings of the IWSLT 2025 Evaluation Campaign
Idris Abdulmumin | Victor Agostinelli | Tanel Alumäe | Antonios Anastasopoulos | Luisa Bentivogli | Ondřej Bojar | Claudia Borg | Fethi Bougares | Roldano Cattoni | Mauro Cettolo | Lizhong Chen | William Chen | Raj Dabre | Yannick Estève | Marcello Federico | Mark Fishel | Marco Gaido | Dávid Javorský | Marek Kasztelnik | Fortuné Kponou | Mateusz Krubiński | Tsz Kin Lam | Danni Liu | Evgeny Matusov | Chandresh Kumar Maurya | John P. McCrae | Salima Mdhaffar | Yasmin Moslem | Kenton Murray | Satoshi Nakamura | Matteo Negri | Jan Niehues | Atul Kr. Ojha | John E. Ortega | Sara Papi | Pavel Pecina | Peter Polák | Piotr Połeć | Ashwin Sankar | Beatrice Savoldi | Nivedita Sethiya | Claytone Sikasote | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Brian Thompson | Marco Turchi | Alex Waibel | Patrick Wilken | Rodolfo Zevallos | Vilém Zouhar | Maike Züfle
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
This paper presents the outcomes of the shared tasks conducted at the 22nd International Workshop on Spoken Language Translation (IWSLT). The workshop addressed seven critical challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, model compression, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks garnered significant participation, with 32 teams submitting their runs. The field’s growing importance is reflected in the increasing diversity of shared task organizers and contributors to this overview paper, representing a balanced mix of industrial and academic institutions. This broad participation demonstrates the rising prominence of spoken language translation in both research and practical applications.
2024
Findings of the IWSLT 2024 Evaluation Campaign
Ibrahim Said Ahmad | Antonios Anastasopoulos | Ondřej Bojar | Claudia Borg | Marine Carpuat | Roldano Cattoni | Mauro Cettolo | William Chen | Qianqian Dong | Marcello Federico | Barry Haddow | Dávid Javorský | Mateusz Krubiński | Tsz Kin Lam | Xutai Ma | Prashant Mathur | Evgeny Matusov | Chandresh Maurya | John P. McCrae | Kenton Murray | Satoshi Nakamura | Matteo Negri | Jan Niehues | Xing Niu | Atul Kr. Ojha | John Ortega | Sara Papi | Peter Polák | Adam Pospíšil | Pavel Pecina | Elizabeth Salesky | Nivedita Sethiya | Balaram Sarkar | Jiatong Shi | Claytone Sikasote | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Brian Thompson | Alex Waibel | Shinji Watanabe | Patrick Wilken | Petr Zemánek | Rodolfo Zevallos
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 17 teams whose submissions are documented in 27 system papers. The growing interest in spoken language translation is also reflected in the steadily increasing number of shared task organizers and contributors to the overview paper, who are almost evenly distributed across industry and academia.
2023
BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
Claytone Sikasote | Eunice Mukonde | Md Mahfuz Ibn Alam | Antonios Anastasopoulos
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present BIG-C (Bemba Image Grounded Conversations), a large multimodal dataset for Bemba. While Bemba is the most populous language of Zambia, it exhibits a dearth of resources that renders the development of language technologies or language processing research almost impossible. The dataset comprises multi-turn dialogues between Bemba speakers based on images, transcribed and translated into English. There are more than 92,000 utterances/sentences, amounting to more than 180 hours of audio data with corresponding transcriptions and English translations. We also provide baselines on speech recognition (ASR), machine translation (MT) and speech translation (ST) tasks, and sketch out other potential future multimodal uses of our dataset. We hope that by making the dataset available to the research community, this work will foster research and encourage collaboration across the language, speech, and vision communities, especially for languages outside the “traditionally” used high-resourced ones. All data and code are publicly available: [https://github.com/csikasote/bigc](https://github.com/csikasote/bigc).
AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages
Odunayo Ogundepo | Tajuddeen R. Gwadabe | Clara E. Rivera | Jonathan H. Clark | Sebastian Ruder | David Ifeoluwa Adelani | Bonaventure F. P. Dossou | Abdou Aziz Diop | Claytone Sikasote | Gilles Hacheme | Happy Buzaaba | Ignatius Ezeani | Rooweither Mabuya | Salomey Osei | Chris Emezue | Albert Njoroge Kahira | Shamsuddeen Hassan Muhammad | Akintunde Oladipo | Abraham Toluwase Owodunni | Atnafu Lambebo Tonja | Iyanuoluwa Shode | Akari Asai | Tunde Oluwaseyi Ajayi | Clemencia Siro | Steven Arthur | Mofetoluwa Adeyemi | Orevaoghene Ahia | Anuoluwapo Aremu | Oyinkansola Awosan | Chiamaka Chukwuneke | Bernard Opoku | Awokoya Ayodele | Verrah Otiende | Christine Mwase | Boyd Sinkala | Andre Niyongabo Rubungo | Daniel A. Ajisafe | Emeka Felix Onwuegbuzia | Habib Mbow | Emile Niyomutabazi | Eunice Mukonde | Falalu Ibrahim Lawan | Ibrahim Said Ahmad | Jesujoba O. Alabi | Martin Namukombo | Mbonu Chinedu | Mofya Phiri | Neo Putini | Ndumiso Mngoma | Priscilla A. Amouk | Ruqayya Nasir Iro | Sonia Adhiambo
Findings of the Association for Computational Linguistics: EMNLP 2023
African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems, which retrieve answer content from other languages while serving people in their native language, offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.
2022
BembaSpeech: A Speech Recognition Corpus for the Bemba Language
Claytone Sikasote | Antonios Anastasopoulos
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present BembaSpeech, a preprocessed, ready-to-use automatic speech recognition (ASR) corpus consisting of over 24 hours of read speech in Bemba, a written but low-resourced language spoken by over 30% of the population of Zambia. To assess its usefulness for training and testing ASR systems for Bemba, we explored three approaches: supervised pre-training (training from scratch), cross-lingual transfer learning from a monolingual English pre-trained model using DeepSpeech on a portion of the dataset, and fine-tuning large-scale self-supervised Wav2Vec2.0-based multilingual pre-trained models on the complete BembaSpeech corpus. In our experiments, the 1-billion-parameter XLS-R model gives the best results, achieving a word error rate (WER) of 32.91%. The results demonstrate that model capacity significantly improves performance and that multilingual pre-trained models transfer cross-lingual acoustic representations better than a monolingual English pre-trained model for Bemba ASR. Lastly, the results show that the corpus can be used for building ASR systems for the Bemba language.
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer | Isaac Caswell | Lisa Wang | Ahsan Wahab | Daan van Esch | Nasanbayar Ulzii-Orshikh | Allahsera Tapo | Nishant Subramani | Artem Sokolov | Claytone Sikasote | Monang Setyawan | Supheakmungkol Sarin | Sokhar Samb | Benoît Sagot | Clara Rivera | Annette Rios | Isabel Papadimitriou | Salomey Osei | Pedro Ortiz Suarez | Iroro Orife | Kelechi Ogueji | Andre Niyongabo Rubungo | Toan Q. Nguyen | Mathias Müller | André Müller | Shamsuddeen Hassan Muhammad | Nanda Muhammad | Ayanda Mnyakeni | Jamshidbek Mirzakhalov | Tapiwanashe Matangira | Colin Leong | Nze Lawson | Sneha Kudugunta | Yacine Jernite | Mathias Jenny | Orhan Firat | Bonaventure F. P. Dossou | Sakhile Dlamini | Nisansa de Silva | Sakine Çabuk Ballı | Stella Biderman | Alessia Battisti | Ahmed Baruwa | Ankur Bapna | Pallavi Baljekar | Israel Abebe Azime | Ayodele Awokoya | Duygu Ataman | Orevaoghene Ahia | Oghenefego Ahia | Sweta Agrawal | Mofetoluwa Adeyemi
Transactions of the Association for Computational Linguistics, Volume 10
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
Co-authors
- Antonios Anastasopoulos 4
- Mofetoluwa Adeyemi 2
- Orevaoghene Ahia 2
- Ibrahim Said Ahmad 2
- Ondřej Bojar 2
- Claudia Borg 2
- Roldano Cattoni 2
- Mauro Cettolo 2
- William Chen 2
- Bonaventure F. P. Dossou 2
- Marcello Federico 2
- Dávid Javorský 2
- Mateusz Krubiński 2
- Tsz Kin Lam 2
- Evgeny Matusov 2
- John Philip McCrae 2
- Shamsuddeen Hassan Muhammad 2
- Eunice Mukonde 2
- Kenton Murray 2
- Satoshi Nakamura 2
- Matteo Negri 2
- Jan Niehues 2
- Atul Kr. Ojha 2
- Salomey Osei 2
- Sara Papi 2
- Pavel Pecina 2
- Peter Polák 2
- Andre Niyongabo Rubungo 2
- Nivedita Sethiya 2
- Matthias Sperber 2
- Sebastian Stüker 2
- Katsuhito Sudoh 2
- Brian Thompson 2
- Alex Waibel 2
- Patrick Wilken 2
- Rodolfo Zevallos 2
- Idris Abdulmumin 1
- David Ifeoluwa Adelani 1
- Sonia Adhiambo 1
- Victor Agostinelli 1
- Sweta Agrawal 1
- Oghenefego Ahia 1
- Tunde Oluwaseyi Ajayi 1
- Daniel A. Ajisafe 1
- Jesujoba Alabi 1
- Md Mahfuz Ibn Alam 1
- Tanel Alumäe 1
- Priscilla A. Amouk 1
- Anuoluwapo Aremu 1
- Steven Arthur 1
- Akari Asai 1
- Duygu Ataman 1
- Ayodele Awokoya 1
- Oyinkansola Awosan 1
- Awokoya Ayodele 1
- Israel Abebe Azime 1
- Pallavi Baljekar 1
- Ankur Bapna 1
- Ahmed Baruwa 1
- Alessia Battisti 1
- Luisa Bentivogli 1
- Stella Biderman 1
- Fethi Bougares 1
- Happy Buzaaba 1
- Marine Carpuat 1
- Isaac Caswell 1
- Lizhong Chen 1
- Mbonu Chinedu 1
- Chiamaka Chukwuneke 1
- Jonathan H. Clark 1
- Raj Dabre 1
- Nisansa De Silva 1
- Abdou Aziz Diop 1
- Sakhile Dlamini 1
- Qianqian Dong 1
- Chris Chinenye Emezue 1
- Yannick Estève 1
- Ignatius Ezeani 1
- Orhan Firat 1
- Mark Fishel 1
- Marco Gaido 1
- Tajuddeen R. Gwadabe 1
- Gilles Hacheme 1
- Barry Haddow 1
- Ruqayya Nasir Iro 1
- Mathias Jenny 1
- Yacine Jernite 1
- Albert Njoroge Kahira 1
- Marek Kasztelnik 1
- Fortuné Kponou 1
- Julia Kreutzer 1
- Sneha Kudugunta 1
- Falalu Ibrahim Lawan 1
- Nze Lawson 1
- Colin Leong 1
- Danni Liu 1
- Xutai Ma 1
- Rooweither Mabuya 1
- Tapiwanashe Matangira 1
- Prashant Mathur 1
- Chandresh Maurya 1
- Chandresh Kumar Maurya 1
- Habib Mbow 1
- Salima Mdhaffar 1
- Jamshidbek Mirzakhalov 1
- Ndumiso Mngoma 1
- Ayanda Mnyakeni 1
- Yasmin Moslem 1
- Nanda Muhammad 1
- Christine Mwase 1
- Mathias Müller 1
- André Müller 1
- Martin Namukombo 1
- Toan Q. Nguyen 1
- Xing Niu 1
- Emile Niyomutabazi 1
- Kelechi Ogueji 1
- Odunayo Ogundepo 1
- Akintunde Oladipo 1
- Emeka Felix Onwuegbuzia 1
- Bernard Opoku 1
- Iroro Orife 1
- John Ortega 1
- John E. Ortega 1
- Pedro Ortiz Suarez 1
- Verrah Otiende 1
- Abraham Toluwase Owodunni 1
- Isabel Papadimitriou 1
- Mofya Phiri 1
- Adam Pospíšil 1
- Piotr Połeć 1
- Neo Putini 1
- Annette Rios Gonzales 1
- Clara Rivera 1
- Clara E. Rivera 1
- Sebastian Ruder 1
- Benoît Sagot 1
- Elizabeth Salesky 1
- Sokhar Samb 1
- Ashwin Sankar 1
- Supheakmungkol Sarin 1
- Balaram Sarkar 1
- Beatrice Savoldi 1
- Monang Setyawan 1
- Jiatong Shi 1
- Iyanuoluwa Shode 1
- Boyd Sinkala 1
- Clemencia Siro 1
- Artem Sokolov 1
- Nishant Subramani 1
- Allahsera Tapo 1
- Atnafu Lambebo Tonja 1
- Marco Turchi 1
- Nasanbayar Ulzii-Orshikh 1
- Ahsan Wahab 1
- Lisa Wang 1
- Shinji Watanabe 1
- Petr Zemánek 1
- Vilém Zouhar 1
- Maike Züfle 1
- Daan van Esch 1
- Sakine Çabuk Ballı 1