James Jaya - ACL Anthology

2024

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia | Rahmad Mahendra | Salsabil Maulana Akbar | Lester James V. Miranda | Jennifer Santoso | Elyanah Aco | Akhdan Fadhilah | Jonibek Mansurov | Joseph Marvin Imperial | Onno P. Kampman | Joel Ruben Antony Moniz | Muhammad Ravi Shulthan Habibi | Frederikus Hudi | Railey Montalan | Ryan Ignatius | Joanito Agili Lopo | William Nixon | Börje F. Karlsson | James Jaya | Ryandito Diandaru | Yuze Gao | Patrick Amadeus | Bin Wang | Jan Christian Blaise Cruz | Chenxi Whitehouse | Ivan Halim Parmonangan | Maria Khelli | Wenyu Zhang | Lucky Susanto | Reynard Adha Ryanda | Sonny Lazuardi Hermawan | Dan John Velasco | Muhammad Dehan Al Kautsar | Willy Fitra Hendria | Yasmin Moslem | Noah Flynn | Muhammad Farid Adilazuarda | Haochen Li | Johanes Lee | R. Damanhuri | Shuo Sun | Muhammad Reza Qorib | Amirbek Djanibekov | Wei Qi Leong | Quyet V. Do | Niklas Muennighoff | Tanrada Pansuwan | Ilham Firdausi Putra | Yan Xu | Tai Ngee Chia | Ayu Purwarianti | Sebastian Ruder | William Tjhi | Peerat Limkonchotiwat | Alham Fikri Aji | Sedrick Keh | Genta Indra Winata | Ruochen Zhang | Fajri Koto | Zheng-Xin Yong | Samuel Cahyawijaya
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, through a collaborative movement, we introduce SEACrowd, a comprehensive resource center that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in Southeast Asia.

2023

NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Samuel Cahyawijaya | Holy Lovenia | Alham Fikri Aji | Genta Winata | Bryan Wilie | Fajri Koto | Rahmad Mahendra | Christian Wibisono | Ade Romadhony | Karissa Vincentio | Jennifer Santoso | David Moeljadi | Cahya Wirawan | Frederikus Hudi | Muhammad Satrio Wicaksono | Ivan Parmonangan | Ika Alfina | Ilham Firdausi Putra | Samsul Rahmadani | Yulianti Oenang | Ali Septiandri | James Jaya | Kaustubh Dhole | Arie Suryani | Rifki Afina Putri | Dan Su | Keith Stevens | Made Nindyatama Nityasya | Muhammad Adilazuarda | Ryan Hadiwijaya | Ryandito Diandaru | Tiezheng Yu | Vito Ghifari | Wenliang Dai | Yan Xu | Dyah Damapuspita | Haryo Wibowo | Cuk Tho | Ichwanul Karo Karo | Tirana Fatyanosa | Ziwei Ji | Graham Neubig | Timothy Baldwin | Sebastian Ruder | Pascale Fung | Herry Sujaini | Sakriani Sakti | Ayu Purwarianti
Findings of the Association for Computational Linguistics: ACL 2023

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd’s data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.

Co-authors

Rahmad Mahendra 2

Ayu Purwarianti 2

Ilham Firdausi Putra 2

Sebastian Ruder 2

Jennifer Santoso 2

Genta Indra Winata 2

Muhammad Adilazuarda 1

Muhammad Farid Adilazuarda 1

Salsabil Maulana Akbar 1

Muhammad Dehan Al Kautsar 1

Patrick Amadeus 1

Timothy Baldwin 1

Tai Ngee Chia 1

Jan Christian Blaise Cruz 1

Dyah Damapuspita 1

Kaustubh Dhole 1

Amirbek Djanibekov 1

Akhdan Fadhilah 1

Tirana Noor Fatyanosa 1

Muhammad Ravi Shulthan Habibi 1

Ryan Hadiwijaya 1

Willy Fitra Hendria 1

Sonny Lazuardi Hermawan 1

Ryan Ignatius 1

Joseph Marvin Imperial 1

Onno P. Kampman 1

Börje F. Karlsson 1

Ichwanul Karo Karo 1

Peerat Limkonchotiwat 1

Joanito Agili Lopo 1

Jonibek Mansurov 1

Lester James Validad Miranda 1

David Moeljadi 1

Joel Ruben Antony Moniz 1

Jann Railey Montalan 1

Yasmin Moslem 1

Niklas Muennighoff 1

Graham Neubig 1

Made Nindyatama Nityasya 1

William Nixon 1

Yulianti Oenang 1

Tanrada Pansuwan 1

Ivan Parmonangan 1

Ivan Halim Parmonangan 1

Rifki Afina Putri 1

Muhammad Reza Qorib 1

Samsul Rahmadani 1

Ade Romadhony 1

Reynard Adha Ryanda 1

Sakriani Sakti 1

Ali Septiandri 1

Keith Stevens 1

Herry Sujaini 1

Lucky Susanto 1

Dan John Velasco 1

Karissa Vincentio 1

Chenxi Whitehouse 1

Christian Wibisono 1

Muhammad Satrio Wicaksono 1

Cahya Wirawan 1

Zheng Xin Yong 1

Ruochen Zhang 1

Venues

EMNLP 1

Findings 1