SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James Validad Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Jann Railey Montalan, Ryan Ignatius Hadiwijaya, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus Irawan, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, Ivan Halim Parmonangan, Maria Khelli, Wenyu Zhang, Lucky Susanto, Reynard Adha Ryanda, Sonny Lazuardi Hermawan, Dan John Velasco, Muhammad Dehan Al Kautsar, Willy Fitra Hendria, Yasmin Moslem, Noah Flynn, Muhammad Farid Adilazuarda, Haochen Li, Johanes Lee, R. Damanhuri, Shuo Sun, Muhammad Reza Qorib, Amirbek Djanibekov, Wei Qi Leong, Quyet V. Do, Niklas Muennighoff, Tanrada Pansuwan, Ilham Firdausi Putra, Yan Xu, Tai Ngee Chia, Ayu Purwarianti, Sebastian Ruder, William Chandra Tjhi, Peerat Limkonchotiwat, Alham Fikri Aji, Sedrick Keh, Genta Indra Winata, Ruochen Zhang, Fajri Koto, Zheng Xin Yong, Samuel Cahyawijaya
Abstract
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, through a collaborative movement, we introduce SEACrowd, a comprehensive resource center that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in Southeast Asia.
- Anthology ID:
- 2024.emnlp-main.296
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5155–5203
- URL:
- https://aclanthology.org/2024.emnlp-main.296
- DOI:
- 10.18653/v1/2024.emnlp-main.296
- Bibkey:
- lovenia-etal-2024-seacrowd
- Cite (ACL):
- Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James Validad Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Jann Railey Montalan, Ryan Ignatius Hadiwijaya, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, et al. 2024. SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5155–5203, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages (Lovenia et al., EMNLP 2024)
- PDF:
- https://aclanthology.org/2024.emnlp-main.296.pdf
- Software:
- 2024.emnlp-main.296.software.zip
- Data:
- 2024.emnlp-main.296.data.zip
Export citation
@inproceedings{lovenia-etal-2024-seacrowd,
    title = "{SEAC}rowd: A Multilingual Multimodal Data Hub and Benchmark Suite for {S}outheast {A}sian Languages",
    author = {Lovenia, Holy and Mahendra, Rahmad and Akbar, Salsabil Maulana and Miranda, Lester James Validad and Santoso, Jennifer and Aco, Elyanah and Fadhilah, Akhdan and Mansurov, Jonibek and Imperial, Joseph Marvin and Kampman, Onno P. and Moniz, Joel Ruben Antony and Habibi, Muhammad Ravi Shulthan and Hudi, Frederikus and Montalan, Jann Railey and Hadiwijaya, Ryan Ignatius and Lopo, Joanito Agili and Nixon, William and Karlsson, B{\"o}rje F. and Jaya, James and Diandaru, Ryandito and Gao, Yuze and Irawan, Patrick Amadeus and Wang, Bin and Cruz, Jan Christian Blaise and Whitehouse, Chenxi and Parmonangan, Ivan Halim and Khelli, Maria and Zhang, Wenyu and Susanto, Lucky and Ryanda, Reynard Adha and Hermawan, Sonny Lazuardi and Velasco, Dan John and Kautsar, Muhammad Dehan Al and Hendria, Willy Fitra and Moslem, Yasmin and Flynn, Noah and Adilazuarda, Muhammad Farid and Li, Haochen and Lee, Johanes and Damanhuri, R. and Sun, Shuo and Qorib, Muhammad Reza and Djanibekov, Amirbek and Leong, Wei Qi and Do, Quyet V. and Muennighoff, Niklas and Pansuwan, Tanrada and Putra, Ilham Firdausi and Xu, Yan and Chia, Tai Ngee and Purwarianti, Ayu and Ruder, Sebastian and Tjhi, William Chandra and Limkonchotiwat, Peerat and Aji, Alham Fikri and Keh, Sedrick and Winata, Genta Indra and Zhang, Ruochen and Koto, Fajri and Yong, Zheng Xin and Cahyawijaya, Samuel},
    editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.296",
    doi = "10.18653/v1/2024.emnlp-main.296",
    pages = "5155--5203",
    abstract = "Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, through a collaborative movement, we introduce SEACrowd, a comprehensive resource center that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in Southeast Asia.",
}