Karissa Vincentio
2025
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya | Holy Lovenia | Joel Ruben Antony Moniz | Tack Hwa Wong | Mohammad Rifqi Farhansyah | Thant Thiri Maung | Frederikus Hudi | David Anugraha | Muhammad Ravi Shulthan Habibi | Muhammad Reza Qorib | Amit Agarwal | Joseph Marvin Imperial | Hitesh Laxmichand Patel | Vicky Feliren | Bahrul Ilmi Nasution | Manuel Antonio Rufino | Genta Indra Winata | Rian Adam Rajagede | Carlos Rafael Catalan | Mohamed Fazli Mohamed Imam | Priyaranjan Pattnayak | Salsabila Zahirah Pranida | Kevin Pratama | Yeshil Bangera | Adisai Na-Thalang | Patricia Nicole Monderin | Yueqi Song | Christian Simon | Lynnette Hui Xian Ng | Richardy Lobo Sapan | Taki Hasan Rafi | Bin Wang | Supryadi | Kanyakorn Veerakanjana | Piyalitt Ittichaiwong | Matthew Theodore Roque | Karissa Vincentio | Takdanai Kreangphet | Phakphum Artkaew | Kadek Hendrawan Palgunadi | Yanzhi Yu | Rochana Prih Hastuti | William Nixon | Mithil Bangera | Adrian Xuan Wei Lim | Aye Hninn Khine | Hanif Muhammad Zhafran | Teddy Ferdinan | Audra Aurora Izzani | Ayushman Singh | Evan Evan | Jauza Akbar Krito | Michael Anugraha | Fenal Ashokbhai Ilasariya | Haochen Li | John Amadeo Daniswara | Filbert Aurelian Tjiaranata | Eryawan Presma Yulianrifat | Can Udomcharoenchaikit | Fadil Risdian Ansori | Mahardika Krisna Ihsani | Giang Nguyen | Anab Maulana Barik | Dan John Velasco | Rifo Ahmad Genadi | Saptarshi Saha | Chengwei Wei | Isaiah Edri W. Flores | Kenneth Chen Ko Han | Anjela Gail D. Santos | Wan Shen Lim | Kaung Si Phyo | Tim Santos | Meisyarah Dwiastuti | Jiayun Luo | Jan Christian Blaise Cruz | Ming Shan Hee | Ikhlasul Akmal Hanif | M.Alif Al Hakim | Muhammad Rizky Sya’ban | Kun Kerdthaisong | Lester James Validad Miranda | Fajri Koto | Tirana Noor Fatyanosa | Alham Fikri Aji | Jostin Jerico Rosal | Jun Kevin | Robert Wijaya | Onno P. Kampman | Ruochen Zhang | Börje F. Karlsson | Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million SEA culturally relevant images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.
2023
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Samuel Cahyawijaya | Holy Lovenia | Alham Fikri Aji | Genta Winata | Bryan Wilie | Fajri Koto | Rahmad Mahendra | Christian Wibisono | Ade Romadhony | Karissa Vincentio | Jennifer Santoso | David Moeljadi | Cahya Wirawan | Frederikus Hudi | Muhammad Satrio Wicaksono | Ivan Parmonangan | Ika Alfina | Ilham Firdausi Putra | Samsul Rahmadani | Yulianti Oenang | Ali Septiandri | James Jaya | Kaustubh Dhole | Arie Suryani | Rifki Afina Putri | Dan Su | Keith Stevens | Made Nindyatama Nityasya | Muhammad Adilazuarda | Ryan Hadiwijaya | Ryandito Diandaru | Tiezheng Yu | Vito Ghifari | Wenliang Dai | Yan Xu | Dyah Damapuspita | Haryo Wibowo | Cuk Tho | Ichwanul Karo Karo | Tirana Fatyanosa | Ziwei Ji | Graham Neubig | Timothy Baldwin | Sebastian Ruder | Pascale Fung | Herry Sujaini | Sakriani Sakti | Ayu Purwarianti
Findings of the Association for Computational Linguistics: ACL 2023
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd’s data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
2021
IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation
Samuel Cahyawijaya | Genta Indra Winata | Bryan Wilie | Karissa Vincentio | Xiaohong Li | Adhiguna Kuncoro | Sebastian Ruder | Zhi Yuan Lim | Syafri Bahar | Masayu Khodra | Ayu Purwarianti | Pascale Fung
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Natural language generation (NLG) benchmarks provide an important avenue to measure progress and develop better NLG systems. Unfortunately, the lack of publicly available NLG benchmarks for low-resource languages poses a challenging barrier for building NLG systems that work well for languages with limited amounts of data. Here we introduce IndoNLG, the first benchmark to measure natural language generation (NLG) progress in three low-resource—yet widely spoken—languages of Indonesia: Indonesian, Javanese, and Sundanese. Altogether, these languages are spoken by more than 100 million native speakers, and hence constitute an important use case of NLG systems today. Concretely, IndoNLG covers six tasks: summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks. We collate a clean pretraining corpus of Indonesian, Sundanese, and Javanese datasets, Indo4B-Plus, which is used to pretrain our models: IndoBART and IndoGPT. We show that IndoBART and IndoGPT achieve competitive performance on all tasks—despite using only one-fifth the parameters of a larger multilingual model, mBART-large (Liu et al., 2020). This finding emphasizes the importance of pretraining on closely related, localized languages to achieve more efficient learning and faster inference at very low-resource languages like Javanese and Sundanese.
2020
IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding
Bryan Wilie | Karissa Vincentio | Genta Indra Winata | Samuel Cahyawijaya | Xiaohong Li | Zhi Yuan Lim | Sidik Soleman | Rahmad Mahendra | Pascale Fung | Syafri Bahar | Ayu Purwarianti
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Although Indonesian is known to be the fourth most frequently used language over the internet, the research progress on this language in natural language processing (NLP) is slow-moving due to a lack of available resources. In response, we introduce the first-ever vast resource for training, evaluation, and benchmarking on Indonesian natural language understanding (IndoNLU) tasks. IndoNLU includes twelve tasks, ranging from single sentence classification to pair-sentences sequence labeling with different levels of complexity. The datasets for the tasks lie in different domains and styles to ensure task diversity. We also provide a set of Indonesian pre-trained models (IndoBERT) trained from a large and clean Indonesian dataset (Indo4B) collected from publicly available sources such as social media texts, blogs, news, and websites. We release baseline models for all twelve tasks, as well as the framework for benchmark evaluation, thus enabling everyone to benchmark their system performances.
Co-authors
- Samuel Cahyawijaya 4
- Genta Indra Winata 4
- Pascale Fung 3
- Ayu Purwarianti 3
- Bryan Wilie 3
- Alham Fikri Aji 2
- Syafri Bahar 2
- Tirana Noor Fatyanosa 2
- Frederikus Hudi 2
- Fajri Koto 2
- Xiaohong Li 2
- Zhi Yuan Lim 2
- Holy Lovenia 2
- Rahmad Mahendra 2
- Sebastian Ruder 2
- Muhammad Adilazuarda 1
- Amit Agarwal 1
- Ika Alfina 1
- Fadil Risdian Ansori 1
- David Anugraha 1
- Michael Anugraha 1
- Phakphum Artkaew 1
- Timothy Baldwin 1
- Yeshil Bangera 1
- Mithil Bangera 1
- Anab Maulana Barik 1
- Carlos Rafael Catalan 1
- Jan Christian Blaise Cruz 1
- Wenliang Dai 1
- Dyah Damapuspita 1
- John Amadeo Daniswara 1
- Kaustubh Dhole 1
- Ryandito Diandaru 1
- Meisyarah Dwiastuti 1
- Evan Evan 1
- Mohammad Rifqi Farhansyah 1
- Vicky Feliren 1
- Teddy Ferdinan 1
- Isaiah Edri W. Flores 1
- Rifo Ahmad Genadi 1
- Vito Ghifari 1
- Muhammad Ravi Shulthan Habibi 1
- Ryan Hadiwijaya 1
- M.Alif Al Hakim 1
- Kenneth Chen Ko Han 1
- Ikhlasul Akmal Hanif 1
- Rochana Prih Hastuti 1
- Ming Shan Hee 1
- Mahardika Krisna Ihsani 1
- Fenal Ashokbhai Ilasariya 1
- Mohamed Fazli Mohamed Imam 1
- Joseph Marvin Imperial 1
- Piyalitt Ittichaiwong 1
- Audra Aurora Izzani 1
- James Jaya 1
- Ziwei Ji 1
- Onno P. Kampman 1
- Börje F. Karlsson 1
- Ichwanul Karo Karo 1
- Kun Kerdthaisong 1
- Jun Kevin 1
- Aye Hninn Khine 1
- Masayu Khodra 1
- Takdanai Kreangphet 1
- Jauza Akbar Krito 1
- Adhiguna Kuncoro 1
- Haochen Li 1
- Adrian Xuan Wei Lim 1
- Wan Shen Lim 1
- Peerat Limkonchotiwat 1
- Jiayun Luo 1
- Thant Thiri Maung 1
- Lester James Validad Miranda 1
- David Moeljadi 1
- Patricia Nicole Monderin 1
- Joel Ruben Antony Moniz 1
- Adisai Na-Thalang 1
- Bahrul Ilmi Nasution 1
- Graham Neubig 1
- Lynnette Hui Xian Ng 1
- Giang Nguyen 1
- Made Nindyatama Nityasya 1
- William Nixon 1
- Yulianti Oenang 1
- Kadek Hendrawan Palgunadi 1
- Ivan Parmonangan 1
- Hitesh Laxmichand Patel 1
- Priyaranjan Pattnayak 1
- Kaung Si Phyo 1
- Salsabila Zahirah Pranida 1
- Kevin Pratama 1
- Ilham Firdausi Putra 1
- Rifki Afina Putri 1
- Muhammad Reza Qorib 1
- Taki Hasan Rafi 1
- Samsul Rahmadani 1
- Rian Adam Rajagede 1
- Ade Romadhony 1
- Matthew Theodore Roque 1
- Jostin Jerico Rosal 1
- Manuel Antonio Rufino 1
- Saptarshi Saha 1
- Sakriani Sakti 1
- Anjela Gail D. Santos 1
- Tim Santos 1
- Jennifer Santoso 1
- Richardy Lobo Sapan 1
- Ali Septiandri 1
- Christian Simon 1
- Ayushman Singh 1
- Sidik Soleman 1
- Yueqi Song 1
- Keith Stevens 1
- Dan Su 1
- Herry Sujaini 1
- Supryadi 1
- Arie Suryani 1
- Muhammad Rizky Sya’ban 1
- Cuk Tho 1
- Filbert Aurelian Tjiaranata 1
- Can Udomcharoenchaikit 1
- Kanyakorn Veerakanjana 1
- Dan John Velasco 1
- Bin Wang 1
- Chengwei Wei 1
- Christian Wibisono 1
- Haryo Wibowo 1
- Muhammad Satrio Wicaksono 1
- Robert Wijaya 1
- Cahya Wirawan 1
- Tack Hwa Wong 1
- Yan Xu 1
- Tiezheng Yu 1
- Yanzhi Yu 1
- Eryawan Presma Yulianrifat 1
- Hanif Muhammad Zhafran 1
- Ruochen Zhang 1