Isaiah Edri W. Flores - ACL Anthology

Isaiah Edri W. Flores

2025

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya | Holy Lovenia | Joel Ruben Antony Moniz | Tack Hwa Wong | Mohammad Rifqi Farhansyah | Thant Thiri Maung | Frederikus Hudi | David Anugraha | Muhammad Ravi Shulthan Habibi | Muhammad Reza Qorib | Amit Agarwal | Joseph Marvin Imperial | Hitesh Laxmichand Patel | Vicky Feliren | Bahrul Ilmi Nasution | Manuel Antonio Rufino | Genta Indra Winata | Rian Adam Rajagede | Carlos Rafael Catalan | Mohamed Fazli Mohamed Imam | Priyaranjan Pattnayak | Salsabila Zahirah Pranida | Kevin Pratama | Yeshil Bangera | Adisai Na-Thalang | Patricia Nicole Monderin | Yueqi Song | Christian Simon | Lynnette Hui Xian Ng | Richardy Lobo Sapan | Taki Hasan Rafi | Bin Wang | Supryadi | Kanyakorn Veerakanjana | Piyalitt Ittichaiwong | Matthew Theodore Roque | Karissa Vincentio | Takdanai Kreangphet | Phakphum Artkaew | Kadek Hendrawan Palgunadi | Yanzhi Yu | Rochana Prih Hastuti | William Nixon | Mithil Bangera | Adrian Xuan Wei Lim | Aye Hninn Khine | Hanif Muhammad Zhafran | Teddy Ferdinan | Audra Aurora Izzani | Ayushman Singh | Evan Evan | Jauza Akbar Krito | Michael Anugraha | Fenal Ashokbhai Ilasariya | Haochen Li | John Amadeo Daniswara | Filbert Aurelian Tjiaranata | Eryawan Presma Yulianrifat | Can Udomcharoenchaikit | Fadil Risdian Ansori | Mahardika Krisna Ihsani | Giang Nguyen | Anab Maulana Barik | Dan John Velasco | Rifo Ahmad Genadi | Saptarshi Saha | Chengwei Wei | Isaiah Edri W. Flores | Kenneth Chen Ko Han | Anjela Gail D. Santos | Wan Shen Lim | Kaung Si Phyo | Tim Santos | Meisyarah Dwiastuti | Jiayun Luo | Jan Christian Blaise Cruz | Ming Shan Hee | Ikhlasul Akmal Hanif | M.Alif Al Hakim | Muhammad Rizky Sya’ban | Kun Kerdthaisong | Lester James Validad Miranda | Fajri Koto | Tirana Noor Fatyanosa | Alham Fikri Aji | Jostin Jerico Rosal | Jun Kevin | Robert Wijaya | Onno P. Kampman | Ruochen Zhang | Börje F. Karlsson | Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million SEA culturally relevant images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.

2024

Zero-shot Cross-lingual POS Tagging for Filipino
Jimson Paulo Layacan | Isaiah Edri W. Flores | Katrina Bernice M. Tan | Ma. Regina E. Estuar | Jann Railey E. Montalan | Marlene M. De Leon
Proceedings of the Third Workshop on NLP Applications to Field Linguistics

Supervised learning approaches in NLP, exemplified by POS tagging, rely heavily on the presence of large amounts of annotated data. However, acquiring such data often requires a significant amount of resources and incurs high costs. In this work, we explore zero-shot cross-lingual transfer learning to address data scarcity issues in Filipino POS tagging, particularly focusing on optimizing source language selection. Our zero-shot approach demonstrates superior performance compared to previous studies, with top-performing fine-tuned PLMs achieving F1 scores as high as 79.10%. The analysis reveals moderate correlations between cross-lingual transfer performance and specific linguistic distances (featural, inventory, and syntactic), suggesting that source languages with these features closer to Filipino provide better results. We identify tokenizer optimization as a key challenge, as PLM tokenization sometimes fails to align with meaningful representations, thus hindering POS tagging performance.

Co-authors

Michael Anugraha 1

Phakphum Artkaew 1

Mithil Bangera 1

Yeshil Bangera 1

Anab Maulana Barik 1

Samuel Cahyawijaya 1

Carlos Rafael Catalan 1

Jan Christian Blaise Cruz 1

John Amadeo Daniswara 1

Marlene M. De Leon 1

Meisyarah Dwiastuti 1

Ma. Regina E. Estuar 1

Mohammad Rifqi Farhansyah 1

Tirana Noor Fatyanosa 1

Vicky Feliren 1

Teddy Ferdinan 1

Rifo Ahmad Genadi 1

Muhammad Ravi Shulthan Habibi 1

Kenneth Chen Ko Han 1

Ikhlasul Akmal Hanif 1

Rochana Prih Hastuti 1

Ming Shan Hee 1

Frederikus Hudi 1

Mahardika Krisna Ihsani 1

Fenal Ashokbhai Ilasariya 1

Mohamed Fazli Mohamed Imam 1

Joseph Marvin Imperial 1

Piyalitt Ittichaiwong 1

Audra Aurora Izzani 1

Onno P. Kampman 1

Börje F. Karlsson 1

Kun Kerdthaisong 1

Aye Hninn Khine 1

Takdanai Kreangphet 1

Jauza Akbar Krito 1

Jimson Paulo Layacan 1

Adrian Xuan Wei Lim 1

Peerat Limkonchotiwat 1

Thant Thiri Maung 1

Lester James Validad Miranda 1

Patricia Nicole Monderin 1

Joel Ruben Antony Moniz 1

Jann Railey Montalan 1

Adisai Na-Thalang 1

Bahrul Ilmi Nasution 1

Lynnette Hui Xian Ng 1

William Nixon 1

Kadek Hendrawan Palgunadi 1

Hitesh Laxmichand Patel 1

Priyaranjan Pattnayak 1

Kaung Si Phyo 1

Salsabila Zahirah Pranida 1

Kevin Pratama 1

Muhammad Reza Qorib 1

Taki Hasan Rafi 1

Rian Adam Rajagede 1

Matthew Theodore Roque 1

Jostin Jerico Rosal 1

Manuel Antonio Rufino 1

Saptarshi Saha 1

Anjela Gail D. Santos 1

Richardy Lobo Sapan 1

Christian Simon 1

Ayushman Singh 1

Muhammad Rizky Sya’ban 1

Katrina Bernice M. Tan 1

Filbert Aurelian Tjiaranata 1

Can Udomcharoenchaikit 1

Kanyakorn Veerakanjana 1

Dan John Velasco 1

Karissa Vincentio 1

Robert Wijaya 1

Genta Indra Winata 1

Tack Hwa Wong 1

Eryawan Presma Yulianrifat 1

Hanif Muhammad Zhafran 1

Ruochen Zhang 1

Venues