Muhammad Ravi Shulthan Habibi
2025
SMOL: Professionally Translated Parallel Data for 115 Under-represented Languages
Isaac Caswell | Elizabeth Nielsen | Jiaming Luo | Colin Cherry | Geza Kovacs | Hadar Shemtov | Partha Talukdar | Dinesh Tewari | Baba Mamadi Diane | Djibrila Diane | Solo Farabado Cissé | Koulako Moussa Doumbouya | Edoardo Ferrante | Alessandro Guasoni | Christopher Homan | Mamadou K. Keita | Sudhamoy DebBarma | Ali Kuzhuget | David Anugraha | Muhammad Ravi Shulthan Habibi | Sina Ahmadi | Anthony Munthali | Jonathan Mingfei Liu | Jonathan Eng
Proceedings of the Tenth Conference on Machine Translation
We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock machine translation for low-resource languages (LRLs). SMOL has been translated into 123 under-resourced languages (125 language pairs), including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOLSENT, a set of sentences chosen for broad unique token coverage, and SMOLDOC, a document-level source focusing on broad topic coverage. They join the already released GATITOS for a trifecta of paragraph-, sentence-, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust chrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOLDOC, yielding the first factuality datasets for most of these languages.
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya | Holy Lovenia | Joel Ruben Antony Moniz | Tack Hwa Wong | Mohammad Rifqi Farhansyah | Thant Thiri Maung | Frederikus Hudi | David Anugraha | Muhammad Ravi Shulthan Habibi | Muhammad Reza Qorib | Amit Agarwal | Joseph Marvin Imperial | Hitesh Laxmichand Patel | Vicky Feliren | Bahrul Ilmi Nasution | Manuel Antonio Rufino | Genta Indra Winata | Rian Adam Rajagede | Carlos Rafael Catalan | Mohamed Fazli Mohamed Imam | Priyaranjan Pattnayak | Salsabila Zahirah Pranida | Kevin Pratama | Yeshil Bangera | Adisai Na-Thalang | Patricia Nicole Monderin | Yueqi Song | Christian Simon | Lynnette Hui Xian Ng | Richardy Lobo Sapan | Taki Hasan Rafi | Bin Wang | Supryadi | Kanyakorn Veerakanjana | Piyalitt Ittichaiwong | Matthew Theodore Roque | Karissa Vincentio | Takdanai Kreangphet | Phakphum Artkaew | Kadek Hendrawan Palgunadi | Yanzhi Yu | Rochana Prih Hastuti | William Nixon | Mithil Bangera | Adrian Xuan Wei Lim | Aye Hninn Khine | Hanif Muhammad Zhafran | Teddy Ferdinan | Audra Aurora Izzani | Ayushman Singh | Evan Evan | Jauza Akbar Krito | Michael Anugraha | Fenal Ashokbhai Ilasariya | Haochen Li | John Amadeo Daniswara | Filbert Aurelian Tjiaranata | Eryawan Presma Yulianrifat | Can Udomcharoenchaikit | Fadil Risdian Ansori | Mahardika Krisna Ihsani | Giang Nguyen | Anab Maulana Barik | Dan John Velasco | Rifo Ahmad Genadi | Saptarshi Saha | Chengwei Wei | Isaiah Edri W. Flores | Kenneth Chen Ko Han | Anjela Gail D. Santos | Wan Shen Lim | Kaung Si Phyo | Tim Santos | Meisyarah Dwiastuti | Jiayun Luo | Jan Christian Blaise Cruz | Ming Shan Hee | Ikhlasul Akmal Hanif | M.Alif Al Hakim | Muhammad Rizky Sya’ban | Kun Kerdthaisong | Lester James Validad Miranda | Fajri Koto | Tirana Noor Fatyanosa | Alham Fikri Aji | Jostin Jerico Rosal | Jun Kevin | Robert Wijaya | Onno P. Kampman | Ruochen Zhang | Börje F. Karlsson | Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million SEA culturally relevant images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.
Language Surgery in Multilingual Large Language Models
Joanito Agili Lopo | Muhammad Ravi Shulthan Habibi | Tack Hwa Wong | Muhammad Ilham Ghozali | Fajri Koto | Genta Indra Winata | Peerat Limkonchotiwat | Alham Fikri Aji | Samuel Cahyawijaya
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existence of this alignment, analyze its behavior in comparison to explicitly designed alignment models, and demonstrate its potential for language-specific manipulation without semantic degradation. Building on these findings, we propose Inference-Time Language Control (ITLC), a novel method that leverages latent injection to enable precise cross-lingual language control and mitigate language confusion in LLMs. Our experiments highlight ITLC’s strong cross-lingual control capabilities while preserving semantic integrity in target languages. Furthermore, we demonstrate its effectiveness in alleviating the cross-lingual language confusion problem, which persists even in current large-scale LLMs, leading to inconsistent language generation. This work advances our understanding of representation alignment in LLMs and introduces a practical solution for enhancing their monolingual and cross-lingual performance.
2024
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia | Rahmad Mahendra | Salsabil Maulana Akbar | Lester James V. Miranda | Jennifer Santoso | Elyanah Aco | Akhdan Fadhilah | Jonibek Mansurov | Joseph Marvin Imperial | Onno P. Kampman | Joel Ruben Antony Moniz | Muhammad Ravi Shulthan Habibi | Frederikus Hudi | Railey Montalan | Ryan Ignatius | Joanito Agili Lopo | William Nixon | Börje F. Karlsson | James Jaya | Ryandito Diandaru | Yuze Gao | Patrick Amadeus | Bin Wang | Jan Christian Blaise Cruz | Chenxi Whitehouse | Ivan Halim Parmonangan | Maria Khelli | Wenyu Zhang | Lucky Susanto | Reynard Adha Ryanda | Sonny Lazuardi Hermawan | Dan John Velasco | Muhammad Dehan Al Kautsar | Willy Fitra Hendria | Yasmin Moslem | Noah Flynn | Muhammad Farid Adilazuarda | Haochen Li | Johanes Lee | R. Damanhuri | Shuo Sun | Muhammad Reza Qorib | Amirbek Djanibekov | Wei Qi Leong | Quyet V. Do | Niklas Muennighoff | Tanrada Pansuwan | Ilham Firdausi Putra | Yan Xu | Tai Ngee Chia | Ayu Purwarianti | Sebastian Ruder | William Tjhi | Peerat Limkonchotiwat | Alham Fikri Aji | Sedrick Keh | Genta Indra Winata | Ruochen Zhang | Fajri Koto | Zheng-Xin Yong | Samuel Cahyawijaya
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, through a collaborative movement, we introduce SEACrowd, a comprehensive resource center that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in Southeast Asia.
Co-authors
- Alham Fikri Aji 3
- Samuel Cahyawijaya 3
- Fajri Koto 3
- Peerat Limkonchotiwat 3
- Genta Indra Winata 3
- David Anugraha 2
- Jan Christian Blaise Cruz 2
- Frederikus Hudi 2
- Joseph Marvin Imperial 2
- Onno P. Kampman 2
- Börje F. Karlsson 2
- Haochen Li 2
- Joanito Agili Lopo 2
- Holy Lovenia 2
- Lester James Validad Miranda 2
- Joel Ruben Antony Moniz 2
- William Nixon 2
- Muhammad Reza Qorib 2
- Dan John Velasco 2
- Bin Wang 2
- Tack Hwa Wong 2
- Ruochen Zhang 2
- Elyanah Aco 1
- Muhammad Farid Adilazuarda 1
- Amit Agarwal 1
- Sina Ahmadi 1
- Salsabil Maulana Akbar 1
- Muhammad Dehan Al Kautsar 1
- Patrick Amadeus 1
- Fadil Risdian Ansori 1
- Michael Anugraha 1
- Phakphum Artkaew 1
- Yeshil Bangera 1
- Mithil Bangera 1
- Anab Maulana Barik 1
- Isaac Caswell 1
- Carlos Rafael Catalan 1
- Colin Cherry 1
- Tai Ngee Chia 1
- Solo Farabado Cissé 1
- R. Damanhuri 1
- John Amadeo Daniswara 1
- Sudhamoy DebBarma 1
- Ryandito Diandaru 1
- Baba Mamadi Diané 1
- Djibrila Diané 1
- Amirbek Djanibekov 1
- Quyet V. Do 1
- Koulako Moussa Doumbouya 1
- Meisyarah Dwiastuti 1
- Jonathan Eng 1
- Evan Evan 1
- Akhdan Fadhilah 1
- Mohammad Rifqi Farhansyah 1
- Tirana Noor Fatyanosa 1
- Vicky Feliren 1
- Teddy Ferdinan 1
- Edoardo Ferrante 1
- Isaiah Edri W. Flores 1
- Noah Flynn 1
- Yuze Gao 1
- Rifo Ahmad Genadi 1
- Muhammad Ilham Ghozali 1
- Alessandro Guasoni 1
- M.Alif Al Hakim 1
- Kenneth Chen Ko Han 1
- Ikhlasul Akmal Hanif 1
- Rochana Prih Hastuti 1
- Ming Shan Hee 1
- Willy Fitra Hendria 1
- Sonny Lazuardi Hermawan 1
- Christopher Homan 1
- Ryan Ignatius 1
- Mahardika Krisna Ihsani 1
- Fenal Ashokbhai Ilasariya 1
- Mohamed Fazli Mohamed Imam 1
- Piyalitt Ittichaiwong 1
- Audra Aurora Izzani 1
- James Jaya 1
- Sedrick Keh 1
- Mamadou K. Keita 1
- Kun Kerdthaisong 1
- Jun Kevin 1
- Maria Khelli 1
- Aye Hninn Khine 1
- Geza Kovacs 1
- Takdanai Kreangphet 1
- Jauza Akbar Krito 1
- Ali Kuzhuget 1
- Johanes Lee 1
- Wei Qi Leong 1
- Adrian Xuan Wei Lim 1
- Wan Shen Lim 1
- Jonathan Mingfei Liu 1
- Jiaming Luo 1
- Jiayun Luo 1
- Rahmad Mahendra 1
- Jonibek Mansurov 1
- Thant Thiri Maung 1
- Patricia Nicole Monderin 1
- Jann Railey Montalan 1
- Yasmin Moslem 1
- Niklas Muennighoff 1
- Anthony Munthali 1
- Adisai Na-Thalang 1
- Bahrul Ilmi Nasution 1
- Lynnette Hui Xian Ng 1
- Giang Nguyen 1
- Elizabeth Nielsen 1
- Kadek Hendrawan Palgunadi 1
- Tanrada Pansuwan 1
- Ivan Halim Parmonangan 1
- Hitesh Laxmichand Patel 1
- Priyaranjan Pattnayak 1
- Kaung Si Phyo 1
- Salsabila Zahirah Pranida 1
- Kevin Pratama 1
- Ayu Purwarianti 1
- Ilham Firdausi Putra 1
- Taki Hasan Rafi 1
- Rian Adam Rajagede 1
- Matthew Theodore Roque 1
- Jostin Jerico Rosal 1
- Sebastian Ruder 1
- Manuel Antonio Rufino 1
- Reynard Adha Ryanda 1
- Saptarshi Saha 1
- Anjela Gail D. Santos 1
- Tim Santos 1
- Jennifer Santoso 1
- Richardy Lobo Sapan 1
- Hadar Shemtov 1
- Christian Simon 1
- Ayushman Singh 1
- Yueqi Song 1
- Shuo Sun 1
- Supryadi 1
- Lucky Susanto 1
- Muhammad Rizky Sya’ban 1
- Partha Talukdar 1
- Dinesh Tewari 1
- William Tjhi 1
- Filbert Aurelian Tjiaranata 1
- Can Udomcharoenchaikit 1
- Kanyakorn Veerakanjana 1
- Karissa Vincentio 1
- Chengwei Wei 1
- Chenxi Whitehouse 1
- Robert Wijaya 1
- Yan Xu 1
- Zheng Xin Yong 1
- Yanzhi Yu 1
- Eryawan Presma Yulianrifat 1
- Hanif Muhammad Zhafran 1
- Wenyu Zhang 1