Peerat Limkonchotiwat - ACL Anthology

Peerat Limkonchotiwat

2025

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Genta Indra Winata | Frederikus Hudi | Patrick Amadeus Irawan | David Anugraha | Rifki Afina Putri | Wang Yutong | Adam Nohejl | Ubaidillah Ariq Prathama | Nedjma Ousidhoum | Afifa Amriani | Anar Rzayev | Anirban Das | Ashmari Pramodya | Aulia Adila | Bryan Wilie | Candy Olivia Mawalim | Cheng Ching Lam | Daud Abolade | Emmanuele Chersoni | Enrico Santus | Fariz Ikhwantri | Garry Kuwanto | Hanyang Zhao | Haryo Akbarianto Wibowo | Holy Lovenia | Jan Christian Blaise Cruz | Jan Wira Gotama Putra | Junho Myung | Lucky Susanto | Maria Angelica Riera Machin | Marina Zhukova | Michael Anugraha | Muhammad Farid Adilazuarda | Natasha Christabelle Santosa | Peerat Limkonchotiwat | Raj Dabre | Rio Alexander Audino | Samuel Cahyawijaya | Shi-Xiong Zhang | Stephanie Yulia Salim | Yi Zhou | Yinxuan Gui | David Ifeoluwa Adelani | En-Shiun Annie Lee | Shogo Okada | Ayu Purwarianti | Alham Fikri Aji | Taro Watanabe | Derry Tanti Wijaya | Alice Oh | Chong-Wah Ngo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.

Shortcut Learning in Safety: The Impact of Keyword Bias in Safeguards
Panuthep Tasawong | Napat Laosaengpha | Wuttikorn Ponwitayarat | Sitiporn Lim | Potsawee Manakul | Samuel Cahyawijaya | Can Udomcharoenchaikit | Peerat Limkonchotiwat | Ekapol Chuangsuwanich | Sarana Nutanong
Proceedings of the The First Workshop on LLM Security (LLMSEC)

This paper investigates the problem of shortcut learning in safety guardrails for large language models (LLMs). It reveals that current safeguard models often rely excessively on superficial cues, such as specific keywords that are spuriously correlated with training labels, rather than genuinely understanding the input’s semantics or intent. As a result, their performance degrades significantly when there is a shift in keyword distribution. The paper also examines the impact of reducing shortcut reliance, showing that merely minimizing shortcut influence is insufficient. To build robust safeguard models, it is equally crucial to promote the use of intended features.

Language Surgery in Multilingual Large Language Models
Joanito Agili Lopo | Muhammad Ravi Shulthan Habibi | Tack Hwa Wong | Muhammad Ilham Ghozali | Fajri Koto | Genta Indra Winata | Peerat Limkonchotiwat | Alham Fikri Aji | Samuel Cahyawijaya
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)

Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existence of this alignment, analyze its behavior in comparison to explicitly designed alignment models, and demonstrate its potential for language-specific manipulation without semantic degradation. Building on these findings, we propose Inference-Time Language Control (ITLC), a novel method that leverages latent injection to enable precise cross-lingual language control and mitigate language confusion in LLMs. Our experiments highlight ITLC’s strong cross-lingual control capabilities while preserving semantic integrity in target languages. Furthermore, we demonstrate its effectiveness in alleviating the cross-lingual language confusion problem, which persists even in current large-scale LLMs, leading to inconsistent language generation. This work advances our understanding of representation alignment in LLMs and introduces a practical solution for enhancing their monolingual and cross-lingual performance.

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya | Holy Lovenia | Joel Ruben Antony Moniz | Tack Hwa Wong | Mohammad Rifqi Farhansyah | Thant Thiri Maung | Frederikus Hudi | David Anugraha | Muhammad Ravi Shulthan Habibi | Muhammad Reza Qorib | Amit Agarwal | Joseph Marvin Imperial | Hitesh Laxmichand Patel | Vicky Feliren | Bahrul Ilmi Nasution | Manuel Antonio Rufino | Genta Indra Winata | Rian Adam Rajagede | Carlos Rafael Catalan | Mohamed Fazli Mohamed Imam | Priyaranjan Pattnayak | Salsabila Zahirah Pranida | Kevin Pratama | Yeshil Bangera | Adisai Na-Thalang | Patricia Nicole Monderin | Yueqi Song | Christian Simon | Lynnette Hui Xian Ng | Richardy Lobo Sapan | Taki Hasan Rafi | Bin Wang | Supryadi | Kanyakorn Veerakanjana | Piyalitt Ittichaiwong | Matthew Theodore Roque | Karissa Vincentio | Takdanai Kreangphet | Phakphum Artkaew | Kadek Hendrawan Palgunadi | Yanzhi Yu | Rochana Prih Hastuti | William Nixon | Mithil Bangera | Adrian Xuan Wei Lim | Aye Hninn Khine | Hanif Muhammad Zhafran | Teddy Ferdinan | Audra Aurora Izzani | Ayushman Singh | Evan Evan | Jauza Akbar Krito | Michael Anugraha | Fenal Ashokbhai Ilasariya | Haochen Li | John Amadeo Daniswara | Filbert Aurelian Tjiaranata | Eryawan Presma Yulianrifat | Can Udomcharoenchaikit | Fadil Risdian Ansori | Mahardika Krisna Ihsani | Giang Nguyen | Anab Maulana Barik | Dan John Velasco | Rifo Ahmad Genadi | Saptarshi Saha | Chengwei Wei | Isaiah Edri W. Flores | Kenneth Chen Ko Han | Anjela Gail D. Santos | Wan Shen Lim | Kaung Si Phyo | Tim Santos | Meisyarah Dwiastuti | Jiayun Luo | Jan Christian Blaise Cruz | Ming Shan Hee | Ikhlasul Akmal Hanif | M.Alif Al Hakim | Muhammad Rizky Sya’ban | Kun Kerdthaisong | Lester James Validad Miranda | Fajri Koto | Tirana Noor Fatyanosa | Alham Fikri Aji | Jostin Jerico Rosal | Jun Kevin | Robert Wijaya | Onno P. Kampman | Ruochen Zhang | Börje F. Karlsson | Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately ~85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million SEA culturally relevant images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.

WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai
Peerat Limkonchotiwat | Pume Tuchinda | Lalita Lowphansirikul | Surapon Nonesung | Panuthep Tasawong | Alham Fikri Aji | Can Udomcharoenchaikit | Sarana Nutanong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.

Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.

Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments
Patomporn Payoungkhamdee | Pume Tuchinda | Jinheon Baek | Samuel Cahyawijaya | Can Udomcharoenchaikit | Potsawee Manakul | Peerat Limkonchotiwat | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2025

Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.

Proceedings of the 9th Widening NLP Workshop
Chen Zhang | Emily Allaway | Hua Shen | Lesly Miculicich | Yinqiao Li | Meryem M'hamdi | Peerat Limkonchotiwat | Richard He Bai | Santosh T.y.s.s. | Sophia Simeng Han | Surendrabikram Thapa | Wiem Ben Rim
Proceedings of the 9th Widening NLP Workshop

SEA-HELM: Southeast Asian Holistic Evaluation of Language Models
Yosephine Susanto | Adithya Venkatadri Hulagadri | Jann Railey Montalan | Jian Gang Ngui | Xianbin Yong | Wei Qi Leong | Hamsawardhini Rengarajan | Peerat Limkonchotiwat | Yifan Mai | William Chandra Tjhi
Findings of the Association for Computational Linguistics: ACL 2025

With the rapid emergence of novel capabilities in Large Language Models (LLMs), the need for rigorous multilingual and multiculturalbenchmarks that are integrated has become more pronounced. Though existing LLM benchmarks are capable of evaluating specificcapabilities of LLMs in English as well as in various mid- to low-resource languages, including those in the Southeast Asian (SEA)region, a comprehensive and culturally representative evaluation suite for the SEA languages has not been developed thus far.Here, we present SEA-HELM, a holistic linguistic and cultural LLM evaluation suite that emphasises SEA languages, comprisingfive core pillars: (1) NLP CLASSICS, (2) LLM-SPECIFICS, (3) SEA LINGUISTICS, (4) SEA CULTURE, (5) SAFETY. SEA-HELMcurrently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese. We also introduce the SEA-HELM leaderboard, which allows users to understand models’ multilingual and multicultural performance in a systematic and user-friendly manner. We make the SEA-HELM evaluation code publicly available.

Reliable multilingual evaluation is difficult, and culturally appropriate evaluation is even harder to achieve.A common practice to fill this gap is to machine-translate English evaluation sets. However, translation introduces language bias and carries over cultural and regional assumptions from the original questions – often testing knowledge irrelevant to the target audience. In this work, we highlight the extent and impact of these biases and present a multilingual evaluation framework that aims to mitigate them through improved translations and annotation practices.Through a large-scale study involving professional and community translators and annotators, we show that state-of-the-art models excel primarily by learning Western-centric concepts. Notably, we find that model rankings on the full MMLU change when evaluated on a subset of questions explicitly marked as culturally sensitive.We release Global MMLU, a multilingual extension of MMLU across 42 languages, featuring improved translation quality, expanded language coverage, and designated subsets labeled as culturally sensitive and culturally agnostic to enable a more comprehensive and equitable benchmark for evaluating language models across diverse linguistic and cultural contexts.

2024

Space Decomposition for Sentence Embedding
Wuttikorn Ponwitayarat | Peerat Limkonchotiwat | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2024

Determining sentence pair similarity is crucial for various NLP tasks. A common technique to address this is typically evaluated on a continuous semantic textual similarity scale from 0 to 5. However, based on a linguistic observation in STS annotation guidelines, we found that the score in the range [4,5] indicates an upper-range sample, while the rest are lower-range samples. This necessitates a new approach to treating the upper-range and lower-range classes separately. In this paper, we introduce a novel embedding space decomposition method called MixSP utilizing a Mixture of Specialized Projectors, designed to distinguish and rank upper-range and lower-range samples accurately. The experimental results demonstrate that MixSP decreased the overlap representation between upper-range and lower-range classes significantly while outperforming competitors on STS and zero-shot benchmarks.

Identifying and Mitigating Annotation Bias in Natural Language Understanding using Causal Mediation Analysis
Sitiporn Sae Lim | Can Udomcharoenchaikit | Peerat Limkonchotiwat | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2024

NLU models have achieved promising results on standard benchmarks. Despite state-of-the-art accuracy, analysis reveals that many models make predictions using annotation bias rather than the properties we intend the model to learn. Consequently, these models perform poorly on out-of-distribution datasets. Recent advances in bias mitigation show that annotation bias can be alleviated through fine-tuning debiasing objectives. In this paper, we apply causal mediation analysis to gauge how much each model component mediates annotation biases. Using the knowledge from the causal analysis, we improve the model’s robustness against annotation bias through two bias mitigation methods: causal-grounded masking and gradient unlearning. Causal analysis reveals that biases concentrated in specific components, even after employing other training-time debiasing techniques. Manipulating these components by masking out neurons’ activations or updating specific weight blocks both demonstrably improve robustness against annotation artifacts.

An Empirical Study of Multilingual Reasoning Distillation for Question Answering
Patomporn Payoungkhamdee | Peerat Limkonchotiwat | Jinheon Baek | Potsawee Manakul | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Reasoning is one crucial capability in Large Language Models (LLMs), allowing them to perform complex tasks such as solving math problems and multi-step planning. While reasoning capability can emerge in larger models, smaller ones usually have to rely on distillation to transfer this capability from a larger model. However, recent efforts to distill reasoning capabilities have focused mainly on English, leaving multilingual distillation underexplored. To address this gap, this paper examines existing English reasoning distillation methods that utilize a variety of positive rationales in multilingual settings and proposes d-CoT-nR, a novel approach that incorporates incorrect rationales as additional guidance. Empirical results from multilingual high-school examinations show that d-CoT-nR significantly surpasses the baseline, improving accuracy in unseen languages and correctness in step-by-step reasoning.

McCrolin: Multi-consistency Cross-lingual Training for Retrieval Question Answering
Peerat Limkonchotiwat | Wuttikorn Ponwitayarat | Lalita Lowphansirikul | Potsawee Manakul | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: EMNLP 2024

Automated question answering (QA) systems are increasingly relying on robust cross-lingual retrieval to identify and utilize information from multilingual sources, ensuring comprehensive and contextually accurate responses. Existing approaches often struggle with consistency across multiple languages and multi-size input scenarios. To address these challenges, we propose McCrolin, a Multi-consistency Cross-lingual training framework, leveraging multi-task learning to enhance cross-lingual consistency, ranking stability, and input-size robustness. Experimental results demonstrate that McCrolin achieves state-of-the-art performance on standard cross-lingual retrieval QA datasets. Furthermore, McCrolin outperforms competitors when dealing with various input sizes on downstream tasks. In terms of generalizability, results from further analysis show that our method is effective for various encoder architectures and sizes.

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia | Rahmad Mahendra | Salsabil Maulana Akbar | Lester James V. Miranda | Jennifer Santoso | Elyanah Aco | Akhdan Fadhilah | Jonibek Mansurov | Joseph Marvin Imperial | Onno P. Kampman | Joel Ruben Antony Moniz | Muhammad Ravi Shulthan Habibi | Frederikus Hudi | Railey Montalan | Ryan Ignatius | Joanito Agili Lopo | William Nixon | Börje F. Karlsson | James Jaya | Ryandito Diandaru | Yuze Gao | Patrick Amadeus | Bin Wang | Jan Christian Blaise Cruz | Chenxi Whitehouse | Ivan Halim Parmonangan | Maria Khelli | Wenyu Zhang | Lucky Susanto | Reynard Adha Ryanda | Sonny Lazuardi Hermawan | Dan John Velasco | Muhammad Dehan Al Kautsar | Willy Fitra Hendria | Yasmin Moslem | Noah Flynn | Muhammad Farid Adilazuarda | Haochen Li | Johanes Lee | R. Damanhuri | Shuo Sun | Muhammad Reza Qorib | Amirbek Djanibekov | Wei Qi Leong | Quyet V. Do | Niklas Muennighoff | Tanrada Pansuwan | Ilham Firdausi Putra | Yan Xu | Tai Ngee Chia | Ayu Purwarianti | Sebastian Ruder | William Tjhi | Peerat Limkonchotiwat | Alham Fikri Aji | Sedrick Keh | Genta Indra Winata | Ruochen Zhang | Fajri Koto | Zheng-Xin Yong | Samuel Cahyawijaya
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, through a collaborative movement, we introduce SEACrowd, a comprehensive resource center that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in Southeast Asia.

Efficient Overshadowed Entity Disambiguation by Mitigating Shortcut Learning
Panuthep Tasawong | Peerat Limkonchotiwat | Potsawee Manakul | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Entity disambiguation (ED) is crucial in natural language processing (NLP) for tasks such as question-answering and information extraction. A major challenge in ED is handling overshadowed entities—uncommon entities sharing mention surfaces with common entities. The current approach to enhance performance on these entities involves reasoning over facts in a knowledge base (KB), increasing computational overhead during inference. We argue that the ED performance on overshadowed entities can be enhanced during training by addressing shortcut learning, which does not add computational overhead at inference. We propose a simple yet effective debiasing technique to prevent models from shortcut learning during training. Experiments on a range of ED datasets show that our method achieves state-of-the-art performance without compromising inference speed. Our findings suggest a new research direction for improving entity disambiguation via shortcut learning mitigation.

SEA-VQA: Southeast Asian Cultural Context Dataset For Visual Question Answering
Norawit Urailertprasert | Peerat Limkonchotiwat | Supasorn Suwajanakorn | Sarana Nutanong
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)

Visual Question Answering (VQA) is a critical task that requires the simultaneous understanding of visual and textual information. While significant advancements have been made with multilingual datasets, these often lack cultural specificity, especially in the context of Southeast Asia (SEA). In this paper, we introduce SEA-VQA aiming to highlight the challenges and gaps in existing VQA models when confronted with culturally specific content. Our dataset includes images from eight SEA countries, curated from the UNESCO Cultural Heritage collection. Our evaluation, comparing GPT-4 and GEMINI models, demonstrates substantial performance drops on culture-centric questions compared to the A-OKVQA dataset, a commonsense and world-knowledge VQA benchmark comprising approximately 25,000 questions. Our findings underscore the importance of cultural diversity in VQA datasets and reveal substantial gaps in the ability of current VQA models to handle culturally rich contexts. SEA-VQA serves as a crucial benchmark for identifying these gaps and guiding future improvements in VQA systems.

Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai
Parinthapat Pengpun | Can Udomcharoenchaikit | Weerayut Buaphet | Peerat Limkonchotiwat
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at https://github.com/parinzee/seed-free-synthetic-instruct.

On Creating an English-Thai Code-switched Machine Translation in Medical Domain
Parinthapat Pengpun | Krittamate Tiankanon | Amrest Chinkamol | Jiramet Kinchagawat | Pitchaya Chairuengjitjaras | Pasit Supholkhan | Pubordee Aussavavirojekul | Chiraphat Boonnag | Kanyakorn Veerakanjana | Hirunkul Phimsiri | Boonthicha Sae-jia | Nattawach Sataudom | Piyalitt Ittichaiwong | Peerat Limkonchotiwat
Findings of the Association for Computational Linguistics: EMNLP 2024

Machine translation (MT) in the medical domain plays a pivotal role in enhancing healthcare quality and disseminating medical knowledge. Despite advancements in English-Thai MT technology, common MT approaches often underperform in the medical field due to their inability to precisely translate medical terminologies. Our research prioritizes not merely improving translation accuracy but also maintaining medical terminology in English within the translated text through code-switched (CS) translation. We developed a method to produce CS medical translation data, fine-tuned a CS translation model with this data, and evaluated its performance against strong baselines, such as Google Neural Machine Translation (NMT) and GPT-3.5/GPT-4. Our model demonstrated competitive performance in automatic metrics and was highly favored in human preference evaluations. Our evaluation result also shows that medical professionals significantly prefer CS translations that maintain critical English terms accurately, even if it slightly compromises fluency. Our code and test set are publicly available https://github.com/preceptorai-org/NLLB_CS_EM_NLP2024.

CHIE: Generative MRC Evaluation for in-context QA with Correctness, Helpfulness, Irrelevancy, and Extraneousness Aspects
Wannaphong Phatthiyaphaibun | Surapon Nonesung | Peerat Limkonchotiwat | Can Udomcharoenchaikit | Jitkapat Sawatphol | Ekapol Chuangsuwanich | Sarana Nutanong
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP

The evaluation of generative models in Machine Reading Comprehension (MRC) presents distinct difficulties, as traditional metrics like BLEU, ROUGE, METEOR, Exact Match, and F1 score often struggle to capture the nuanced and diverse responses. While embedding-based metrics such as BERTScore and BARTScore focus on semantic similarity, they still fail to fully address aspects such as recognizing additional helpful information and rewarding contextual faithfulness. Recent advances in large language model (LLM) based metrics offer more fine-grained evaluations, but challenges such as score clustering remain. This paper introduces a multi-aspect evaluation framework, CHIE,incorporating aspects of Correctness, Helpfulness, Irrelevance, and Extraneousness. Our approach, which uses binary categorical values rather than continuous rating scales, aligns well with human judgments, indicating its potential as a comprehensive and effective evaluation method.

2023

Typo-Robust Representation Learning for Dense Retrieval
Panuthep Tasawong | Wuttikorn Ponwitayarat | Peerat Limkonchotiwat | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Dense retrieval is a basic building block of information retrieval applications. One of the main challenges of dense retrieval in real-world settings is the handling of queries containing misspelled words. A popular approach for handling misspelled queries is minimizing the representations discrepancy between misspelled queries and their pristine ones. Unlike the existing approaches, which only focus on the alignment between misspelled and pristine queries, our method also improves the contrast between each misspelled query and its surrounding queries. To assess the effectiveness of our proposed method, we compare it against the existing competitors using two benchmark datasets and two base encoders. Our method outperforms the competitors in all cases with misspelled queries. Our code and models are available at https://github.com/panuthept/DST-DenseRetrieval.

PyThaiNLP: Thai Natural Language Processing in Python
Wannaphong Phatthiyaphaibun | Korakot Chaovavanich | Charin Polpanumas | Arthit Suriyawongkul | Lalita Lowphansirikul | Pattarawat Chormai | Peerat Limkonchotiwat | Thanathip Suntorntip | Can Udomcharoenchaikit
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

We present PyThaiNLP, a free and open-source natural language processing (NLP) library for Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai language. We first provide a brief historical context of tools for Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provided as well as datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.

mReFinED: An Efficient End-to-End Multilingual Entity Linking System
Peerat Limkonchotiwat | Weiwei Cheng | Christos Christodoulopoulos | Amir Saffari | Jens Lehmann
Findings of the Association for Computational Linguistics: EMNLP 2023

End-to-end multilingual entity linking (MEL) is concerned with identifying multilingual entity mentions and their corresponding entity IDs in a knowledge base. Existing works assumed that entity mentions were given and skipped the entity mention detection step due to a lack of high-quality multilingual training corpora. To overcome this limitation, we propose mReFinED, the first end-to-end multilingual entity linking. Additionally, we propose a bootstrapping mention detection framework that enhances the quality of training corpora. Our experimental results demonstrated that mReFinED outperformed the best existing work in the end-to-end MEL task while being 44 times faster.

An Efficient Self-Supervised Cross-View Training For Sentence Embedding
Peerat Limkonchotiwat | Wuttikorn Ponwitayarat | Lalita Lowphansirikul | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Transactions of the Association for Computational Linguistics, Volume 11

Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrades as the number of parameters decreases. In this paper, we propose a framework called Self-supervised Cross-View Training (SCT) to narrow the performance gap between large and small PLMs. To evaluate the effectiveness of SCT, we compare it to 5 baseline and state-of-the-art competitors on seven Semantic Textual Similarity (STS) benchmarks using 5 PLMs with the number of parameters ranging from 4M to 340M. The experimental results show that STC outperforms the competitors for PLMs with less than 100M parameters in 18 of 21 cases.1

SEA-LION (Southeast Asian Languages In One Network): A Family of Southeast Asian Language Models
David Ong | Peerat Limkonchotiwat
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

Cross-Lingual Data Augmentation For Thai Question-Answering
Parinthapat Pengpun | Can Udomcharoenchaikit | Weerayut Buaphet | Peerat Limkonchotiwat
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP

This paper presents an innovative data augmentation framework with data quality control designed to enhance the robustness of Question Answering (QA) models in low-resource languages, particularly Thai. Recognizing the challenges posed by the scarcity and quality of training data, we leverage data augmentation techniques in both monolingual and cross-lingual settings. Our approach augments and enriches the original dataset, thereby increasing its linguistic diversity and robustness. We evaluate the robustness of our framework on Machine Reading Comprehension, and the experimental results illustrate the potential of data augmentation to effectively increase training data and improve model generalization in low-resource language settings, offering a promising direction for the data augmentation manner.

2022

ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation
Peerat Limkonchotiwat | Wuttikorn Ponwitayarat | Lalita Lowphansirikul | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: EMNLP 2022

Sentence representations are essential in many NLP tasks operating at the sentence level.Recently, research attention has shifted towards learning how to represent sentences without any annotations, i.e., unsupervised representation learning. Despite the benefit of training without supervised data, there is still a performance penalty compared to supervised methods.Furthermore, the supervised-unsupervised performance gap widens as we reduce the model size. In this paper, we propose an unsupervised sentence representation method to reduce the supervised-unsupervised performance gap, especially for smaller models. Utilizing the concept for knowledge distillation, we derive a distillation framework comprising two training objectives, control and generalize, called ConGen. Experiments on semantic textual similarity (STS), text classification (transfer), and natural language inference (NLI) tasks show that ConGen is on par with supervised training even on smaller models.Furthermore, our method consistently outperformed competitors on multilingual STS.The code and models are available at https://github.com/KornWtp/ConGen.

Thai Nested Named Entity Recognition Corpus
Weerayut Buaphet | Can Udomcharoenchaikit | Peerat Limkonchotiwat | Attapol Rutherford | Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2022

This paper presents the first Thai Nested Named Entity Recognition (N-NER) dataset. Thai N-NER consists of 264,798 mentions, 104 classes, and a maximum depth of 8 layers obtained from 4,894 documents in the domains of news articles and restaurant reviews. Our work, to the best of our knowledge, presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes. To understand the new challenges our proposed dataset brings to the field, we conduct an experimental study on (i) cutting edge N-NER models with the state-of-the-art accuracy in English and (ii) baseline methods based on well-known language model architectures. From the experimental results, we obtained two key findings. First, all models produced poor F1 scores in the tail region of the class distribution. There is little or no performance improvement provided by these models with respect to the baseline methods with our Thai dataset. These findings suggest that further investigation is required to make a multilingual N-NER solution that works well across different languages.

CL-ReLKT: Cross-lingual Language Knowledge Transfer for Multilingual Retrieval Question Answering
Peerat Limkonchotiwat | Wuttikorn Ponwitayarat | Can Udomcharoenchaikit | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: NAACL 2022

Cross-Lingual Retrieval Question Answering (CL-ReQA) is concerned with retrieving answer documents or passages to a question written in a different language. A common approach to CL-ReQA is to create a multilingual sentence embedding space such that question-answer pairs across different languages are close to each other. In this paper, we propose a novel CL-ReQA method utilizing the concept of language knowledge transfer and a new cross-lingual consistency training technique to create a multilingual embedding space for ReQA. To assess the effectiveness of our work, we conducted comprehensive experiments on CL-ReQA and a downstream task, machine reading QA. We compared our proposed method with the current state-of-the-art solutions across three public CL-ReQA corpora. Our method outperforms competitors in 19 out of 21 settings of CL-ReQA. When used with a downstream machine reading QA task, our method outperforms the best existing language-model-based method by 10% in F1 while being 10 times faster in sentence embedding computation. The code and models are available at https://github.com/mrpeerat/CL-ReLKT.

2021

Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation
Peerat Limkonchotiwat | Wannaphong Phatthiyaphaibun | Raheem Sarwar | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Robust Fragment-Based Framework for Cross-lingual Sentence Retrieval
Nattapol Trijakwanich | Peerat Limkonchotiwat | Raheem Sarwar | Wannaphong Phatthiyaphaibun | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: EMNLP 2021

Cross-lingual Sentence Retrieval (CLSR) aims at retrieving parallel sentence pairs that are translations of each other from a multilingual set of comparable documents. The retrieved parallel sentence pairs can be used in other downstream NLP tasks such as machine translation and cross-lingual word sense disambiguation. We propose a CLSR framework called Robust Fragment-level Representation (RFR) CLSR framework to address Out-of-Domain (OOD) CLSR problems. In particular, we improve the sentence retrieval robustness by representing each sentence as a collection of fragments. In this way, we change the retrieval granularity from the sentence to the fragment level. We performed CLSR experiments based on three OOD datasets, four language pairs, and three base well-known sentence encoders: m-USE, LASER, and LaBSE. Experimental results show that RFR significantly improves the base encoders’ performance for more than 85% of the cases.

2020

Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble
Peerat Limkonchotiwat | Wannaphong Phatthiyaphaibun | Raheem Sarwar | Ekapol Chuangsuwanich | Sarana Nutanong
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Like many Natural Language Processing tasks, Thai word segmentation is domain-dependent. Researchers have been relying on transfer learning to adapt an existing model to a new domain. However, this approach is inapplicable to cases where we can interact with only input and output layers of the models, also known as “black boxes”. We propose a filter-and-refine solution based on the stacked-ensemble learning paradigm to address this black-box limitation. We conducted extensive experimental studies comparing our method against state-of-the-art models and transfer learning. Experimental results show that our proposed solution is an effective domain adaptation method and has a similar performance as the transfer learning method.

Co-authors

Alham Fikri Aji 5

Lalita Lowphansirikul 5

Potsawee Manakul 5

Wannaphong Phatthiyaphaibun 5

Panuthep Tasawong 4

Genta Indra Winata 4

Weerayut Buaphet 3

Jan Christian Blaise Cruz 3

Muhammad Ravi Shulthan Habibi 3

Frederikus Hudi 3

Jann Railey Montalan 3

Jian Gang Ngui 3

Parinthapat Pengpun 3

Raheem Sarwar 3

Yosephine Susanto 3

David Ifeoluwa Adelani 2

Muhammad Farid Adilazuarda 2

David Anugraha 2

Michael Anugraha 2

Tai Ngee Chia 2

Adithya Venkatadri Hulagadri 2

Joseph Marvin Imperial 2

Piyalitt Ittichaiwong 2

Onno P. Kampman 2

Börje F. Karlsson 2

Joanito Agili Lopo 2

Lester James Validad Miranda 2

Joel Ruben Antony Moniz 2

William Nixon 2

Surapon Nonesung 2

Patomporn Payoungkhamdee 2

Ayu Purwarianti 2

Muhammad Reza Qorib 2

Hamsawardhini Rengarajan 2

Sebastian Ruder 2

Lucky Susanto 2

William Chandra Tjhi 2

Pume Tuchinda 2

Kanyakorn Veerakanjana 2

Dan John Velasco 2

Tack Hwa Wong 2

Ruochen Zhang 2

Salsabil Maulana Akbar 1

Muhammad Dehan Al Kautsar 1

Emily Allaway 1

Patrick Amadeus 1

Afifa Amriani 1

Fadil Risdian Ansori 1

Sajeban Antonyrex 1

Phakphum Artkaew 1

Rio Alexander Audino 1

Pubordee Aussavavirojekul 1

Richard He Bai 1

Yeshil Bangera 1

Mithil Bangera 1

Anab Maulana Barik 1

Chiraphat Boonnag 1

Antoine Bosselut 1

Carlos Rafael Catalan 1

Pitchaya Chairuengjitjaras 1

Korakot Chaovavanich 1

Nicholas Cheng 1

Emmanuele Chersoni 1

Amrest Chinkamol 1

Pattarawat Chormai 1

Leshem Choshen 1

Christos Christodoulopoulos 1

John Amadeo Daniswara 1

Ryandito Diandaru 1

Amirbek Djanibekov 1

Meisyarah Dwiastuti 1

Marzieh Fadaee 1

Akhdan Fadhilah 1

Mohammad Rifqi Farhansyah 1

Tirana Noor Fatyanosa 1

Vicky Feliren 1

Teddy Ferdinan 1

Enzo Ferrante 1

Isaiah Edri W. Flores 1

Clémentine Fourrier 1

Rifo Ahmad Genadi 1

Muhammad Ilham Ghozali 1

M.Alif Al Hakim 1

Kenneth Chen Ko Han 1

Sophia Simeng Han 1

Ikhlasul Akmal Hanif 1

Rochana Prih Hastuti 1

Ming Shan Hee 1

Willy Fitra Hendria 1

Sonny Lazuardi Hermawan 1

Ryan Ignatius 1

Mahardika Krisna Ihsani 1

Fariz Ikhwantri 1

Fenal Ashokbhai Ilasariya 1

Mohamed Fazli Mohamed Imam 1

Daphne Ippolito 1

Patrick Amadeus Irawan 1

Audra Aurora Izzani 1

Kun Kerdthaisong 1

Aye Hninn Khine 1

Jiramet Kinchagawat 1

Takdanai Kreangphet 1

Jauza Akbar Krito 1

Garry Kuwanto 1

Cheng Ching Lam 1

Napat Laosaengpha 1

En-Shiun Annie Lee 1

Adrian Xuan Wei Lim 1

Bing Jie Darius Liu 1

Shayne Longpre 1

Rahmad Mahendra 1

Jonibek Mansurov 1

Kelly Marchisio 1

André F. T. Martins 1

Thant Thiri Maung 1

Candy Olivia Mawalim 1

Tan Choon Meng 1

Lesly Miculicich Werlen 1

Patricia Nicole Monderin 1

Yasmin Moslem 1

Niklas Muennighoff 1

Meryem M’hamdi 1

Adisai Na-Thalang 1

Bahrul Ilmi Nasution 1

Lynnette Hui Xian Ng 1

Chong-Wah Ngo 1

Thanh Ngan Nguyen 1

Nedjma Ousidhoum 1

Kadek Hendrawan Palgunadi 1

Tanrada Pansuwan 1

Ivan Halim Parmonangan 1

Hitesh Laxmichand Patel 1

Priyaranjan Pattnayak 1

Hirunkul Phimsiri 1

Kaung Si Phyo 1

Charin Polpanumas 1

Ashmari Pramodya 1

Salsabila Zahirah Pranida 1

Kevin Pratama 1

Ubaidillah Ariq Prathama 1

Jan Wira Gotama Putra 1

Ilham Firdausi Putra 1

Rifki Afina Putri 1

Taki Hasan Rafi 1

Rian Adam Rajagede 1

Maria Angelica Riera Machin 1

Angelika Romanou 1

Matthew Theodore Roque 1

Jostin Jerico Rosal 1

Manuel Antonio Rufino 1

Attapol Rutherford 1

Reynard Adha Ryanda 1

Sitiporn Sae Lim 1

Boonthicha Sae-jia 1

Saptarshi Saha 1

Stephanie Yulia Salim 1

Anjela Gail D. Santos 1

Natasha Christabelle Santosa 1

Jennifer Santoso 1

Enrico Santus 1

Richardy Lobo Sapan 1

Nattawach Sataudom 1

Jitkapat Sawatphol 1

Christian Simon 1

Ayushman Singh 1

Shivalika Singh 1

Thanathip Suntorntip 1

Pasit Supholkhan 1

Arthit Suriyawongkul 1

Supasorn Suwajanakorn 1

Muhammad Rizky Sya’ban 1

Santosh T.Y.S.S 1

David Ong Tat-Wee 1

Surendrabikram Thapa 1

Krittamate Tiankanon 1

Filbert Aurelian Tjiaranata 1

Yeo Yeow Tong 1

Nattapol Trijakwanich 1

Norawit Urailertprasert 1

Daniel Vila-Suero 1

Karissa Vincentio 1

Taro Watanabe 1

Chenxi Whitehouse 1

Haryo Akbarianto Wibowo 1

Derry Tanti Wijaya 1

Robert Wijaya 1

Zheng Xin Yong 1

Eryawan Presma Yulianrifat 1

Hanif Muhammad Zhafran 1

Shi-Xiong Zhang 1

Marina Zhukova 1

Venues