Proceedings of the First Workshop on Language Models for Low-Resource Languages

Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage (Editors)


Anthology ID:
2025.loreslm-1
Month:
January
Year:
2025
Address:
Abu Dhabi, United Arab Emirates
Venues:
LoResLM | WS
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2025.loreslm-1/
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
https://aclanthology.org/2025.loreslm-1.pdf

pdf bib
Proceedings of the First Workshop on Language Models for Low-Resource Languages
Hansi Hettiarachchi | Tharindu Ranasinghe | Paul Rayson | Ruslan Mitkov | Mohamed Gaber | Damith Premasiri | Fiona Anting Tan | Lasitha Uyangodage

pdf bib
Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)
Hansi Hettiarachchi | Tharindu Ranasinghe | Paul Rayson | Ruslan Mitkov | Mohamed Gaber | Damith Premasiri | Fiona Anting Tan | Lasitha Randunu Chandrakantha Uyangodage

The first Workshop on Language Models for Low-Resource Languages (LoResLM 2025) was held in conjunction with the 31st International Conference on Computational Linguistics (COLING 2025) in Abu Dhabi, United Arab Emirates. This workshop mainly aimed to provide a forum for researchers to share and discuss their ongoing work on language models (LMs) focusing on low-resource languages, following the recent advancements in neural language models and their linguistic biases towards high-resource languages. LoResLM 2025 attracted notable interest from the natural language processing (NLP) community, resulting in 35 accepted papers from 52 submissions. These contributions cover a broad range of low-resource languages from eight language families and 13 diverse research areas, paving the way for future possibilities and promoting linguistic inclusivity in NLP.

pdf bib
Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Guokan Shang | Hadi Abdine | Yousef Khoubrane | Amr Mohamed | Yassine Abbahaddou | Sofiane Ennadir | Imane Momayiz | Xuguang Ren | Eric Moulines | Preslav Nakov | Michalis Vazirgiannis | Eric Xing

We introduce Atlas-Chat, the first-ever collection of LLMs specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., our 9B model gains a 13% performance boost over a larger 13B model on DarijaMMLU, in our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource languages, which are often neglected in favor of data-rich languages by contemporary LLMs.

pdf bib
Empowering Persian LLMs for Instruction Following: A Novel Dataset and Training Approach
Hojjat Mokhtarabadi | Ziba Zamani | Abbas Maazallahi | Mohammad Hossein Manshaei

Instruction-tuned large language models have demonstrated remarkable capabilities in following human instructions across various domains. However, their proficiency remains notably deficient in many low-resource languages. To address this challenge, we begin by introducing FarsInstruct: a comprehensive instruction dataset designed to enhance the instruction-following ability of large language models specifically for the Persian language—a significant yet underrepresented language globally. FarsInstruct encompasses a wide range of task types and datasets, each containing a mix of straightforward to complex manual written instructions, as well as translations from the Public Pool of Prompts, ensuring a rich linguistic and cultural representation. Furthermore, we introduce Co-CoLA, a framework designed to enhance the multi-task adaptability of LoRA-tuned models. Through extensive experimental analyses, our study showcases the effectiveness of the FarsInstruct dataset coupled with training by the Co-CoLA framework, in improving the performance of large language models within the Persian context. As of the current writing, FarsInstruct comprises 197 templates across 21 distinct datasets, and we intend to update it consistently, thus augmenting its applicability.

pdf bib
BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis
Sadia Alam | Md Farhan Ishmam | Navid Hasin Alvee | Md Shahnewaz Siddique | Md Azam Hossain | Abu Raihan Mostofa Kamal

The widespread availability of code-mixed data in digital spaces can provide valuable insights into low-resource languages like Bengali, which have limited annotated corpora. Sentiment analysis, a pivotal text classification task, has been explored across multiple languages, yet code-mixed Bengali remains underrepresented with no large-scale, diverse benchmark. Code-mixed text is particularly challenging as it requires the understanding of multiple languages and their interaction in the same text. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali comprising 20,000 samples with 4 sentiment labels, sourced from Facebook, YouTube, and e-commerce sites. By aggregating multiple sources, we ensure linguistic diversity reflecting realistic code-mixed scenarios. We implement a novel automated text filtering pipeline using fine-tuned language models to detect code-mixed samples and expand code-mixed text corpora. We further propose baselines using machine learning, neural networks, and transformer-based language models. The availability of a diverse dataset is a critical step towards democratizing NLP and ultimately contributing to a better understanding of code-mixed languages.

pdf bib
Using Language Models for assessment of users’ satisfaction with their partner in Persian
Zahra Habibzadeh | Masoud Asadpour

Sentiment analysis, the process of gauging user attitudes and emotions through their textual data, including social media posts and other forms of communication, is a valuable tool for informed decision-making. In other words, a statement conveys positivity, negativity, or neutrality, sentiment analysis offers insights into public sentiment regarding a product, individual, event, or other significant topics. This research focuses on the effectiveness of sentiment analysis techniques, using Machine Learning (ML) and Natural Language Processing (NLP) especially pre-trained language models for Persian, in assessing users’ satisfaction with their partner, using data collected from X (formerly Twitter). Our motivation stems from traditional in-person surveys, which periodically analyze societal challenges in Iran. The limitations of these surveys led us to explore Artificial Intelligence (AI) as an alternative solution for addressing contemporary social issues. We collected Persian tweets and utilized data annotation techniques to label them according to our research question, forming the dataset. Our goal also was to provide a benchmark of Persian tweets on this specific topic. To evaluate our dataset, we employed several classification methods to achieve our goal, including classical ML models, Deep Neural Networks, and pre-trained language models for Persian. Following a comprehensive evaluation, our results show that BERTweet-FA (one of the pre-trained language models for Persian) emerged as the best performer among the classifiers for assessing users’ satisfaction. This point indicates the ability of language models to understand conversational Persian text and perform sentiment analysis, even in a low-resource language like Persian.

pdf bib
Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of TF-IDF and BERT Embeddings for Low-Resource Language Processing
Atharva Mutsaddi | Aditya Prashant Choudhary

Plagiarism involves using another person’s work or concepts without proper attribution, presenting them as original creations. With the growing amount of data communicated in regional languages such as Marathi—one of India’s regional languages—it is crucial to design robust plagiarism detection systems tailored for low-resource languages. Language models like Bidirectional Encoder Representations from Transformers (BERT) have demonstrated exceptional capability in text representation and feature extraction, making them essential tools for semantic analysis and plagiarism detection. However, the application of BERT for low-resource languages remains underexplored, particularly in the context of plagiarism detection. This paper presents a method to enhance the accuracy of plagiarism detection for Marathi texts using BERT sentence embeddings in conjunction with Term Frequency-Inverse Document Frequency (TF-IDF) feature representation. By combining TF-IDF with BERT, the system’s performance is significantly improved, which is especially pronounced in languages where BERT models are not extremely robust due to a lack of resources and corpora. This approach effectively captures statistical, semantic, and syntactic aspects of text features through a weighted voting ensemble of machine learning models.

pdf bib
Investigating the Impact of Language-Adaptive Fine-Tuning on Sentiment Analysis in Hausa Language Using AfriBERTa
Sani Abdullahi Sani | Shamsuddeen Hassan Muhammad | Devon Jarvis

Sentiment analysis (SA) plays a vital role in Natural Language Processing (NLP) by identifying sentiments expressed in text. Although significant advances have been made in SA for widely spoken languages, low-resource languages such as Hausa face unique challenges, primarily due to a lack of digital resources. This study investigates the effectiveness of Language-Adaptive Fine-Tuning (LAFT) to improve SA performance in Hausa. We first curate a diverse, unlabeled corpus to expand the model’s linguistic capabilities, followed by applying LAFT to adapt AfriBERTa specifically to the nuances of the Hausa language. The adapted model is then fine-tuned on the labeled NaijaSenti sentiment dataset to evaluate its performance. Our findings demonstrate that LAFT gives modest improvements, which may be attributed to the use of formal Hausa text rather than informal social media data. Nevertheless, the pre-trained AfriBERTa model significantly outperformed models not specifically trained on Hausa, highlighting the importance of using pre-trained models in low-resource contexts. This research emphasizes the necessity for diverse data sources to advance NLP applications for low-resource African languages. We will publish the code and the data set to encourage further research and facilitate reproducibility in low-resource NLP

pdf bib
Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language
Anastasia Zhukova | Christian E. Matt | Bela Gipp

Domain-specific languages that use a lot of specific terminology often fall into the category of low-resource languages. Collecting test datasets in a narrow domain is time-consuming and requires skilled human resources with domain knowledge and training for the annotation task. This study addresses the challenge of automated collecting test datasets to evaluate semantic search in low-resource domain-specific German language of the process industry. Our approach proposes an end-to-end annotation pipeline for automated query generation to the score reassessment of query-document pairs. To overcome the lack of text encoders trained in the German chemistry domain, we explore a principle of an ensemble of “weak” text encoders trained on common knowledge datasets. We combine individual relevance scores from diverse models to retrieve document candidates and relevance scores generated by an LLM, aiming to achieve consensus on query-document alignment. Evaluation results demonstrate that the ensemble method significantly improves alignment with human-assigned relevance scores, outperforming individual models in both inter-coder agreement and accuracy metrics. These findings suggest that ensemble learning can effectively adapt semantic search systems for specialized, low-resource languages, offering a practical solution to resource limitations in domain-specific contexts.

pdf bib
Filipino Benchmarks for Measuring Sexist and Homophobic Bias in Multilingual Language Models from Southeast Asia
Lance Calvin Lim Gamboa | Mark Lee

Bias studies on multilingual models confirm the presence of gender-related stereotypes in masked models processing languages with high NLP resources. We expand on this line of research by introducing Filipino CrowS-Pairs and Filipino WinoQueer: benchmarks that assess both sexist and anti-queer biases in pretrained language models (PLMs) handling texts in Filipino, a low-resource language from the Philippines. The benchmarks consist of 7,074 new challenge pairs resulting from our cultural adaptation of English bias evaluation datasets—a process that we document in detail to guide similar forthcoming efforts. We apply the Filipino benchmarks on masked and causal multilingual models, including those pretrained on Southeast Asian data, and find that they contain considerable amounts of bias. We also find that for multilingual models, the extent of bias learned for a particular language is influenced by how much pretraining data in that language a model was exposed to. Our benchmarks and insights can serve as a foundation for future work analyzing and mitigating bias in multilingual models.

pdf bib
Exploiting Word Sense Disambiguation in Large Language Models for Machine Translation
Van-Hien Tran | Raj Dabre | Hour Kaing | Haiyue Song | Hideki Tanaka | Masao Utiyama

Machine Translation (MT) has made great strides with the use of Large Language Models (LLMs) and advanced prompting techniques. However, translating sentences with ambiguous words remains challenging, especially when LLMs have limited proficiency in the source language. This paper introduces two methods to enhance MT performance by leveraging the word sense disambiguation capabilities of LLMs. The first method integrates all the available senses of an ambiguous word into the prompting template. The second method uses a pre-trained source language model to predict the correct sense of the ambiguous word, which is then incorporated into the prompting template. Additionally, we propose two prompting template styles for providing word sense information to LLMs. Experiments on the HOLLY dataset demonstrate the effectiveness of our approach in improving MT performance.

pdf bib
Low-Resource Interlinear Translation: Morphology-Enhanced Neural Models for Ancient Greek
Maciej Rapacz | Aleksander Smywiński-Pohl

Contemporary machine translation systems prioritize fluent, natural-sounding output with flexible word ordering. In contrast, interlinear translation maintains the source text’s syntactic structure by aligning target language words directly beneath their source counterparts. Despite its importance in classical scholarship, automated approaches to interlinear translation remain understudied. We evaluated neural interlinear translation from Ancient Greek to English and Polish using four transformer-based models: two Ancient Greek-specialized (GreTa and PhilTa) and two general-purpose multilingual models (mT5-base and mT5-large). Our approach introduces novel morphological embedding layers and evaluates text preprocessing and tag set selection across 144 experimental configurations using a word-aligned parallel corpus of the Greek New Testament. Results show that morphological features through dedicated embedding layers significantly enhance translation quality, improving BLEU scores by 35% (44.67 → 60.40) for English and 38% (42.92 → 59.33) for Polish compared to baseline models. PhilTa achieves state-of-the-art performance for English, while mT5-large does so for Polish. Notably, PhilTa maintains stable performance using only 10% of training data. Our findings challenge the assumption that modern neural architectures cannot benefit from explicit morphological annotations. While preprocessing strategies and tag set selection show minimal impact, the substantial gains from morphological embeddings demonstrate their value in low-resource scenarios.

pdf bib
Language verY Rare for All
Ibrahim Merad | Amos Wolf | Ziad Mazzawi | Yannick Léo

In the quest to overcome language barriers, encoder-decoder models like NLLB have expanded machine translation to rare languages, with some models (e.g., NLLB 1.3B) even trainable on a single GPU. While general-purpose LLMs perform well in translation, open LLMs prove highly competitive when fine-tuned for specific tasks involving unknown corpora. We introduce LYRA (Language verY Rare for All), a novel approach that combines open LLM fine-tuning, retrieval-augmented generation (RAG), and transfer learning from related high-resource languages. This study is exclusively focused on single-GPU training to facilitate ease of adoption. Our study focuses on two-way translation between French and Monégasque — a rare language unsupported by existing translation tools due to limited corpus availability. Our results demonstrate LYRA’s effectiveness, frequently surpassing and consistently matching state-of-the-art encoder-decoder models in rare language translation.

pdf bib
Improving LLM Abilities in Idiomatic Translation
Sundesh Donthi | Maximilian Spencer | Om B. Patel | Joon Young Doh | Eid Rodan | Kevin Zhu | Sean O’Brien

Translating idiomatic expressions remains a challenge for large language models (LLMs), as they often produce literal, semantically incorrect translations—for instance, directly converting “break a leg” into a nonsensical phrase in the target language. While external resources like IdiomKB can supply the figurative meaning and thus yield semantically accurate translations, this approach does not preserve the cultural and stylistic nuances that make idioms so distinctive. Our study focuses on idiomatic translations across multiple languages, including Chinese (ZH), Urdu (UR), and Hindi (HI), with clearly defined abbreviations for each. We propose two methods for improving idiomatic translation fidelity: a Semantic Idiom Alignment (SIA) approach that uses pre-trained sentence embeddings to identify target-language idioms, and a Language-Model-based Idiom Alignment (LIA) approach that prompts an LLM to suggest appropriate idiom counterparts. Human evaluations across multiple language pairs show that SIA better preserves idiomatic style. To support this work, we introduce idiom datasets in low-resource languages (Urdu and Hindi). Our results indicate that aligning idioms at the semantic level can improve cross-lingual style preservation and cultural authenticity.

pdf bib
A Comparative Study of Static and Contextual Embeddings for Analyzing Semantic Changes in Medieval Latin Charters
Yifan Liu | Gelila Tilahun | Xinxiang Gao | Qianfeng Wen | Michael Gervers

The Norman Conquest of 1066 C.E. brought profound transformations to England’s administrative, societal, and linguistic practices. The DEEDS (Documents of Early England Data Set) database offers a unique opportunity to explore these changes by examining shifts in word meanings within a vast collection of Medieval Latin charters. While computational linguistics typically relies on vector representations of words like static and contextual embeddings to analyze semantic changes, existing embeddings for scarce and historical Medieval Latin are limited and may not be well-suited for this task. This paper presents the first computational analysis of semantic change pre- and post-Norman Conquest and the first systematic comparison of static and contextual embeddings in a scarce historical data set. Our findings confirm that, consistent with existing studies, contextual embeddings outperform static word embeddings in capturing semantic change within a scarce historical corpus.

pdf bib
Bridging Literacy Gaps in African Informal Business Management with Low-Resource Conversational Agents
Maimouna Ouattara | Abdoul Kader Kaboré | Jacques Klein | Tegawendé F. Bissyandé

Position paper: In many African countries, the informal business sector represents the backbone of the economy, providing essential livelihoods and opportunities where formal employment is limited. Despite, however, the growing adoption of digital tools, entrepreneurs in this sector often face significant challenges due to lack of literacy and language barriers. These barriers not only limit accessibility but also increase the risk of fraud and financial insecurity. This position paper explores the potential of conversational agents (CAs) adapted to low-resource languages (LRLs), focusing specifically on Mooré, a language widely spoken in Burkina Faso. By enabling natural language interactions in local languages, AI-driven conversational agents offer a promising solution to enable informal traders to manage their financial transactions independently, thus promoting greater autonomy and security in business, while providing a step towards formalization of their business. Our study examines the main challenges in developing AI for African languages, including data scarcity and linguistic diversity, and reviews viable strategies for addressing them, such as cross-lingual transfer learning and data augmentation techniques.

pdf bib
Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias
Jayanta Sadhu | Maneesha Rani Saha | Rifat Shahriyar

The rapid growth of Large Language Models (LLMs) has put forward the study of biases as a crucial field. It is important to assess the influence of different types of biases embedded in LLMs to ensure fair use in sensitive fields. Although there have been extensive works on bias assessment in English, such efforts are rare and scarce for a major language like Bangla. In this work, we examine two types of social biases in LLM generated outputs for Bangla language. Our main contributions in this work are: (1) bias studies on two different social biases for Bangla, (2) a curated dataset for bias measurement benchmarking and (3) testing two different probing techniques for bias detection in the context of Bangla. This is the first work of such kind involving bias assessment of LLMs for Bangla to the best of our knowledge. All our code and resources are publicly available for the progress of bias related research in Bangla NLP.

pdf bib
Extracting General-use Transformers for Low-resource Languages via Knowledge Distillation
Jan Christian Blaise Cruz

In this paper, we propose the use of simple knowledge distillation to produce smaller and more efficient single-language transformers from Massively Multilingual Transformers (MMTs) to alleviate tradeoffs associated with the use of such in low-resource settings. Using Tagalog as a case study, we show that these smaller single-language models perform on-par with strong baselines in a variety of benchmark tasks in a much more efficient manner. Furthermore, we investigate additional steps during the distillation process that improves the soft-supervision of the target language, and provide a number of analyses and ablations to show the efficacy of the proposed method.

pdf bib
Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models
Sina Bagheri Nezhad | Ameeta Agrawal | Rhitabrat Pokharel

Multilingual language models (MLLMs) are crucial for handling text across various languages, yet they often show performance disparities due to differences in resource availability and linguistic characteristics. While the impact of pre-train data percentage and model size on performance is well-known, our study reveals additional critical factors that significantly influence MLLM effectiveness. Analyzing a wide range of features, including geographical, linguistic, and resource-related aspects, we focus on the SIB-200 dataset for classification and the Flores-200 dataset for machine translation, using regression models and SHAP values across 204 languages. Our findings identify token similarity and country similarity as pivotal factors, alongside pre-train data and model size, in enhancing model performance. Token similarity facilitates cross-lingual transfer, while country similarity highlights the importance of shared cultural and linguistic contexts. These insights offer valuable guidance for developing more equitable and effective multilingual language models, particularly for underrepresented languages.

pdf bib
BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context
Alexis Matzopoulos | Charl Hendriks | Hishaam Mahomed | Francois Meyer

The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the amount of words children are exposed to in development (<100m). The challenge produced new architectures for data-efficient language modelling, outperforming models trained on trillions of words. This is promising for low-resource languages, where available corpora are limited to much less than 100m words. In this paper, we explore the potential of BabyLMs for low-resource languages, using the isiXhosa language as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. They outperform a vanilla pretrained model on POS tagging and NER, achieving notable gains (+3.2 F1) for the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but highlight the continued importance, and lack of, high-quality pretraining data. Finally, we visually analyse how BabyLM architectures encode isiXhosa.

pdf bib
Mapping Cross-Lingual Sentence Representations for Low-Resource Language Pairs Using Pre-trained Language Models
Tsegaye Misikir Tashu | Andreea Ioana Tudor

In this work, we explore different linear mapping techniques to learn cross-lingual document representations from pre-trained multilingual large language models for low-resource languages. Three different mapping techniques namely Linear Concept Approximation (LCA), Linear Concept Compression (LCC), and Neural Concept Approximation (NCA) and four multilingual language models such as mBERT, mT5, XLM-R, and ErnieM were used to extract embeddings. The inter-lingual representations were created mappings the monolingual representation extracted from multilingual language models. The experimental results showed that LCA and LCC significantly outperform NCA, with models like ErnieM achieving the highest alignment quality. Language pairs exhibit variable performance, influenced by linguistic similarity and data availability, with the Amharic-English pair yielding particularly high scores. The results showed the utility of LCA and LCC in enabling cross-lingual tasks for low-resource languages.

pdf bib
How to age BERT Well: Continuous Training for Historical Language Adaptation
Anika Harju | Rob van der Goot

As the application of computational tools increases to digitalize historical archives, automatic annotation challenges persist due to distinct linguistic and morphological features of historical languages like Old English (OE). Existing tools struggle with the historical language varieties due to insufficient training. Previous research has focused on adapting pre-trained language models to new languages or domains but has rarely explored the modeling of language variety across time. Hence, we investigate the effectiveness of continuous language model training for adapting language models to OE on domain-specific data. We compare the continuous training of an English model (EN) and a multilingual model (ML), and use POS tagging for downstream evaluation. Results show that continuous pre-training substantially improves performance. We retrain a modern English (EN) model and a Multi-lingual (ML) BERT model for OE. We confirmed the effectiveness of continuous pre-training for language adaptation and downstream evaluation utilizing part-of-speech (POS) tagging, advancing the potential to understand the unique grammatical structures of historical OE archives. More concretely, EN BERT initially outperformed ML BERT with an accuracy of 83% during the language modeling phase. However, on the POS tagging task, ML BERT surpassed EN BERT, achieving an accuracy of 94%, which suggests effective performance to the historical language varieties.

pdf bib
Exploiting Task Reversibility of DRS Parsing and Generation: Challenges and Insights from a Multi-lingual Perspective
Muhammad Saad Amin | Luca Anselma | Alessandro Mazzei

Semantic parsing and text generation exhibit reversible properties when utilizing Discourse Representation Structures (DRS). However, both processes—text-to-DRS parsing and DRS-to-text generation—are susceptible to errors. In this paper, we exploit the reversible nature of DRS to explore both error propagation, which is commonly seen in pipeline methods, and the less frequently studied potential for error correction. We investigate two pipeline approaches: Parse-Generate-Parse (PGP) and Generate-Parse-Generate (GPG), utilizing pre-trained language models where the output of one model becomes the input for the next. Our evaluation uses the Parallel Meaning Bank dataset, focusing on Urdu as a low-resource language, Italian as a mid-resource language, and English serving as a high-resource baseline. Our analysis highlights that while pipelines are theoretically suited for error correction, they more often propagate errors, with Urdu exhibiting the greatest sensitivity, Italian showing a moderate effect, and English demonstrating the highest stability. This variation highlights the unique challenges faced by low-resource languages in semantic processing tasks. Further, our findings suggest that these pipeline methods support the development of more linguistically balanced datasets, enabling a comprehensive assessment across factors like sentence structure, length, type, polarity, and voice. Our cross-linguistic analysis provides valuable insights into the behavior of DRS processing in low-resource contexts, demonstrating both the potential and limitations of reversible pipeline approaches.

pdf bib
BBPOS: BERT-based Part-of-Speech Tagging for Uzbek
Latofat Bobojonova | Arofat Akhundjanova | Phil Sidney Ostheimer | Sophie Fellenz

This paper advances NLP research for the low-resource Uzbek language by evaluating two previously untested monolingual Uzbek BERT models on the part-of-speech (POS) tagging task and introducing the first publicly available UPOS-tagged benchmark dataset for Uzbek. Our fine-tuned models achieve 91% average accuracy, outperforming the baseline multi-lingual BERT as well as the rule-based tagger. Notably, these models capture intermediate POS changes through affixes and demonstrate context sensitivity, unlike existing rule-based taggers.

pdf bib
When Every Token Counts: Optimal Segmentation for Low-Resource Language Models
Vikrant Dewangan | Bharath Raj S | Garvit Suri | Raghav Sonavane

Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding (BPE) are widely used, questions remain about their optimality across model scales and languages. In this work, we demonstrate through extensive experiments that an optimal BPE configuration significantly reduces token count compared to greedy segmentation, yielding improvements in token-saving percentages and performance benefits, particularly for smaller models. We evaluate tokenization performance across various intrinsic and extrinsic tasks, including generation and classification. Our findings suggest that compression-optimized tokenization strategies could provide substantial advantages for multilingual and low-resource (LR) language applications, highlighting a promising direction for further research and inclusive NLP.

pdf bib
Recent Advancements and Challenges of Turkic Central Asian Language Processing
Yana Veitsman | Mareike Hartmann

Research in NLP for Central Asian Turkic languages - Kazakh, Uzbek, Kyrgyz, and Turkmen - faces typical low-resource language challenges like data scarcity, limited linguistic resources and technology development. However, recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks. Thus, this paper aims to summarize recent progress and identify future research directions. It provides a high-level overview of each language’s linguistic features, the current technology landscape, the application of transfer learning from higher-resource languages, and the availability of labeled and unlabeled data. By outlining the current state, we hope to inspire and facilitate future research.

pdf bib
CaLQuest.PT: Towards the Collection and Evaluation of Natural Causal Ladder Questions in Portuguese for AI Agents
Uriel Anderson Lasheras | Vladia Pinheiro

Large Language Models (LLMs) are increasingly central to the development of generative AI across diverse fields. While some anticipate these models may mark a step toward artificial general intelligence, their ability to handle complex causal reasoning remains unproven. Causal reasoning, particularly at Pearl’s interventional and counterfactual levels, is essential for true general intelligence. In this work, we introduce CaLQuest.PT, a dataset of over 8,000 natural causal questions in Portuguese, collected from real human interactions. Built upon a novel three-axis taxonomy, CaLQuest.PT categorizes questions by causal intent, action requirements, and the level of causal reasoning needed (associational, interventional, or counterfactual). Our findings from evaluating CaLQuest.PT’s seed questions with GPT-4o reveal that this LLM face challenges in handling interventional and relation-seeking causal queries. These results suggest limitations in using GPT-4o for extending causal question annotations and highlight the need for improved LLM strategies in causal reasoning. CaLQuest.PT provides a foundation for advancing LLM capabilities in causal understanding, particularly for the Portuguese-speaking world.

pdf bib
PersianMCQ-Instruct: A Comprehensive Resource for Generating Multiple-Choice Questions in Persian
Kamyar Zeinalipour | Neda Jamshidi | Fahimeh Akbari | Marco Maggini | Monica Bianchini | Marco Gori

We present PersianMCQ-Instruct, a comprehensive resource that includes a dataset and advanced models for generating multiple-choice questions (MCQs) in standard Iranian Persian, a low-resource language spoken by over 80 million people. This resource features three state-of-the-art models for Persian MCQ generation: PMCQ-Gemma2-9b, PMCQ-Llama3.1-8b, and PMCQ-Mistral-7B. Inspired by the Agent Instruct framework and GPT-4o, we created the dataset by curating over 4,000 unique Persian Wikipedia pages, resulting in three MCQs per page and a total of over 12,000 questions. To ensure the quality of this dataset, we conducted human evaluations and model fine-tuning, both of which demonstrated significant performance improvements in Persian MCQ generation. The dataset and models are publicly available, offering valuable tools for researchers and educators, with particular benefits for advancing Persian-language educational technology.

pdf bib
Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss
Galim Turumtaev

Neural language models often struggle with low-resource languages due to the limited availability of training data, making tokens from these languages rare in the training set. This paper addresses a specific challenge during training: rare tokens are disproportionately affected by marginalization, which prevents them from learning effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.

pdf bib
Towards Inclusive Arabic LLMs: A Culturally Aligned Benchmark in Arabic Large Language Model Evaluation
Omer Nacar | Serry Taiseer Sibaee | Samar Ahmed | Safa Ben Atitallah | Adel Ammar | Yasser Alhabashi | Abdulrahman S. Al-Batati | Arwa Alsehibani | Nour Qandos | Omar Elshehy | Mohamed Abdelkader | Anis Koubaa

Arabic Large Language Models are usually evaluated using Western-centric benchmarks that overlook essential cultural contexts, making them less effective and culturally misaligned for Arabic-speaking communities. This study addresses this gap by evaluating the Arabic Massive Multitask Language Understanding (MMLU) Benchmark to assess its cultural alignment and relevance for Arabic Large Language Models (LLMs) across culturally sensitive topics. A team of eleven experts annotated over 2,500 questions, evaluating them based on fluency, adequacy, cultural appropriateness, bias detection, religious sensitivity, and adherence to social norms. Through human assessment, the study highlights significant cultural misalignments and biases, particularly in sensitive areas like religion and morality. In response to these findings, we propose annotation guidelines and integrate culturally enriched data sources to enhance the benchmark’s reliability and relevance. The research highlights the importance of cultural sensitivity in evaluating inclusive Arabic LLMs, fostering more widely accepted LLMs for Arabic-speaking communities.

pdf bib
Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models
Daria Kryvosheieva | Roger Levy

Language models (LMs) are capable of acquiring elements of human-like syntactic knowledge. Targeted syntactic evaluation tests have been employed to measure how well they form generalizations about syntactic phenomena in high-resource languages such as English. However, we still lack a thorough understanding of LMs’ capacity for syntactic generalizations in low-resource languages, which are responsible for much of the diversity of syntactic patterns worldwide. In this study, we develop targeted syntactic evaluation tests for three low-resource languages (Basque, Hindi, and Swahili) and use them to evaluate five families of open-access multilingual Transformer LMs. We find that some syntactic tasks prove relatively easy for LMs while others (agreement in sentences containing indirect objects in Basque, agreement across a prepositional phrase in Swahili) are challenging. We additionally uncover issues with publicly available Transformers, including a bias toward the habitual aspect in Hindi in multilingual BERT and underperformance compared to similar-sized models in XGLM-4.5B.

pdf bib
Evaluating Large Language Models for In-Context Learning of Linguistic Patterns In Unseen Low Resource Languages
Hongpu Zhu | Yuqi Liang | Wenjing Xu | Hongzhi Xu

This paper investigates the ability of Large language Models (LLMs) in capturing linguistic patterns from unseen languages and applying them to translation between the languages and English within an in-context learning framework. Inspired by the International Linguistics Olympiad (IOL), we create test data consisting of translation puzzles between 40 low resource languages and English. We test the LLMs in two different strategies: direct prompting and step-by-step prompting. In the latter, the puzzles are manually decomposed into intermediate steps to allow LLMs learn and apply linguistic rules incrementally. The results show that this strategy can significantly improve the performance of LLMs, achieving comparable or slightly superior results to humans when translating the unseen languages to English. However, LLMs still struggle with translating English into the unseen languages, typically with complex syntactic rules. We further observe that LLMs cannot deal with languages with object-subject and noun-adjective word order compared to others, reflecting the potential impact imposed by typological features of languages in training data.

pdf bib
Next-Level Cantonese-to-Mandarin Translation: Fine-Tuning and Post-Processing with LLMs
Yuqian Dai | Chun Fai Chan | Ying Ki Wong | Tsz Ho Pun

Large Language Models (LLMs) have improved performance across various natural language processing tasks. Despite these improvements, LLMs continue to face significant challenges, such as grammatical issues and code-switching to English, when applied to low-resource languages like Cantonese in Machine Translation (MT) scenarios. By addressing the unique linguistic and contextual challenges of Cantonese, we present a novel strategy to improve the understanding and translation capabilities of LLMs for Cantonese-to-Mandarin MT. Our strategy comprises three key components: (1) Syntax and Part-of-Speech (POS) fine-tuning, where we use the Universal Dependencies (UD) corpus to fine-tune LLM, focusing on the linguistic structures of Cantonese; (2) Specialized Cantonese to Mandarin sentence pairs, collected from diverse sources such as Cantonese grammar textbooks and manually translated sentences across various domains, to expose the model to a wide range of linguistic contexts; (3) Post-processing with additional LLMs, where we introduce additional LLMs to improve the initial translations, correcting Mandarin grammar and punctuation. Empirical evaluations on human-created test sets show that our proposed strategy improves translation performance and outperforms existing commercial translation models with at least 3 BLEU scores. Additionally, our strategy also benefits other LLMs and a reversed translation direction, demonstrating its generalization and effectiveness.

pdf bib
When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages
Archchana Sindhujan | Diptesh Kanojia | Constantin Orasan | Shenbin Qian

This paper investigates the reference-less evaluation of machine translation for low-resource language pairs, known as quality estimation (QE). Segment-level QE is a challenging cross-lingual language understanding task that provides a quality score (0 -100) to the translated output. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios and perform instruction fine-tuning using a novel prompt based on annotation guidelines. Our results indicate that prompt-based approaches are outperformed by the encoder-based fine-tuned QE models. Our error analysis reveals tokenization issues, along with errors due to transliteration and named entities, and argues for refinement in LLM pre-training for cross-lingual tasks. We release the data, and models trained publicly for further research.