Damith Premasiri

2026


The second Workshop on Language Models for Low-Resource Languages (LoResLM 2026) was held in conjunction with the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026) in Rabat, Morocco. This workshop mainly aimed to provide a forum for researchers to share and discuss their ongoing work on language models (LMs) focusing on low-resource languages and dialects, following recent advancements in neural language models and their linguistic biases towards high-resource languages. LoResLM 2026 attracted notable interest from the natural language processing (NLP) community, resulting in 55 accepted papers from 79 submissions. These contributions cover a broad range of low-resource languages from 13 language families and 11 diverse research areas, paving the way for future possibilities and promoting linguistic inclusivity in NLP.

2025


Recently, language models (LMs) have produced excellent results in many natural language processing (NLP) tasks. However, their effectiveness is highly dependent on available pre-training resources, which is particularly challenging for low-resource languages such as Sinhala. Furthermore, the scarcity of benchmarks to evaluate LMs is also a major concern for low-resource languages. In this paper, we address these two challenges for Sinhala by (i) collecting the largest monolingual corpus for Sinhala, (ii) training multiple LMs on this corpus, and (iii) compiling the first Sinhala NLP benchmark (Sinhala-GLUE) and evaluating LMs on it. We show that the Sinhala LMs trained in this paper outperform both popular multilingual LMs, such as XLM-R, and existing Sinhala LMs in downstream NLP tasks. All the trained LMs are publicly available. We also make Sinhala-GLUE available as a public leaderboard, and we hope that it will enable further advancements in developing and evaluating LMs for Sinhala.


Proceedings of the First Workshop on Ethical Concerns in Training, Evaluating and Deploying Large Language Models
Damith Premasiri | Tharindu Ranasinghe | Hansi Hettiarachchi


The first Workshop on Language Models for Low-Resource Languages (LoResLM 2025) was held in conjunction with the 31st International Conference on Computational Linguistics (COLING 2025) in Abu Dhabi, United Arab Emirates. This workshop mainly aimed to provide a forum for researchers to share and discuss their ongoing work on language models (LMs) focusing on low-resource languages, following the recent advancements in neural language models and their linguistic biases towards high-resource languages. LoResLM 2025 attracted notable interest from the natural language processing (NLP) community, resulting in 35 accepted papers from 52 submissions. These contributions cover a broad range of low-resource languages from eight language families and 13 diverse research areas, paving the way for future possibilities and promoting linguistic inclusivity in NLP.


LLM-based Embedders for Prior Case Retrieval
Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. The state-of-the-art deep learning IR methods have not been successful in PCR due to two key challenges: (i) lengthy legal text: the powerful BERT-based transformer models impose a limit on input length, which inevitably requires shortening the input via truncation or division, with a loss of legal context information; and (ii) lack of legal training data: due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively. In this research, we address these challenges by leveraging LLM-based text embedders in PCR. LLM-based embedders support longer input lengths, and since we use them in an unsupervised manner, they do not require training data, addressing both challenges simultaneously. In this paper, we evaluate state-of-the-art LLM-based text embedders on four PCR benchmark datasets and show that they outperform BM25 and supervised transformer-based models.

2024


ChatGPT: Detection of Spanish Terms Based on False Friends
Amal Haddad Haddad | Damith Premasiri
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)

One of the common errors which translators commit when transferring terms from one language into another is erroneously coining terms which are based on a false friend mistake due to the similarity between lexical units forming part of terms. In this case-study, we use ChatGPT to automatically detect terms in Spanish which may be coined based on a false friend relation. To carry out this study, we implemented two experiments with GPT and compared the results. In the first, we prompted GPT to produce a list of twenty terms in Spanish extracted from the UN discourse, which are possibly based on false friend relation, and its English equivalents and analysed the veracity of the results. In the second experiment, we used an aligned corpus to further study the capabilities of the Language Model on detecting false friends in English and Spanish Text. Some results were significant for future terminological studies.


DARES: Dataset for Arabic Readability Estimation of School Materials
Mo El-Haj | Sultan Almujaiwel | Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024

This research introduces DARES, a dataset for assessing the readability of Arabic text in Saudi school materials. DARES comprises 13,335 instances from textbooks used in 2021 and contains two subtasks: (a) coarse-grained readability assessment, where the text is classified into different educational levels such as primary and secondary, and (b) fine-grained readability assessment, where the text is classified into individual grades. We fine-tuned five transformer models that support Arabic and found that CAMeLBERTmix performed the best in all input settings. Evaluation results showed high performance for the coarse-grained readability assessment task, achieving a weighted F1 score of 0.91 and a macro F1 score of 0.79. The fine-grained task achieved a weighted F1 score of 0.68 and a macro F1 score of 0.55. These findings demonstrate the potential of our approach for advancing Arabic text readability assessment in education, with implications for future innovations in the field.
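
The abstract above reports both weighted and macro F1, and the gap between them is easiest to see on a small example. The following is a minimal sketch in plain Python with invented toy labels (not DARES data); the function names are illustrative assumptions, not taken from the paper. Macro F1 averages per-class F1 equally, while weighted F1 weights each class by its number of true instances, so a weak minority class lowers macro F1 more.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores weighted by each class's true count."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy, imbalanced data (hypothetical): 8 "primary" texts, 2 "secondary" texts.
y_true = ["primary"] * 8 + ["secondary"] * 2
y_pred = ["primary"] * 8 + ["primary", "secondary"]

print(round(macro_f1(y_true, y_pred), 4))
print(round(weighted_f1(y_true, y_pred), 4))
```

On this toy data the majority class is predicted well and the minority class poorly, so the weighted score comes out higher than the macro score, the same direction as the gap reported in the abstract.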


NSina: A News Corpus for Sinhala
Hansi Hettiarachchi | Damith Premasiri | Lasitha Randunu Chandrakantha Uyangodage | Tharindu Ranasinghe
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSina, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSina aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSina is the largest news corpus for Sinhala available to date.


A Federated Learning Approach to Privacy Preserving Offensive Language Identification
Marcos Zampieri | Damith Premasiri | Tharindu Ranasinghe
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

The spread of various forms of offensive speech online is an important concern in social media. While platforms have been investing heavily in ways of coping with this problem, the question of privacy remains largely unaddressed. Models trained to detect offensive language on social media are trained and/or fine-tuned using large amounts of data often stored in centralized servers. Since most social media data originates from end users, we propose a privacy-preserving decentralized architecture for identifying offensive language online by introducing Federated Learning (FL) in the context of offensive language identification. FL is a decentralized architecture that allows multiple models to be trained locally without the need for data sharing, hence preserving users’ privacy. We propose a model fusion approach to perform FL. We trained multiple deep learning models on four publicly available English benchmark datasets (AHSD, HASOC, HateXplain, OLID) and evaluated their performance in detail. We also present initial cross-lingual experiments in English and Spanish. We show that the proposed model fusion approach outperforms baselines in all the datasets while preserving privacy.
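
The abstract above does not spell out how the models are fused. As an illustration only, the sketch below shows one common way to combine locally trained models without sharing data: averaging their parameters, optionally weighted by local dataset size (in the style of federated averaging). The function name `fuse_models`, the parameter layout (a dict of name to list of floats) and the numbers are assumptions for this example, not the authors' implementation.

```python
def fuse_models(client_weights, client_sizes=None):
    """Fuse locally trained models by averaging their parameters.

    client_weights: list of dicts mapping a parameter name to a list of floats.
    client_sizes:   optional list of local dataset sizes used as averaging
                    weights; a plain mean is used when omitted.
    Only parameters leave each client; the raw training data never does.
    """
    if not client_weights:
        raise ValueError("at least one client model is required")
    if client_sizes is None:
        client_sizes = [1] * len(client_weights)
    if len(client_sizes) != len(client_weights):
        raise ValueError("client_sizes must match client_weights in length")

    total = sum(client_sizes)
    fused = {}
    for name, reference in client_weights[0].items():
        fused[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(reference))
        ]
    return fused


# Two hypothetical clients, each holding a tiny model with one parameter vector.
client_a = {"classifier.weight": [1.0, 2.0]}
client_b = {"classifier.weight": [3.0, 4.0]}

print(fuse_models([client_a, client_b]))          # plain mean
print(fuse_models([client_a, client_b], [1, 3]))  # weighted by local data size
```

Weighting by dataset size gives clients with more local examples more influence on the fused model; whether that is desirable depends on how the client datasets differ.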

2023


Can Model Fusing Help Transformers in Long Document Classification? An Empirical Study
Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Text classification is an area of research which has been studied over the years in Natural Language Processing (NLP). Adapting NLP to multiple domains has introduced many new challenges for text classification, and one of them is long document classification. While state-of-the-art transformer models provide excellent results in text classification, most of them have limitations in the maximum length of the input sequence. The majority of the transformer models are limited to 512 tokens, and therefore, they struggle with long document classification problems. In this research, we explore employing Model Fusing for long document classification while comparing the results with the well-known BERT and Longformer architectures.


Deep Learning Methods for Identification of Multiword Flower and Plant Names
Damith Premasiri | Amal Haddad Haddad | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Multiword Terms (MWTs) are domain-specific Multiword Expressions (MWEs) where two or more lexemes converge to form a new unit of meaning. The task of processing MWTs is crucial in many Natural Language Processing (NLP) applications, including Machine Translation (MT) and terminology extraction. However, the automatic detection of those terms is a difficult task, and more research is still required to give more insightful and useful results in this field. In this study, we seek to fill this gap using state-of-the-art transformer models. We evaluate both BERT-like discriminative transformer models and generative pre-trained transformer (GPT) models on this task, and we show that discriminative models perform better than current GPT models in identifying multiword terms among flower and plant names in English and Spanish. The best discriminative models achieve F1 scores of 94.3127% and 82.1733% on English and Spanish data, respectively, while ChatGPT achieves only 63.3183% and 47.7925%, respectively.

2022


DTW at Qur’an QA 2022: Utilising Transfer Learning with Transformers for Question Answering in a Low-resource Domain
Damith Premasiri | Tharindu Ranasinghe | Wajdi Zaghouani | Ruslan Mitkov
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

The task of machine reading comprehension (MRC) is a useful benchmark to evaluate the natural language understanding of machines. It has gained popularity in the natural language processing (NLP) field mainly due to the large number of datasets released for many languages. However, MRC has been understudied in several domains, including religious texts. The goal of the Qur’an QA 2022 shared task is to fill this gap by producing state-of-the-art question answering and reading comprehension research on the Qur’an. This paper describes the DTW entry to the Qur’an QA 2022 shared task. Our methodology uses transfer learning to take advantage of available Arabic MRC data. We further improve the results using various ensemble learning strategies. Our approach achieved a partial Reciprocal Rank (pRR) score of 0.49 on the test set, demonstrating its strong performance on the task.