Ruvan Weerasinghe

2025

Towards Effective Emotion Analysis in Low-Resource Tamil Texts
Priyatharshan Balachandran | Uthayasanker Thayasivam | Randil Pushpananda | Ruvan Weerasinghe
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Emotion analysis plays a significant role in understanding human behavior and communication, yet research on the Tamil language remains limited. This study focuses on building an emotion classifier for Tamil texts using machine learning (ML) and deep learning (DL), along with creating an emotion-annotated Tamil corpus for Ekman’s basic emotions. Our dataset combines publicly available data with re-annotation and translations. Along with traditional ML models, we investigated the use of Transfer Learning (TL) with state-of-the-art models, such as BERT- and ELECTRA-based models. Experiments were conducted on unbalanced and balanced datasets using data augmentation techniques. The results indicate that Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) performed well with TF-IDF and BoW representations, while among Transfer Learning models, LaBSE achieved the highest accuracy (63% balanced, 69% unbalanced), followed by TamilBERT and IndicBERT.

Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or Anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 Sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.

Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
Ruvan Weerasinghe | Isuri Anuradha | Deshan Sumanathilaka

2021

A Dataset for Research on Modelling Depression Severity in Online Forum Data
Isuri Anuradha Nanomi Arachchige | Vihangi Himaya Jayasuriya | Ruvan Weerasinghe
Proceedings of the Student Research Workshop Associated with RANLP 2021

People use online forums either to look for information or to contribute it. Because of their growing popularity, certain online forums have been created specifically to provide support, assistance, and opinions for people suffering from mental illness. Depression is one of the most frequent psychological illnesses worldwide. People increasingly turn to online forums to find answers about their psychological illness. However, there is no mechanism to measure the severity of depression in each post and give higher importance to those who are diagnosed as more severely depressed. Although numerous studies on online forum data and the identification of depression have been conducted, the severity of depression is rarely explored. In addition, the absence of datasets will stymie the development of novel diagnostic procedures for practitioners. In this study, we offer a dataset to support research on depression severity evaluation. The computational approach used here to automatically identify the severity of depression is quite novel. Nonetheless, a more elaborate study of measuring depression severity in online forum posts is needed to ensure that the measurement scales used in our research meet the expected norms of scientific research.

The cumulative effort that has gone into developing linguistic resources over the past few decades, for tasks ranging from machine-readable dictionaries to translation systems, is enormous. Such effort is prohibitively expensive for languages outside the (largely) European family. The possibility of building such resources automatically by accessing electronic corpora of such languages is therefore of great interest to those involved in studying these ‘new’ - ‘lesser known’ languages. The main stumbling block to applying these data-driven techniques directly is that most of them require large corpora, which are rarely available for such ‘new’ languages. This paper describes an attempt at setting up a bootstrapping agenda to exploit the scarce corpus resources that may be available at the outset to a researcher concerned with such languages. In particular, it reports on the results of an experiment using state-of-the-art data-driven techniques to build linguistic resources for Sinhala - a non-European language with virtually no electronic resources.