Deshan Koshala Sumanathilaka


2025

Recent advances in Large Language Models (LLMs) have significantly reshaped the landscape of Natural Language Processing (NLP). Among the various prompting techniques, few-shot prompting has gained considerable attention for its practicality and effectiveness. This study investigates how few-shot prompting strategies impact the Word Sense Disambiguation (WSD) task, particularly focusing on the biases introduced by imbalanced sample distributions. We use the GLOSSGPT prompting method, an advanced approach for English WSD, to test its effectiveness across five languages: English, German, Spanish, French, and Italian. Our results show that imbalanced few-shot examples can cause incorrect sense predictions in the non-English languages, whereas English remains unaffected. To assess model behavior, we evaluate both the GPT-4o and LLaMA-3.1-70B models; the results highlight the sensitivity of multilingual WSD to sample distribution in few-shot settings, emphasizing the need for balanced and representative prompting strategies.
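The few-shot setup described above can be sketched as a prompt builder. The template below is a hypothetical illustration, not the actual GLOSSGPT prompt; the point is that the example list passed in should be balanced across senses, since the abstract shows that skewed examples bias non-English predictions.

```python
# Hypothetical sketch of a GLOSSGPT-style few-shot WSD prompt.
# The exact template and gloss wording are assumptions for illustration.

def build_wsd_prompt(target_word, sentence, glosses, examples):
    """Compose a few-shot WSD prompt. `examples` pairs a sentence with
    its correct gloss; balancing examples across senses avoids the
    distribution bias the study describes."""
    lines = [f"Select the correct sense of '{target_word}'.",
             "Candidate glosses:"]
    lines += [f"  ({i + 1}) {g}" for i, g in enumerate(glosses)]
    lines.append("Examples:")
    for ex_sentence, ex_gloss in examples:
        lines.append(f"  Sentence: {ex_sentence}")
        lines.append(f"  Sense: {ex_gloss}")
    lines.append(f"Sentence: {sentence}")
    lines.append("Sense:")
    return "\n".join(lines)

# One example per sense: a balanced two-shot prompt.
prompt = build_wsd_prompt(
    "bank",
    "She sat on the bank of the river.",
    ["a financial institution", "the sloping land beside a body of water"],
    [("He deposited money at the bank.", "a financial institution"),
     ("Reeds grew along the bank.", "the sloping land beside a body of water")],
)
```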
Dyslexia in adults remains an under-researched and under-served area, particularly in non-English-speaking contexts, despite its significant impact on personal and professional lives. This work addresses that gap by focusing on Sinhala, a low-resource language with limited tools for linguistic accessibility. We present an assistive system designed specifically for Sinhala-speaking adults with dyslexia. The system integrates Whisper for speech-to-text conversion, SinBERT, an open-source BERT model fine-tuned for Sinhala to identify common dyslexic errors, and a combined mT5 and Mistral-based model to generate corrected text. Finally, the output is converted back to speech using gTTS, creating a complete multimodal feedback loop. Despite the challenges posed by limited Sinhala-language datasets, the system achieves 66% transcription accuracy and 70% correction accuracy with 65% overall system accuracy. These results demonstrate both the feasibility and effectiveness of the approach. Ultimately, this work highlights the importance of inclusive NLP technologies in underrepresented languages and showcases a practical step toward improving accessibility for adult dyslexic users.
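The four-stage feedback loop can be pictured as the skeleton below. Every stage is a stub: in the actual system, the stages would call Whisper, SinBERT, the mT5/Mistral corrector, and gTTS respectively, and the sample Sinhala text and flagged span are invented for illustration.

```python
# Skeleton of the speech -> text -> correction -> speech loop.
# All bodies are placeholders; only the data flow reflects the abstract.

def transcribe(audio: bytes) -> str:
    """Speech-to-text (Whisper in the real system)."""
    return "මම ගදර යනවා"  # stubbed transcription with a dyslexic error

def detect_errors(text: str):
    """Flag likely dyslexic errors (SinBERT in the real system)."""
    return [(3, 6)]  # stubbed character span of the misspelled word

def correct(text: str, spans) -> str:
    """Generate corrected text (mT5 + Mistral in the real system)."""
    return text.replace("ගදර", "ගෙදර")  # stubbed fix

def synthesize(text: str) -> bytes:
    """Text-to-speech (gTTS in the real system)."""
    return b"<audio-bytes>"  # stubbed audio

def pipeline(audio: bytes):
    text = transcribe(audio)
    fixed = correct(text, detect_errors(text))
    return fixed, synthesize(fixed)
```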
Large Language Models (LLMs) have revolutionised the field of artificial intelligence and have been successfully employed in many disciplines, capturing widespread attention and enthusiasm. Many previous studies have established that domain-specific deep learning models perform competitively with general-purpose LLMs (Maatouk et al., 2024; Lu et al., 2024). However, a suitable prompt which provides direct instructions and background information is expected to yield improved results (Kamruzzaman and Kim, 2024). The present study focuses on utilising LLMs for the Toponym Resolution task by incorporating Retrieval-Augmented Generation (RAG) and prompting techniques to surpass the results of the traditional deep learning models. Moreover, this study demonstrates that promising results can be achieved without relying on large amounts of labelled, domain-specific data. After a descriptive comparison between open-source and proprietary LLMs through different prompt engineering techniques, the GPT-4o model performs best compared to the other LLMs for the Toponym Resolution task.
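The RAG idea here can be sketched in miniature: retrieve candidate gazetteer entries for a place-name mention, then splice them into the prompt so the model can ground its answer. The toy gazetteer, the exact-match retriever, and the prompt wording are all assumptions, not the study's actual resources.

```python
# Minimal RAG sketch for toponym resolution: retrieval grounds the
# prompt with candidate coordinates. Illustrative data only.

GAZETTEER = [
    {"name": "Springfield", "admin": "Illinois, USA", "lat": 39.78, "lon": -89.65},
    {"name": "Springfield", "admin": "Massachusetts, USA", "lat": 42.10, "lon": -72.59},
    {"name": "Paris", "admin": "Île-de-France, France", "lat": 48.86, "lon": 2.35},
]

def retrieve(toponym):
    """Return gazetteer rows whose name matches the mention."""
    return [e for e in GAZETTEER if e["name"].lower() == toponym.lower()]

def build_prompt(toponym, context):
    """Assemble a grounded prompt from the retrieved candidates."""
    lines = [f"Resolve the toponym '{toponym}' in: \"{context}\"",
             "Candidates:"]
    for c in retrieve(toponym):
        lines.append(f"  - {c['name']}, {c['admin']} ({c['lat']}, {c['lon']})")
    lines.append("Answer with the matching candidate's coordinates.")
    return "\n".join(lines)
```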
This systematic review paper provides an overview of recent machine translation and transliteration developments for Indo-Aryan languages spoken by a large population across South Asia. The paper examines advancements in translation and transliteration systems for several language pairs reported in recently published papers. The review summarizes the current state of these technologies, providing a valuable resource for researchers in these fields to understand and locate existing systems and techniques for translation and transliteration.
Romanized text has become popular with the growth of digital communication platforms, largely due to familiarity with English keyboards. In Sri Lanka, Romanized Sinhala, commonly referred to as “Singlish”, is widely used in digital communications. This paper introduces a novel context-aware back-transliteration system designed to address the ad-hoc typing patterns and lexical ambiguity inherent in Singlish. The proposed system combines dictionary-based mapping for Singlish words, rule-based transliteration for out-of-vocabulary words, and a BERT-based language model for addressing lexical ambiguities. Evaluation results demonstrate the robustness of the proposed approach, achieving high BLEU scores along with low Word Error Rate (WER) and Character Error Rate (CER) across test datasets. This study provides an effective solution for Romanized Sinhala back-transliteration and establishes the foundation for improving NLP tools for similar low-resourced languages.
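The hybrid fallback described above (dictionary first, rules for out-of-vocabulary tokens) can be sketched as follows. The dictionary entries and character rules below are tiny invented samples, not the system's actual resources, and the BERT-based disambiguation step is omitted.

```python
# Toy sketch of the hybrid back-transliteration pipeline:
# dictionary lookup, then greedy rule-based mapping for OOV tokens.

SINGLISH_DICT = {"mama": "මම", "gedara": "ගෙදර"}  # illustrative entries

# Greedy longest-match romanization rules (illustrative subset).
RULES = [("ka", "ක"), ("ma", "ම"), ("a", "අ")]

def rule_transliterate(token):
    """Map an out-of-vocabulary token character-cluster by cluster."""
    out, i = [], 0
    while i < len(token):
        for roman, sinhala in RULES:
            if token.startswith(roman, i):
                out.append(sinhala)
                i += len(roman)
                break
        else:
            out.append(token[i])  # pass through unmatched characters
            i += 1
    return "".join(out)

def back_transliterate(sentence):
    """Dictionary hit wins; otherwise fall back to the rules."""
    return " ".join(SINGLISH_DICT.get(t, rule_transliterate(t))
                    for t in sentence.split())
```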
This paper introduces the first Transliteration Disambiguation (TD) dataset for Romanized Sinhala, informally known as Singlish, developed to address the challenge of transliteration ambiguity in back-transliteration tasks. The dataset covers 22 ambiguous Romanized Sinhala words, each mapping to two distinct Sinhala meanings, and provides 30 Romanized sentences per word: ten for each meaning individually and ten containing both meanings in context. Sentences were initially collected through web scraping and later post-processed using the Claude language model, which offers strong support for Sinhala, alongside a rule-based Romanization process to ensure linguistic quality and consistency. To demonstrate its applicability, the dataset was used to evaluate four existing back-transliteration systems, highlighting their performance in resolving context-sensitive ambiguities. Baseline evaluations confirm the dataset’s effectiveness in assessing transliteration systems’ ability to handle transliteration ambiguity, offering a valuable resource for advancing TD and transliteration research for Sinhala.
The growth of mobile financial transactions presents new challenges for fraud detection, where traditional and ML methods often miss emerging patterns. While Large Language Models (LLMs) offer advanced language understanding, they are typically too resource-intensive for mobile deployment and raise privacy concerns due to cloud reliance. This paper proposes a lightweight, privacy-preserving approach by fine-tuning and quantizing compact LLMs for on-device fraud detection from textual data. Models were optimized using Open Neural Network Exchange (ONNX) conversion and quantization to ensure efficiency. The fine-tuned quantized Llama-160M-Chat-v1 (bnb4) achieved 99.47% accuracy with a 168MB footprint, while fine-tuned quantized Qwen1.5-0.5B-Chat (bnb4) reached 99.50% accuracy at 797MB. These results demonstrate that optimized LLMs can deliver accurate, real-time fraud detection on mobile devices without compromising user privacy.
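The footprint reductions reported above come from 4-bit quantization. The round-trip below is a pure-Python illustration of the underlying arithmetic (symmetric integer quantization), not the actual ONNX/bnb4 implementation: each weight is stored as a 4-bit integer plus a shared scale, which is why the models shrink while staying approximately intact.

```python
# Illustrative 4-bit symmetric quantization round-trip.
# Real bnb4 quantization is block-wise and more sophisticated;
# this only shows the core idea behind the smaller footprint.

def quantize_4bit(weights):
    """Map floats onto the signed 4-bit range [-8, 7] with one scale."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [v * scale for v in q]

w = [0.12, -0.4, 0.07, 0.31]          # toy weight values
q, s = quantize_4bit(w)               # 4 bits per weight + one scale
w_hat = dequantize(q, s)              # approximate reconstruction
```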

2024

Ambiguous words are often found within modern digital communications. Lexical ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due to limited data. Consequently, the efficiency of translation, information retrieval, and question-answering systems is hindered by these limitations. This study investigates the use of Large Language Models (LLMs) to improve WSD using a novel approach combining a systematic prompt augmentation mechanism with a knowledge base (KB) consisting of different sense interpretations. The proposed method incorporates a human-in-the-loop approach for prompt augmentation, where the prompt is supported by Part-of-Speech (POS) tagging, synonyms of ambiguous words, aspect-based sense filtering, and few-shot prompting to guide the LLM. By utilizing a few-shot Chain-of-Thought (CoT) prompting-based approach, this work demonstrates a substantial improvement in performance. The evaluation was conducted using FEWS test data and sense tags. This research advances accurate word interpretation in social media and digital communication.
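The prompt-augmentation idea can be sketched as a function that enriches a bare query with the signals listed above: a POS tag, synonyms, candidate senses from the KB, and a worked chain-of-thought example. The template and field names are assumptions for illustration, not the paper's actual prompt.

```python
# Hedged sketch of an augmented few-shot CoT prompt for WSD.
# Every field shown is an invented illustration of the mechanism.

def augment_prompt(word, sentence, pos, synonyms, senses, cot_example):
    """Assemble an augmented prompt: POS tag, synonyms, KB senses,
    and one chain-of-thought demonstration before the final query."""
    parts = [
        f"Disambiguate '{word}' ({pos}) in: \"{sentence}\"",
        f"Synonyms: {', '.join(synonyms)}",
        "Candidate senses:",
    ]
    parts += [f"  - {s}" for s in senses]
    parts.append("Worked example (reason step by step):")
    parts.append(cot_example)
    parts.append("Now reason step by step, then state the chosen sense.")
    return "\n".join(parts)
```

A call might combine a POS-tagged target with its KB senses, e.g. `augment_prompt("spring", ..., "NOUN", ["coil", "season"], ...)`; the CoT example teaches the model to justify its choice before answering.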