Nisansa De Silva

Also published as: Nisansa de Silva

2025

Improving the Quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
Aloka Fernando | Nisansa de Silva | Menan Velayuthan | Charitha Rathnayake | Surangika Ranathunga
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs) is the most common PDC technique. However, previous research has shown that the choice of the multiPLM significantly impacts the quality of the filtered parallel corpus, and the Neural Machine Translation (NMT) models trained using such data show a disparity across multiPLMs. This paper shows that this disparity is due to different multiPLMs being biased towards certain types of sentence pairs, which are treated as noise from an NMT point of view. We show that such noisy parallel sentences can be removed to a certain extent by employing a series of heuristics. NMT models trained on the curated corpus produce better results while minimizing the disparity across multiPLMs. We publicly release the source code and the curated datasets.

ScheduleMe: Multi-Agent Calendar Assistant
Oshadha Wijerathne | Amandi Nimasha | Dushan Fernando | Nisansa de Silva | Srinath Perera
Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation

Fine-tuning an LLM to Generate Lore Coherent Encounters for Dungeons and Dragons
Aravinth Sivaganeshan | Nisansa de Silva | Akila Peiris
Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation

Encoder-Aware Sequence-Level Knowledge Distillation for Low-Resource Neural Machine Translation
Menan Velayuthan | Nisansa De Silva | Surangika Ranathunga
Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025)

Domain adaptation in Neural Machine Translation (NMT) is commonly achieved through fine-tuning, but this approach becomes inefficient as the number of domains increases. Knowledge distillation (KD) provides a scalable alternative by training a compact model on distilled data from a larger model. However, we hypothesize that vanilla sequence-level KD primarily distills the decoder while neglecting encoder knowledge, leading to suboptimal knowledge transfer and limiting its effectiveness in low-resource settings, where both data and computational resources are constrained. To address this, we propose an improved sequence-level KD method that enhances encoder knowledge transfer through a cosine-based alignment loss. Our approach first trains a large model on a mixed-domain dataset and generates a Distilled Mixed Dataset (DMD). A small model is then trained on this dataset via sequence-level KD with encoder alignment. Experiments in a low-resource setting validate our hypothesis, demonstrating that our approach outperforms vanilla sequence-level KD, improves generalization to out-of-domain data, and facilitates efficient domain adaptation while reducing model size and computational cost.
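
The abstract above names a cosine-based alignment loss between encoder representations but does not give its exact form. A minimal sketch of one common formulation follows, assuming the student and teacher encoder states are already paired and share the same dimension (otherwise a projection would be needed). The function names and the "mean of one minus cosine similarity" definition are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_alignment_loss(student_states, teacher_states):
    """Mean of (1 - cosine similarity) over paired encoder vectors.

    Hypothetical helper: 0 when every student vector points in the same
    direction as its teacher vector, larger as the directions diverge.
    """
    losses = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_states, teacher_states)
    ]
    return sum(losses) / len(losses)
```

In a training loop, a term like this would typically be added to the usual sequence-level KD objective with a weighting factor; the weighting is likewise not specified in the abstract.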

Domain Adaptation for Multi-document Summarisation: A Case Study in the Medical Research Domain
Kushan Hewapathirana | Nisansa de Silva | C.D. Athuraliya | Piumi Kandanaarachchi
Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation

SiDiaC: Sinhala Diachronic Corpus
Nevidu Jayatilleke | Nisansa de Silva
Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation

Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Nevidu Jayatilleke | Nisansa de Silva
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.
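
The WER and CER figures quoted above are edit-distance ratios. As a rough illustration of how such scores are usually computed, the sketch below uses plain Levenshtein distance over characters and over whitespace-separated words. It assumes a non-empty reference and applies no text normalisation; the study's own evaluation scripts and its other three measures are not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because the functions operate on Python strings, they count Unicode code points rather than rendered glyphs, which matters for scripts such as Sinhala and Tamil where one visible character may span several code points.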

Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches
Yomal De Mel | Kasun Wickramasinghe | Nisansa de Silva | Surangika Ranathunga
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages

Due to reasons of convenience and lack of tech literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is highly prevalent in the context of low-resource languages such as Sinhala, which have their own writing script. In this study, our focus is on Romanized Sinhala transliteration. We propose two methods to address this problem: our baseline is a rule-based method, which is then compared against our second method, where we approach the transliteration problem as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based Encoder-Decoder solution. We observed that the Transformer-based method could capture more of the ad-hoc patterns within the Romanized scripts than the rule-based method.

2024

Enhanced Aspect-Based Sentiment Analysis with Integrated Category Extraction for Instruct-DeBERTa
Dineth Jayakody | Koshila Isuranda | Ava Malkith | Nisansa de Silva | Sachintha Rajith Ponnamperuma | Ggn Sandamali | Klk Sudheera | Kashnika Gimhani Sarathchandra
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora
Surangika Ranathunga | Nisansa de Silva | Menan Velayuthan | Aloka Fernando | Charitha Rathnayake
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We conducted a detailed analysis on the quality of web-mined corpora for two low-resource languages (making three language pairs, English-Sinhala, English-Tamil and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out an intrinsic and extrinsic evaluation on different portions of this ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.

Back to the Stats: Rescuing Low Resource Neural Machine Translation with Statistical Methods
Menan Velayuthan | Dilith Jayakody | Nisansa De Silva | Aloka Fernando | Surangika Ranathunga
Proceedings of the Ninth Conference on Machine Translation

This paper describes our submission to the WMT24 shared task for Low-Resource Languages of Spain in the Constrained task category. Due to the lack of deep learning-based data filtration methods for these languages, we propose a purely statistical-based, two-stage pipeline for data filtration. In the primary stage, we begin by removing spaces and punctuation from the source sentences (Spanish) and deduplicating them. We then filter out sentence pairs with inconsistent language predictions by the language identification model, followed by the removal of pairs with anomalous sentence length and word count ratios, using the development set statistics as the threshold. In the secondary stage, for corpora of significant size, we employ a Jensen Shannon divergence-based method to curate training data of the desired size. Our filtered data allowed us to complete a two-step training process in under 3 hours, with GPU power consumption kept below 1 kWh, making our system both economical and eco-friendly. The source code, training data, and best models are available on the project’s GitHub page.
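
The primary filtering stage described above (normalise, deduplicate, then drop pairs with anomalous length ratios) can be sketched as follows. This is an illustrative approximation only: the helper names are hypothetical, `string.punctuation` covers ASCII punctuation only (so Spanish marks such as "¿" are not stripped), and the `low` and `high` bounds stand in for thresholds that the authors derive from development-set statistics in a way the abstract does not detail. The language-identification check and the Jensen-Shannon divergence stage are omitted.

```python
import string


def normalise(sentence):
    """Drop whitespace and ASCII punctuation so near-identical sentences collide."""
    drop = set(string.whitespace) | set(string.punctuation)
    return "".join(ch for ch in sentence if ch not in drop)


def deduplicate(pairs):
    """Keep the first (source, target) pair for each normalised source sentence."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        key = normalise(src)
        if key not in seen:
            seen.add(key)
            kept.append((src, tgt))
    return kept


def ratio_filter(pairs, low, high):
    """Keep pairs whose source/target word-count ratio lies in [low, high]."""
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_tgt and low <= n_src / n_tgt <= high:
            kept.append((src, tgt))
    return kept
```

For example, `ratio_filter(deduplicate(pairs), 0.5, 2.0)` would first collapse sources that differ only in spacing or punctuation and then discard pairs where one side has more than twice as many words as the other.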

A Comparative Study of Multi-document Summarization Techniques
Anushiya Thevapalan | Nisansa de Silva
Proceedings of the 36th Conference on Computational Linguistics and Speech Processing (ROCLING 2024)

Shoulders of Giants: A Look at the Degree and Utility of Openness in NLP Research
Surangika Ranathunga | Nisansa De Silva | Dilith Jayakody | Aloka Fernando
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We analysed a sample of NLP research papers archived in ACL Anthology as an attempt to quantify the degree of openness and the benefit of such an open culture in the NLP community. We observe that papers published in different NLP venues show different patterns related to artefact reuse. We also note that more than 30% of the papers we analysed do not release their artefacts publicly. Further, we observe a wide language-wise disparity in publicly available NLP-related artefacts.

2023

Comparative Analysis of Named Entity Recognition in the Dungeons and Dragons Domain
Gayashan Weerasundara | Nisansa de Silva
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Some Natural Language Processing (NLP) tasks that are in a sufficiently solved state for general-domain English still struggle to attain the same level of performance in specific domains. Named Entity Recognition (NER), which aims to find and categorize entities in text, is one such task that has difficulty adapting to domain specificity. This paper compares the performance of 10 NER models on 7 adventure books from the Dungeons and Dragons (D&D) domain, which is a subdomain of fantasy literature. Fantasy literature, being rich and diverse in vocabulary, poses considerable challenges for conventional NER. In this study, we use open-source Large Language Models (LLM) to annotate the named entities and character names in a number of official D&D books and evaluate the precision and distribution of each model. The paper aims to identify the challenges and opportunities for improving NER in fantasy literature. Our results show that even in the off-the-shelf configuration, Flair, Trankit, and spaCy achieve better results for identifying named entities in the D&D domain compared to their peers.

Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language
Kasun Wickramasinghe | Nisansa de Silva
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

2022

Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World
Surangika Ranathunga | Nisansa de Silva
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently. However, different facets of this problem, or the reasons behind this disparity, are seldom discussed within the NLP community. This paper provides a comprehensive analysis of the disparity that exists within the languages of the world. We show that simply categorising languages considering data availability may not always be correct. Using an existing language categorisation based on speaker population and vitality, we analyse the distribution of language data resources, amount of NLP/CL research, inclusion in multilingual web-based platforms and the inclusion in pre-trained multilingual models. We show that many languages do not get covered in these resources or platforms, and even within the languages belonging to the same language group, there is wide disparity. We analyse the impact of family, geographical location, GDP and the speaker population of languages and provide possible reasons for this disparity, along with some suggestions to overcome the same.

Learning Sentence Embeddings In The Legal Domain with Low Resource Settings
Sahan Jayasinghe | Lakith Rambukkanage | Ashan Silva | Nisansa de Silva | Shehan Perera | Madhavi Perera
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

Automatic Generation of Abstracts for Research Papers
Dushan Kumarasinghe | Nisansa de Silva
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

Summarizing has always been an important utility for reading long documents. Research papers are unique in this regard, as they have a compulsory summary in the form of the abstract at the beginning of the document, which gives the gist of the entire study, often within a set upper limit for the word count. Writing the abstract to be sufficiently succinct while being descriptive enough is a hard task even for native English speakers. This study is the first step in generating abstracts for research papers in the computational linguistics domain automatically using the domain-specific abstractive summarization power of the GPT-Neo model.

Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages
Gihan Weeraprameshwara | Vihanga Jayawickrama | Nisansa de Silva | Yudhanjaya Wijeratne
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

Synthesis and Evaluation of a Domain-specific Large Data Set for Dungeons & Dragons
Akila Peiris | Nisansa de Silva
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

pdf bib abs

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Legal Case Winning Party Prediction With Domain Specific Auxiliary Models
Sahan Jayasinghe | Lakith Rambukkanage | Ashan Silva | Nisansa de Silva | Amal Shehan Perera
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

Sifting through hundreds of old case documents to obtain information pertinent to the case at hand has been a major part of the legal profession for centuries. However, with the expansion of court systems and the compounding nature of case law, this task has become more and more intractable under time and resource constraints. Thus, automation by Natural Language Processing presents itself as a viable solution. In this paper, we discuss a novel approach for predicting the winning party of a current court case by training an analytical model on a corpus of prior court cases, which is then run on the prepared text of the current court case. This will allow legal professionals to efficiently and precisely prepare their cases to maximize the chance of victory. The model is built with and experimented using legal domain specific sub-models to provide more visibility to the final model, along with other variations. We show that our model with critical sentence annotation with a transformer encoder using RoBERTa based sentence embedding is able to obtain an accuracy of 75.75%, outperforming other models.

2021

Semantic Oppositeness Assisted Deep Contextual Modeling for Automatic Rumor Detection in Social Networks
Nisansa de Silva | Dejing Dou
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Social networks face a major challenge in the form of rumors and fake news, due to their intrinsic nature of connecting users to millions of others, and of giving any individual the power to post anything. Given the rapid, widespread dissemination of information in social networks, manually detecting suspicious news is sub-optimal. Thus, research on automatic rumor detection has become a necessity. Previous works in the domain have utilized the reply relations between posts, as well as the semantic similarity between the main post and its context, consisting of replies, in order to obtain state-of-the-art performance. In this work, we demonstrate that semantic oppositeness can improve the performance on the task of rumor detection. We show that semantic oppositeness captures elements of discord, which are not properly covered by previous efforts, which only utilize semantic similarity or reply structure. We show, with extensive experiments on recent data sets for this problem, that our proposed model achieves state-of-the-art performance. Further, we show that our model is more resistant to the variances in performance introduced by randomness.

2020

pdf bib abs

Network representation learning (NRL) is crucial in the area of graph learning. Recently, graph autoencoders and their variants have gained much attention and popularity among various types of node embedding approaches. Most existing graph autoencoder-based methods aim to minimize the reconstruction errors of the input network while not explicitly considering the semantic relatedness between nodes. In this paper, we propose a novel network embedding method which models the consistency across different views of networks. More specifically, we create a second view from the input network which captures the relation between nodes based on node content and enforce the latent representations from the two views to be consistent by incorporating a multiview adversarial regularization module. The experimental studies on benchmark datasets prove the effectiveness of this method, and demonstrate that our method compares favorably with the state-of-the-art algorithms on challenging tasks such as link prediction and node clustering. We also evaluate our method on a real-world application, i.e., 30-day unplanned ICU readmission prediction, and achieve promising results compared with several baseline methods.

Effective Approach to Develop a Sentiment Annotator For Legal Domain in a Low Resource Setting
Gathika Ratnayaka | Nisansa de Silva | Amal Shehan Perera | Ramesh Pathirana
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

2018

Fast Approach to Build an Automatic Sentiment Annotator for Legal Domain using Transfer Learning
Viraj Salaka | Menuka Warushavithana | Nisansa de Silva | Amal Shehan Perera | Gathika Ratnayaka | Thejan Rupasinghe
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

This study proposes a novel way of identifying the sentiment of the phrases used in the legal domain. The added complexity of the language used in law, and the inability of the existing systems to accurately predict the sentiments of words in law are the main motivations behind this study. This is a transfer learning approach which can be used for other domain adaptation tasks as well. The proposed methodology achieves an improvement of over 6% compared to the source model’s accuracy in the legal domain.
