Nisansa De Silva

Also published as: Nisansa de Silva


2024

Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora
Surangika Ranathunga | Nisansa De Silva | Velayuthan Menan | Aloka Fernando | Charitha Rathnayake
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We conducted a detailed analysis of the quality of web-mined corpora for two low-resource languages (making three language pairs: English-Sinhala, English-Tamil and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out an intrinsic and extrinsic evaluation on different portions of this ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.

Back to the Stats: Rescuing Low Resource Neural Machine Translation with Statistical Methods
Menan Velayuthan | Dilith Jayakody | Nisansa De Silva | Aloka Fernando | Surangika Ranathunga
Proceedings of the Ninth Conference on Machine Translation

This paper describes our submission to the WMT24 shared task for Low-Resource Languages of Spain in the Constrained task category. Due to the lack of deep learning-based data filtration methods for these languages, we propose a purely statistical, two-stage pipeline for data filtration. In the primary stage, we begin by removing spaces and punctuation from the source sentences (Spanish) and deduplicating them. We then filter out sentence pairs with inconsistent language predictions by the language identification model, followed by the removal of pairs with anomalous sentence length and word count ratios, using the development set statistics as the threshold. In the secondary stage, for corpora of significant size, we employ a Jensen-Shannon divergence-based method to curate training data of the desired size. Our filtered data allowed us to complete a two-step training process in under 3 hours, with GPU power consumption kept below 1 kWh, making our system both economical and eco-friendly. The source code, training data, and best models are available on the project’s GitHub page.
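
The filtering steps named in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the `low`/`high` ratio bounds, and the token-count representation are all hypothetical. The abstract does not say how the thresholds are derived from the development set, nor how the Jensen-Shannon divergence is used to select data, so the sketch only shows the building blocks (normalised deduplication, a word-count ratio check, and the divergence itself). The language identification step is omitted.

```python
import math
import unicodedata
from collections import Counter


def dedup_key(sentence):
    """Drop whitespace and Unicode punctuation so near-identical sentences collide."""
    return "".join(
        ch for ch in sentence
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def deduplicate(pairs):
    """Keep the first (source, target) pair for each normalised source sentence."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        key = dedup_key(src)
        if key not in seen:
            seen.add(key)
            kept.append((src, tgt))
    return kept


def ratio_ok(src, tgt, low, high):
    """Check the source/target word-count ratio against assumed bounds.

    `low` and `high` are hypothetical; the abstract only says the thresholds
    come from development set statistics.
    """
    ratio = len(src.split()) / max(len(tgt.split()), 1)
    return low <= ratio <= high


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two non-empty token Counters."""
    p_total, q_total = sum(p.values()), sum(q.values())
    div = 0.0
    for tok in set(p) | set(q):
        pi = p[tok] / p_total  # Counter returns 0 for unseen tokens
        qi = q[tok] / q_total
        mi = 0.5 * (pi + qi)
        if pi > 0:
            div += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * math.log2(qi / mi)
    return div
```

With base-2 logarithms the divergence lies between 0 (identical token distributions) and 1 (no shared tokens), which makes it convenient for comparing a candidate subset against a reference distribution.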

Shoulders of Giants: A Look at the Degree and Utility of Openness in NLP Research
Surangika Ranathunga | Nisansa De Silva | Dilith Jayakody | Aloka Fernando
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We analysed a sample of NLP research papers archived in ACL Anthology as an attempt to quantify the degree of openness and the benefit of such an open culture in the NLP community. We observe that papers published in different NLP venues show different patterns related to artefact reuse. We also note that more than 30% of the papers we analysed do not release their artefacts publicly. Further, we observe a wide language-wise disparity in publicly available NLP-related artefacts.

2023

Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language
Kasun Wickramasinghe | Nisansa de Silva
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

Comparative Analysis of Named Entity Recognition in the Dungeons and Dragons Domain
Gayashan Weerasundara | Nisansa de Silva
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Some Natural Language Processing (NLP) tasks that are in a sufficiently solved state for general domain English still struggle to attain the same level of performance in specific domains. Named Entity Recognition (NER), which aims to find and categorize entities in text, is one such task that has difficulty adapting to domain specificity. This paper compares the performance of 10 NER models on 7 adventure books from the Dungeons and Dragons (D&D) domain, which is a subdomain of fantasy literature. Fantasy literature, being rich and diverse in vocabulary, poses considerable challenges for conventional NER. In this study, we use open-source Large Language Models (LLM) to annotate the named entities and character names in each of a number of official D&D books and evaluate the precision and distribution of each model. The paper aims to identify the challenges and opportunities for improving NER in fantasy literature. Our results show that even in the off-the-shelf configuration, Flair, Trankit, and spaCy achieve better results for identifying named entities in the D&D domain compared to their peers.

2022

Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages
Gihan Weeraprameshwara | Vihanga Jayawickrama | Nisansa de Silva | Yudhanjaya Wijeratne
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

Synthesis and Evaluation of a Domain-specific Large Data Set for Dungeons & Dragons
Akila Peiris | Nisansa de Silva
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

Learning Sentence Embeddings In The Legal Domain with Low Resource Settings
Sahan Jayasinghe | Lakith Rambukkanage | Ashan Silva | Nisansa de Silva | Shehan Perera | Madhavi Perera
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

Legal Case Winning Party Prediction With Domain Specific Auxiliary Models
Sahan Jayasinghe | Lakith Rambukkanage | Ashan Silva | Nisansa de Silva | Amal Shehan Perera
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

Sifting through hundreds of old case documents to obtain information pertinent to the case at hand has been a major part of the legal profession for centuries. However, with the expansion of court systems and the compounding nature of case law, this task has become more and more intractable under time and resource constraints. Thus, automation by Natural Language Processing presents itself as a viable solution. In this paper, we discuss a novel approach for predicting the winning party of a current court case by training an analytical model on a corpus of prior court cases, which is then run on the prepared text of the current court case. This will allow legal professionals to efficiently and precisely prepare their cases to maximize the chance of victory. The model is built and experimented with using legal-domain-specific sub-models to provide more visibility to the final model, along with other variations. We show that our model with critical sentence annotation with a transformer encoder using RoBERTa-based sentence embedding is able to obtain an accuracy of 75.75%, outperforming other models.

Automatic Generation of Abstracts for Research Papers
Dushan Kumarasinghe | Nisansa de Silva
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

Summarizing has always been an important utility for reading long documents. Research papers are unique in this regard, as they have a compulsory summary in the form of the abstract at the beginning of the document, which gives the gist of the entire study, often within a set upper limit for the word count. Writing the abstract to be sufficiently succinct while being descriptive enough is a hard task even for native English speakers. This study is the first step in generating abstracts for research papers in the computational linguistics domain automatically using the domain-specific abstractive summarization power of the GPT-Neo model.

Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World
Surangika Ranathunga | Nisansa de Silva
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently. However, different facets of this problem, or the reasons behind this disparity, are seldom discussed within the NLP community. This paper provides a comprehensive analysis of the disparity that exists within the languages of the world. We show that simply categorising languages considering data availability may not always be correct. Using an existing language categorisation based on speaker population and vitality, we analyse the distribution of language data resources, amount of NLP/CL research, inclusion in multilingual web-based platforms and the inclusion in pre-trained multilingual models. We show that many languages do not get covered in these resources or platforms, and even within the languages belonging to the same language group, there is wide disparity. We analyse the impact of family, geographical location, GDP and the speaker population of languages and provide possible reasons for this disparity, along with some suggestions to overcome the same.

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer | Isaac Caswell | Lisa Wang | Ahsan Wahab | Daan van Esch | Nasanbayar Ulzii-Orshikh | Allahsera Tapo | Nishant Subramani | Artem Sokolov | Claytone Sikasote | Monang Setyawan | Supheakmungkol Sarin | Sokhar Samb | Benoît Sagot | Clara Rivera | Annette Rios | Isabel Papadimitriou | Salomey Osei | Pedro Ortiz Suarez | Iroro Orife | Kelechi Ogueji | Andre Niyongabo Rubungo | Toan Q. Nguyen | Mathias Müller | André Müller | Shamsuddeen Hassan Muhammad | Nanda Muhammad | Ayanda Mnyakeni | Jamshidbek Mirzakhalov | Tapiwanashe Matangira | Colin Leong | Nze Lawson | Sneha Kudugunta | Yacine Jernite | Mathias Jenny | Orhan Firat | Bonaventure F. P. Dossou | Sakhile Dlamini | Nisansa de Silva | Sakine Çabuk Ballı | Stella Biderman | Alessia Battisti | Ahmed Baruwa | Ankur Bapna | Pallavi Baljekar | Israel Abebe Azime | Ayodele Awokoya | Duygu Ataman | Orevaoghene Ahia | Oghenefego Ahia | Sweta Agrawal | Mofetoluwa Adeyemi
Transactions of the Association for Computational Linguistics, Volume 10

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

2021

Semantic Oppositeness Assisted Deep Contextual Modeling for Automatic Rumor Detection in Social Networks
Nisansa de Silva | Dejing Dou
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Social networks face a major challenge in the form of rumors and fake news, due to their intrinsic nature of connecting users to millions of others, and of giving any individual the power to post anything. Given the rapid, widespread dissemination of information in social networks, manually detecting suspicious news is sub-optimal. Thus, research on automatic rumor detection has become a necessity. Previous works in the domain have utilized the reply relations between posts, as well as the semantic similarity between the main post and its context, consisting of replies, in order to obtain state-of-the-art performance. In this work, we demonstrate that semantic oppositeness can improve the performance on the task of rumor detection. We show that semantic oppositeness captures elements of discord, which are not properly covered by previous efforts, which only utilize semantic similarity or reply structure. We show, with extensive experiments on recent data sets for this problem, that our proposed model achieves state-of-the-art performance. Further, we show that our model is more resistant to the variances in performance introduced by randomness.

2020

Exploiting Node Content for Multiview Graph Convolutional Network and Adversarial Regularization
Qiuhao Lu | Nisansa de Silva | Dejing Dou | Thien Huu Nguyen | Prithviraj Sen | Berthold Reinwald | Yunyao Li
Proceedings of the 28th International Conference on Computational Linguistics

Network representation learning (NRL) is crucial in the area of graph learning. Recently, graph autoencoders and their variants have gained much attention and popularity among various types of node embedding approaches. Most existing graph autoencoder-based methods aim to minimize the reconstruction errors of the input network while not explicitly considering the semantic relatedness between nodes. In this paper, we propose a novel network embedding method which models the consistency across different views of networks. More specifically, we create a second view from the input network which captures the relation between nodes based on node content and enforce the latent representations from the two views to be consistent by incorporating a multiview adversarial regularization module. The experimental studies on benchmark datasets prove the effectiveness of this method, and demonstrate that our method compares favorably with the state-of-the-art algorithms on challenging tasks such as link prediction and node clustering. We also evaluate our method on a real-world application, i.e., 30-day unplanned ICU readmission prediction, and achieve promising results compared with several baseline methods.

Effective Approach to Develop a Sentiment Annotator For Legal Domain in a Low Resource Setting
Gathika Ratnayaka | Nisansa de Silva | Amal Shehan Perera | Ramesh Pathirana
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

2018

Fast Approach to Build an Automatic Sentiment Annotator for Legal Domain using Transfer Learning
Viraj Salaka | Menuka Warushavithana | Nisansa de Silva | Amal Shehan Perera | Gathika Ratnayaka | Thejan Rupasinghe
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

This study proposes a novel way of identifying the sentiment of the phrases used in the legal domain. The added complexity of the language used in law, and the inability of the existing systems to accurately predict the sentiments of words in law are the main motivations behind this study. This is a transfer learning approach which can be used for other domain adaptation tasks as well. The proposed methodology achieves an improvement of over 6% compared to the source model’s accuracy in the legal domain.

2014

Building a WordNet for Sinhala
Indeewari Wijesiri | Malaka Gallage | Buddhika Gunathilaka | Madhuranga Lakjeewa | Daya Wimalasuriya | Gihan Dias | Rohini Paranavithana | Nisansa de Silva
Proceedings of the Seventh Global Wordnet Conference
