Tharindu Ranasinghe - ACL Anthology

Tharindu Ranasinghe

2025

Sinhala Encoder-only Language Models and Evaluation
Tharindu Ranasinghe | Hansi Hettiarachchi | Nadeesha Chathurangi Naradde Vidana Pathirana | Damith Premasiri | Lasitha Uyangodage | Isuri Nanomi Arachchige | Alistair Plum | Paul Rayson | Ruslan Mitkov
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, language models (LMs) have produced excellent results in many natural language processing (NLP) tasks. However, their effectiveness is highly dependent on available pre-training resources, which is particularly challenging for low-resource languages such as Sinhala. Furthermore, the scarcity of benchmarks to evaluate LMs is also a major concern for low-resource languages. In this paper, we address these two challenges for Sinhala by (i) collecting the largest monolingual corpus for Sinhala, (ii) training multiple LMs on this corpus and (iii) compiling the first Sinhala NLP benchmark (Sinhala-GLUE) and evaluating LMs on it. We show the Sinhala LMs trained in this paper outperform the popular multilingual LMs, such as XLM-R and existing Sinhala LMs in downstream NLP tasks. All the trained LMs are publicly available. We also make Sinhala-GLUE publicly available as a public leaderboard, and we hope that it will enable further advancements in developing and evaluating LMs for Sinhala.

MUSTS: MUltilingual Semantic Textual Similarity Benchmark
Tharindu Ranasinghe | Hansi Hettiarachchi | Constantin Orasan | Ruslan Mitkov
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Predicting semantic textual similarity (STS) is a complex and ongoing challenge in natural language processing (NLP). Over the years, researchers have developed a variety of supervised and unsupervised approaches to calculate STS automatically. Additionally, various benchmarks, which include STS datasets, have been established to consistently evaluate and compare these STS methods. However, they largely focus on high-resource languages, mixed with datasets annotated focusing on relatedness instead of similarity and containing automatically translated instances. Therefore, no dedicated benchmark for multilingual STS exists. To solve this gap, we introduce the Multilingual Semantic Textual Similarity Benchmark (MUSTS), which spans 13 languages, including low-resource languages. By evaluating more than 25 models on MUSTS, we establish the most comprehensive benchmark of multilingual STS methods. Our findings confirm that STS remains a challenging task, particularly for low-resource languages.

Proceedings of the First Workshop on Ethical Concerns in Training, Evaluating and Deploying Large Language Models
Damith Premasiri | Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the First Workshop on Ethical Concerns in Training, Evaluating and Deploying Large Language Models

Exploring the Performance of Large Language Models on Subjective Span Identification Tasks
Alphaeus Dmonte | Roland R Oruche | Tharindu Ranasinghe | Marcos Zampieri | Prasad Calyam
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Identifying relevant text spans is important for several downstream tasks in NLP, as it contributes to model explainability. While most span identification approaches rely on relatively smaller pre-trained language models like BERT, a few recent approaches have leveraged the latest generation of Large Language Models (LLMs) for the task. Current work has focused on explicit span identification like Named Entity Recognition (NER), while more subjective span identification with LLMs in tasks like Aspect-based Sentiment Analysis (ABSA) has been underexplored. In this paper, we fill this important gap by presenting an evaluation of the performance of various LLMs on text span identification in three popular tasks, namely sentiment analysis, offensive language identification, and claim verification. We explore several LLM strategies like instruction tuning, in-context learning, and chain of thought. Our results indicate underlying relationships within text aid LLMs in identifying precise text spans.

Proceedings of the First Workshop on Language Models for Low-Resource Languages
Hansi Hettiarachchi | Tharindu Ranasinghe | Paul Rayson | Ruslan Mitkov | Mohamed Gaber | Damith Premasiri | Fiona Anting Tan | Lasitha Uyangodage
Proceedings of the First Workshop on Language Models for Low-Resource Languages

Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)
Hansi Hettiarachchi | Tharindu Ranasinghe | Paul Rayson | Ruslan Mitkov | Mohamed Gaber | Damith Premasiri | Fiona Anting Tan | Lasitha Randunu Chandrakantha Uyangodage
Proceedings of the First Workshop on Language Models for Low-Resource Languages

The first Workshop on Language Models for Low-Resource Languages (LoResLM 2025) was held in conjunction with the 31st International Conference on Computational Linguistics (COLING 2025) in Abu Dhabi, United Arab Emirates. This workshop mainly aimed to provide a forum for researchers to share and discuss their ongoing work on language models (LMs) focusing on low-resource languages, following the recent advancements in neural language models and their linguistic biases towards high-resource languages. LoResLM 2025 attracted notable interest from the natural language processing (NLP) community, resulting in 35 accepted papers from 52 submissions. These contributions cover a broad range of low-resource languages from eight language families and 13 diverse research areas, paving the way for future possibilities and promoting linguistic inclusivity in NLP.

Does Machine Translation Impact Offensive Language Identification? The Case of Indo-Aryan Languages
Alphaeus Dmonte | Shrey Satapara | Rehab Alsudais | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the First Workshop on Language Models for Low-Resource Languages

The accessibility to social media platforms can be improved with the use of machine translation (MT). Non-standard features present in user-generated on social media content such as hashtags, emojis, and alternative spellings can lead to mistranslated instances by the MT systems. In this paper, we investigate the impact of MT on offensive language identification in Indo-Aryan languages. We use both original and MT datasets to evaluate the performance of various offensive language models. Our evaluation indicates that offensive language identification models achieve superior performance on original data than on MT data, and that the models trained on MT data identify offensive language more precisely on MT data than the models trained on original data.

Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
Ernesto Luis Estevanell-Valladares | Alicia Picazo-Izquierdo | Tharindu Ranasinghe | Besik Mikaberidze | Simon Ostermann | Daniil Gurgurov | Philipp Mueller | Claudia Borg | Marián Šimko
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages

ADOR: Dataset for Arabic Dialects in Hotel Reviews: A Human Benchmark for Sentiment Analysis
Maram I. Alharbi | Saad Ezzini | Hansi Hettiarachchi | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages

Arabic machine translation remains a fundamentally challenging task, primarily due to the lack of comprehensive annotated resources. This study evaluates the performance of Meta’s NLLB-200 model in translating Modern Standard Arabic (MSA) into three regional dialects: Saudi, Maghribi, and Egyptian Arabic using a manually curated dataset of hotel reviews. We applied a multi-criteria human annotation framework to assess translation correctness, dialect accuracy, and sentiment and aspect preservation. Our analysis reveals significant variation in translation quality across dialects. While sentiment and aspect preservation were generally high, dialect accuracy and overall translation fidelity were inconsistent. For Saudi Arabic, over 95% of translations required human correction, highlighting systemic issues. Maghribi outputs demonstrated better dialectal authenticity, while Egyptian translations achieved the highest reliability with the lowest correction rate and fewest multi-criteria failures. These results underscore the limitations of current multilingual models in handling informal Arabic varieties and highlight the importance of dialect-sensitive evaluation.

Evaluating Large Language Models on Sentiment Analysis in Arabic Dialects
Maram I. Alharbi | Saad Ezzini | Hansi Hettiarachchi | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Despite recent progress in large language models (LLMs), their performance on Arabic dialects remains underexplored, particularly in the context of sentiment analysis. This study presents a comparative evaluation of three LLMs, DeepSeek-R1, Qwen2.5, and LLaMA-3, on sentiment classification across Modern Standard Arabic (MSA), Saudi dialect and Darija. We construct a balanced sentiment dataset by translating and validating MSA hotel reviews into Saudi dialect and Darija. Using parameter-efficient fine-tuning (LoRA) and dialect-specific prompts, we assess each model under matched and mismatched prompting conditions. Evaluation results show that Qwen2.5 achieves the highest macro F1 score of 79% on Darija input using MSA prompts, while DeepSeek performs best when prompted in the input dialect, reaching 71% on Saudi dialect. LLaMA-3 exhibits stable performance across prompt variations, with 75% macro F1 on Darija input under MSA prompting. Dialect-aware prompting consistently improves classification accuracy, particularly for neutral and negative sentiment classes.

LLM-based Embedders for Prior Case Retrieval
Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. The state-of-the-art deep learning IR methods have not been successful in PCR due to two key challenges: i. Lengthy legal text limitation; when using the powerful BERT-based transformer models, there is a limit of input text lengths, which inevitably requires to shorten the input via truncation or division with a loss of legal context information. ii. Lack of legal training data; due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively. In this research, we address these challenges by leveraging LLM-based text embedders in PCR. LLM-based embedders support longer input lengths, and since we use them in an unsupervised manner, they do not require training data, addressing both challenges simultaneously. In this paper, we evaluate state-of-the-art LLM-based text embedders in four PCR benchmark datasets and show that they outperform BM25 and supervised transformer-based models.

Detecting Fake News in the Era of Language Models
Muhammad Irfan Fikri Sabri | Hansi Hettiarachchi | Tharindu Ranasinghe
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

The proliferation of fake news has been amplified by the advent of large language models (LLMs), which can generate highly realistic and scalable misinformation. While prior studies have focused primarily on detecting human-generated fake news, the efficacy of current models against LLM-generated content remains underexplored. We address this gap by compiling a novel dataset combining public and LLM-generated fake news, redefining detection as a ternary classification task (real, human-generated fake, LLM-generated fake), and evaluating eight diverse classification models, including traditional machine learning, fine-tuned transformers, and few-shot prompted LLMs. Our findings highlight the strengths and limitations of these models in detecting evolving LLM-generated fake news, offering insights for future detection strategies.

Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects
Maram Alharbi | Salmane Chafik | Saad Ezzini | Ruslan Mitkov | Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects

AHaSIS: Shared Task on Sentiment Analysis for Arabic Dialects
Maram I. Alharbi | Salmane Chafik | Saad Ezzini | Ruslan Mitkov | Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects

The hospitality industry in the Arab world increasingly relies on customer feedback to shape services, driving the need for advanced Arabic sentiment analysis tools. To address this challenge, the Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. This task leverages a multi-dialect, manually curated dataset derived from hotel reviews originally written in Modern Standard Arabic (MSA) and translated into Saudi and Moroccan (Darija) dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. Translations were validated by native speakers to ensure dialectal accuracy and sentiment preservation. This resource supports the development of dialect-aware NLP systems for real-world applications in customer experience analysis. More than 40 teams have registered for the shared task, with 12 submitting systems during the evaluation phase. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.

Text Generation Models for Luxembourgish with Limited Data: A Balanced Multilingual Strategy
Alistair Plum | Tharindu Ranasinghe | Christoph Purschke
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper addresses the challenges in developing language models for less-represented languages, with a focus on Luxembourgish. Despite its active development, Luxembourgish faces a digital data scarcity, exacerbated by Luxembourg’s multilingual context. We propose a novel text generation model based on the T5 architecture, combining limited Luxembourgish data with equal amounts, in terms of size and type, of German and French data. We hypothesise that a model trained on Luxembourgish, German, and French will improve the model’s cross-lingual transfer learning capabilities and outperform monolingual and large multilingual models. To verify this, the study at hand explores whether multilingual or monolingual training is more beneficial for Luxembourgish language generation. For the evaluation, we introduce LuxGen, a text generation benchmark that is the first of its kind for Luxembourgish.

2024

We report the findings of the 2024 Multilingual Lexical Simplification Pipeline shared task. We released a new dataset comprising 5,927 instances of lexical complexity prediction and lexical simplification on common contexts across 10 languages, split into trial (300) and test (5,627). 10 teams participated across 2 tracks and 10 languages with 233 runs evaluated across all systems. Five teams participated in all languages for the lexical complexity prediction task and 4 teams participated in all languages for the lexical simplification task. Teams employed a range of strategies, making use of open and closed source large language models for lexical simplification, as well as feature-based approaches for lexical complexity prediction. The highest scoring team on the combined multilingual data was able to obtain a Pearson’s correlation of 0.6241 and an ACC@1@Top1 of 0.3772, both demonstrating that there is still room for improvement on two difficult sub-tasks of the lexical simplification pipeline.

DARES: Dataset for Arabic Readability Estimation of School Materials
Mo El-Haj | Sultan Almujaiwel | Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024

This research introduces DARES, a dataset for assessing the readability of Arabic text in Saudi school materials. DARES compromise of 13335 instances from textbooks used in 2021 and contains two subtasks; (a) Coarse-grained readability assessment where the text is classified into different educational levels such as primary and secondary. (b) Fine-grained readability assessment where the text is classified into individual grades.. We fine-tuned five transformer models that support Arabic and found that CAMeLBERTmix performed the best in all input settings. Evaluation results showed high performance for the coarse-grained readability assessment task, achieving a weighted F1 score of 0.91 and a macro F1 score of 0.79. The fine-grained task achieved a weighted F1 score of 0.68 and a macro F1 score of 0.55. These findings demonstrate the potential of our approach for advancing Arabic text readability assessment in education, with implications for future innovations in the field.

What do Large Language Models Need for Machine Translation Evaluation?
Shenbin Qian | Archchana Sindhujan | Minnie Kabra | Diptesh Kanojia | Constantin Orasan | Tharindu Ranasinghe | Fred Blain
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Leveraging large language models (LLMs) for various natural language processing tasks has led to superlative claims about their performance. For the evaluation of machine translation (MT), existing research shows that LLMs are able to achieve results comparable to fine-tuned multilingual pre-trained language models. In this paper, we explore what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate MT quality. In addition, we investigate prompting techniques such as zero-shot, Chain of Thought (CoT) and few-shot prompting for eight language pairs covering high-, medium- and low-resource languages, leveraging varying LLM variants. Our findings indicate the importance of reference translations for an LLM-based evaluation. While larger models do not necessarily fare better, they tend to benefit more from CoT prompting, than smaller models. We also observe that LLMs do not always provide a numerical score when generating evaluations, which poses a question on their reliability for the task. Our work presents a comprehensive analysis for resource-constrained and training-less LLM-based evaluation of machine translation. We release the accrued prompt templates, code and data publicly for reproducibility.

Rater Cohesion and Quality from a Vicarious Perspective
Deepak Pandita | Tharindu Cyril Weerasooriya | Sujan Dutta | Sarah K. Luger | Tharindu Ranasinghe | Ashiqur R. KhudaBukhsh | Marcos Zampieri | Christopher M. Homan
Findings of the Association for Computational Linguistics: EMNLP 2024

Human feedback is essential for building human-centered AI systems across domains where disagreement is prevalent, such as AI safety, content moderation, or sentiment analysis. Many disagreements, particularly in politically charged settings, arise because raters have opposing values or beliefs. Vicarious annotation is a method for breaking down disagreement by asking raters how they think others would annotate the data. In this paper, we explore the use of vicarious annotation with analytical methods for moderating rater disagreement. We employ rater cohesion metrics to study the potential influence of political affiliations and demographic backgrounds on raters’ perceptions of offense. Additionally, we utilize CrowdTruth’s rater quality metrics, which consider the demographics of the raters, to score the raters and their annotations. We study how the rater quality metrics influence the in-group and cross-group rater cohesion across the personal and vicarious levels.

DORE: A Dataset for Portuguese Definition Generation
Anna Beatriz Dimas Furtado | Tharindu Ranasinghe | Frederic Blain | Ruslan Mitkov
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Definition modelling (DM) is the task of automatically generating a dictionary definition of a specific word. Computational systems that are capable of DM can have numerous applications benefiting a wide range of audiences. As DM is considered a supervised natural language generation problem, these systems require large annotated datasets to train the machine learning (ML) models. Several DM datasets have been released for English and other high-resource languages. While Portuguese is considered a mid/high-resource language in most natural language processing tasks and is spoken by more than 200 million native speakers, there is no DM dataset available for Portuguese. In this research, we fill this gap by introducing DORE; the first dataset for Definition MOdelling for PoRtuguEse containing more than 100,000 definitions. We also evaluate several deep learning based DM models on DORE and report the results. The dataset and the findings of this paper will facilitate research and study of Portuguese in wider contexts.

Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language
Alistair Plum | Tharindu Ranasinghe | Christoph Purschke
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Relation extraction is essential for extracting and understanding biographical information in the context of digital humanities and related subjects. There is a growing interest in the community to build datasets capable of training machine learning models to extract relationships. However, annotating such datasets can be expensive and time-consuming, in addition to being limited to English. This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German. Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset. We also create a manually annotated dataset with 2000 instances to evaluate the models and release it together with the dataset compiled using guided distant supervision. We train several state-of-the-art machine learning models on the automatically created dataset and release them as well. Furthermore, we experiment with multilingual and cross-lingual zero-shot experiments that could benefit many low-resource languages.

MentalHelp: A Multi-Task Dataset for Mental Health in Social Media
Nishat Raihan | Sadiya Sayara Chowdhury Puspo | Shafkat Farabi | Ana-Maria Bucur | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Early detection of mental health disorders is an essential step in treating and preventing mental health conditions. Computational approaches have been applied to users’ social media profiles in an attempt to identify various mental health conditions such as depression, PTSD, schizophrenia, and eating disorders. The interest in this topic has motivated the creation of various depression detection datasets. However, annotating such datasets is expensive and time-consuming, limiting their size and scope. To overcome this limitation, we present MentalHelp, a large-scale semi-supervised mental disorder detection dataset containing 14 million instances. The corpus was collected from Reddit and labeled in a semi-supervised way using an ensemble of three separate models - flan-T5, Disor-BERT, and Mental-BERT.

NSina: A News Corpus for Sinhala
Hansi Hettiarachchi | Damith Premasiri | Lasitha Randunu Chandrakantha Uyangodage | Tharindu Ranasinghe
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSina, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSina aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSina is the largest news corpus for Sinhala, available up to date.

Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security
Ruslan Mitkov | Saad Ezzini | Tharindu Ranasinghe | Ignatius Ezeani | Nouran Khallaf | Cengiz Acarturk | Matthew Bradbury | Mo El-Haj | Paul Rayson
Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security

We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. This dataset currently comprises of 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol in support of the contribution of future datasets and (2) present summary statistics on the existing data that we have gathered. Multilingual lexical simplification can be used to support low-ability readers to engage with otherwise difficult texts in their native, often low-resourced, languages.

A Federated Learning Approach to Privacy Preserving Offensive Language Identification
Marcos Zampieri | Damith Premasiri | Tharindu Ranasinghe
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

The spread of various forms of offensive speech online is an important concern in social media. While platforms have been investing heavily in ways of coping with this problem, the question of privacy remains largely unaddressed. Models trained to detect offensive language on social media are trained and/or fine-tuned using large amounts of data often stored in centralized servers. Since most social media data originates from end users, we propose a privacy preserving decentralized architecture for identifying offensive language online by introducing Federated Learning (FL) in the context of offensive language identification. FL is a decentralized architecture that allows multiple models to be trained locally without the need for data sharing hence preserving users’ privacy. We propose a model fusion approach to perform FL. We trained multiple deep learning models on four publicly available English benchmark datasets (AHSD, HASOC, HateXplain, OLID) and evaluated their performance in detail. We also present initial cross-lingual experiments in English and Spanish. We show that the proposed model fusion approach outperforms baselines in all the datasets while preserving privacy.

MultiLS: An End-to-End Lexical Simplification Framework
Kai North | Tharindu Ranasinghe | Matthew Shardlow | Marcos Zampieri
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

Lexical Simplification (LS) automatically replaces difficult to read words for easier alternatives while preserving a sentence’s original meaning. Several datasets exist for LS and each of them specialize in one or two sub-tasks within the LS pipeline. However, as of this moment, no single LS dataset has been developed that covers all LS sub-tasks. We present MultiLS, the first LS framework that allows for the creation of a multi-task LS dataset. We also present MultiLS-PT, the first dataset created using the MultiLS framework. We demonstrate the potential of MultiLS-PT by carrying out all LS sub-tasks of (1) lexical complexity prediction (LCP), (2) substitute generation, and (3) substitute ranking for Portuguese.

2023

Target-Based Offensive Language Identification
Marcos Zampieri | Skye Morgan | Kai North | Tharindu Ranasinghe | Austin Simmmons | Paridhi Khandelwal | Sara Rosenthal | Preslav Nakov
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present TBO, a new dataset for Target-based Offensive language identification. TBO contains post-level annotations regarding the harmfulness of an offensive post and token-level annotations comprising of the target and the offensive argument expression. Popular offensive language identification datasets for social media focus on annotation taxonomies only at the post level and more recently, some datasets have been released that feature only token-level annotations. TBO is an important resource that bridges the gap between post-level and token-level annotation datasets by introducing a single comprehensive unified annotation taxonomy. We use the TBO taxonomy to annotate post-level and token-level offensive language on English Twitter posts. We release an initial dataset of over 4,500 instances collected from Twitter and we carry out multiple experiments to compare the performance of different models trained and tested on TBO.

Offensive Language Identification in Transliterated and Code-Mixed Bangla
Md Nishat Raihan | Umma Tanmoy | Anika Binte Islam | Kai North | Tharindu Ranasinghe | Antonios Anastasopoulos | Marcos Zampieri
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

Identifying offensive content in social media is vital to create safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliterations and code-mixing, linguistic phenomena common in multilingual societies, and a known challenge for NLP systems. We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments. We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset. Our results show that English pre-trained transformer-based models, such as fBERT and HateBERT achieve the best performance on this dataset.

ALEXSIS+: Improving Substitute Generation and Selection for Lexical Simplification with Information Retrieval
Kai North | Alphaeus Dmonte | Tharindu Ranasinghe | Matthew Shardlow | Marcos Zampieri
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

Lexical simplification (LS) automatically replaces words that are deemed difficult to understand for a given target population with simpler alternatives, whilst preserving the meaning of the original sentence. The TSAR-2022 shared task on LS provided participants with a multilingual lexical simplification test set. It contained nearly 1,200 complex words in English, Portuguese, and Spanish and presented multiple candidate substitutions for each complex word. The competition did not make training data available; therefore, teams had to use either off-the-shelf pre-trained large language models (LLMs) or out-domain data to develop their LS systems. As such, participants were unable to fully explore the capabilities of LLMs by re-training and/or fine-tuning them on in-domain data. To address this important limitation, we present ALEXSIS+, a multilingual dataset in the aforementioned three languages, and ALEXSIS++, an English monolingual dataset that together contains more than 50,000 unique sentences retrieved from news corpora and annotated with cosine similarities to the original complex word and sentence. Using these additional contexts, we are able to generate new high-quality candidate substitutions that improve LS performance on the TSAR-2022 test set regardless of the language or model.

Vicarious Offense and Noise Audit of Offensive Speech Classifiers: Unifying Human and Machine Disagreement on What is Offensive
Tharindu Weerasooriya | Sujan Dutta | Tharindu Ranasinghe | Marcos Zampieri | Christopher Homan | Ashiqur KhudaBukhsh
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Offensive speech detection is a key component of content moderation. However, what is offensive can be highly subjective. This paper investigates how machine and human moderators disagree on what is offensive when it comes to real-world social web political discourse. We show that (1) there is extensive disagreement among the moderators (humans and machines); and (2) human and large-language-model classifiers are unable to predict how other human raters will respond, based on their political leanings. For (1), we conduct a ***noise audit*** at an unprecedented scale that combines both machine and human responses. For (2), we introduce a first-of-its-kind dataset of ***vicarious offense***. Our noise audit reveals that moderation outcomes vary wildly across different machine moderators. Our experiments with human moderators suggest that political leanings combined with sensitive issues affect both first-person and vicarious offense. The dataset is available through https://github.com/Homan-Lab/voiced.

Teacher and Student Models of Offensive Language in Social Media
Tharindu Ranasinghe | Marcos Zampieri
Findings of the Association for Computational Linguistics: ACL 2023

State-of-the-art approaches to identifying offensive language online make use of large pre-trained transformer models. However, the inference time, disk, and memory requirements of these transformer models present challenges for their wide usage in the real world. Even the distilled transformer models remain prohibitively large for many usage scenarios. To cope with these challenges, in this paper, we propose transferring knowledge from transformer models to much smaller neural models to make predictions at the token- and at the post-level. We show that this approach leads to lightweight offensive language identification models that perform on par with large transformers but with 100 times fewer parameters and much less memory usage

A Multi-task Learning Framework for Quality Estimation
Sourabh Deoghare | Paramveer Choudhary | Diptesh Kanojia | Tharindu Ranasinghe | Pushpak Bhattacharyya | Constantin Orăsan
Findings of the Association for Computational Linguistics: ACL 2023

Quality Estimation (QE) is the task of evaluating machine translation output in the absence of reference translation. Conventional approaches to QE involve training separate models at different levels of granularity viz., word-level, sentence-level, and document-level, which sometimes lead to inconsistent predictions for the same input. To overcome this limitation, we focus on jointly training a single model for sentence-level and word-level QE tasks in a multi-task learning framework. Using two multi-task learning-based QE approaches, we show that multi-task learning improves the performance of both tasks. We evaluate these approaches by performing experiments in different settings, viz., single-pair, multi-pair, and zero-shot. We compare the multi-task learning-based approach with baseline QE models trained on single tasks and observe an improvement of up to 4.28% in Pearson’s correlation (r) at sentence-level and 8.46% in F1-score at word-level, in the single-pair setting. In the multi-pair setting, we observe improvements of up to 3.04% at sentence-level and 13.74% at word-level; while in the zero-shot setting, we also observe improvements of up to 5.26% and 3.05%, respectively. We make the models proposed in this paper publically available.

Quality Estimation-Assisted Automatic Post-Editing
Sourabh Deoghare | Diptesh Kanojia | Fred Blain | Tharindu Ranasinghe | Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: EMNLP 2023

Automatic Post-Editing (APE) systems are prone to over-correction of the Machine Translation (MT) outputs. While Word-level Quality Estimation (QE) system can provide a way to curtail the over-correction, a significant performance gain has not been observed thus far by utilizing existing APE and QE combination strategies. In this paper, we propose joint training of a model on APE and QE tasks to improve the APE. Our proposed approach utilizes a multi-task learning (MTL) methodology, which shows significant improvement while treating both tasks as a ‘bargaining game’ during training. Moreover, we investigate various existing combination strategies and show that our approach achieves state-of-the-art performance for a ‘distant’ language pair, viz., English-Marathi. We observe an improvement of 1.09 TER and 1.37 BLEU points over a baseline QE-Unassisted APE system for English-Marathi, while also observing 0.46 TER and 0.62 BLEU points for English-German. Further, we discuss the results qualitatively and show how our approach helps reduce over-correction, thereby improving the APE performance. We also observe that the degree of integration between QE and APE directly correlates with the APE performance gain. We release our code and models publicly.

A Text-to-Text Model for Multilingual Offensive Language Identification
Tharindu Ranasinghe | Marcos Zampieri
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

Deep Learning Approaches to Detecting Safeguarding Concerns in Schoolchildren’s Online Conversations
Emma Franklin | Tharindu Ranasinghe
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

For school teachers and Designated Safeguarding Leads (DSLs), computers and other school-owned communication devices are both indispensable and deeply worrisome. For their education, children require access to the Internet, as well as a standard institutional ICT infrastructure, including e-mail and other forms of online communication technology. Given the sheer volume of data being generated and shared on a daily basis within schools, most teachers and DSLs can no longer monitor the safety and wellbeing of their students without the use of specialist safeguarding software. In this paper, we experiment with the use of state-of-the-art neural network models on the modelling of a dataset of almost 9,000 anonymised child-generated chat messages on the Microsoft Teams platform. The data was manually classified into eight fine-grained classes of safeguarding concerns (or false alarms) that a monitoring program would be interested in, and these were further split into two binary classes: true positives (real safeguarding concerns) and false positives (false alarms). For the fine grained classification, our models achieved a macro F1 score of 73.56, while for the binary classification, we achieved a macro F1 score of 87.32. This first experiment into the use of Deep Learning for detecting safeguarding concerns represents an important step towards achieving high-accuracy and reliable monitoring information for busy teachers and safeguarding leads.

Explainable Event Detection with Event Trigger Identification as Rationale Extraction
Hansi Hettiarachchi | Tharindu Ranasinghe
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Most event detection methods act at the sentence-level and focus on identifying sentences related to a particular event. However, identifying certain parts of a sentence that act as event triggers is also important and more challenging, especially when dealing with limited training data. Previous event detection attempts have considered these two tasks separately and have developed different methods. We hypothesise that similar to humans, successful sentence-level event detection models rely on event triggers to predict sentence-level labels. By exploring feature attribution methods that assign relevance scores to the inputs to explain model predictions, we study the behaviour of state-of-the-art sentence-level event detection models and show that explanations (i.e. rationales) extracted from these models can indeed be used to detect event triggers. We, therefore, (i) introduce a novel weakly-supervised method for event trigger detection; and (ii) propose to use event triggers as an explainable measure in sentence-level event detection. To the best of our knowledge, this is the first explainable machine learning approach to event trigger identification.

Can Model Fusing Help Transformers in Long Document Classification? An Empirical Study
Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Text classification is an area of research which has been studied over the years in Natural Language Processing (NLP). Adapting NLP to multiple domains has introduced many new challenges for text classification and one of them is long document classification. While state-of-the-art transformer models provide excellent results in text classification, most of them have limitations in the maximum sequence length of the input sequence. The majority of the transformer models are limited to 512 tokens, and therefore, they struggle with long document classification problems. In this research, we explore on employing Model Fusing for long document classification while comparing the results with well-known BERT and Longformer architectures.

Deep Learning Methods for Identification of Multiword Flower and Plant Names
Damith Premasiri | Amal Haddad Haddad | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Multiword Terms (MWTs) are domain-specific Multiword Expressions (MWE) where two or more lexemes converge to form a new unit of meaning. The task of processing MWTs is crucial in many Natural Language Processing (NLP) applications, including Machine Translation (MT) and terminology extraction. However, the automatic detection of those terms is a difficult task and more research is still required to give more insightful and useful results in this field. In this study, we seek to fill this gap using state-of-the-art transformer models. We evaluate both BERT like discriminative transformer models and generative pre-trained transformer (GPT) models on this task, and we show that discriminative models perform better than current GPT models in multi-word terms identification task in flower and plant names in English and Spanish languages. Best discriminate models perform 94.3127%, 82.1733% F1 scores in English and Spanish data, respectively while ChatGPT could only perform 63.3183% and 47.7925% respectively.

Publish or Hold? Automatic Comment Moderation in Luxembourgish News Articles
Tharindu Ranasinghe | Alistair Plum | Christoph Purschke | Marcos Zampieri
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Recently, the internet has emerged as the primary platform for accessing news. In the majority of these news platforms, the users now have the ability to post comments on news articles and engage in discussions on various social media. While these features promote healthy conversations among users, they also serve as a breeding ground for spreading fake news, toxic discussions and hate speech. Moderating or removing such content is paramount to avoid unwanted consequences for the readers. How- ever, apart from a few notable exceptions, most research on automatic moderation of news article comments has dealt with English and other high resource languages. This leaves under-represented or low-resource languages at a loss. Addressing this gap, we perform the first large-scale qualitative analysis of more than one million Luxembourgish comments posted over the course of 14 years. We evaluate the performance of state-of-the-art transformer models in Luxembourgish news article comment moderation. Furthermore, we analyse how the language of Luxembourgish news article comments has changed over time. We observe that machine learning models trained on old comments do not perform well on recent data. The findings in this work will be beneficial in building news comment moderation systems for many low-resource languages

SurreyAI 2023 Submission for the Quality Estimation Shared Task
Archchana Sindhujan | Diptesh Kanojia | Constantin Orasan | Tharindu Ranasinghe
Proceedings of the Eighth Conference on Machine Translation

Quality Estimation (QE) systems are important in situations where it is necessary to assess the quality of translations, but there is no reference available. This paper describes the approach adopted by the SurreyAI team for addressing the Sentence-Level Direct Assessment shared task in WMT23. The proposed approach builds upon the TransQuest framework, exploring various autoencoder pre-trained language models within the MonoTransQuest architecture using single and ensemble settings. The autoencoder pre-trained language models employed in the proposed systems are XLMV, InfoXLM-large, and XLMR-large. The evaluation utilizes Spearman and Pearson correlation coefficients, assessing the relationship between machine-predicted quality scores and human judgments for 5 language pairs (English-Gujarati, English-Hindi, English-Marathi, English-Tamil and English-Telugu). The MonoTQ-InfoXLM-large approach emerges as a robust strategy, surpassing all other individual models proposed in this study by significantly improving over the baseline for the majority of the language pairs.

2022

ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification
Kai North | Marcos Zampieri | Tharindu Ranasinghe
Proceedings of the 29th International Conference on Computational Linguistics

Lexical simplification (LS) is the task of automatically replacing complex words for easier ones making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their potential substitutions. To continue improving the performance of LS systems we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS-ES protocol for Spanish opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated three models for substitute generation on this dataset, namely mBERT, XLM-R, and BERTimbau. The latter achieved the highest performance across all evaluation metrics.

DTW at Qur’an QA 2022: Utilising Transfer Learning with Transformers for Question Answering in a Low-resource Domain
Damith Premasiri | Tharindu Ranasinghe | Wajdi Zaghouani | Ruslan Mitkov
Proceedinsg of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

The task of machine reading comprehension (MRC) is a useful benchmark to evaluate the natural language understanding of machines. It has gained popularity in the natural language processing (NLP) field mainly due to the large number of datasets released for many languages. However, the research in MRC has been understudied in several domains, including religious texts. The goal of the Qur’an QA 2022 shared task is to fill this gap by producing state-of-the-art question answering and reading comprehension research on Qur’an. This paper describes the DTW entry to the Quran QA 2022 shared task. Our methodology uses transfer learning to take advantage of available Arabic MRC data. We further improve the results using various ensemble learning strategies. Our approach provided a partial Reciprocal Rank (pRR) score of 0.49 on the test set, proving its strong performance on the task.

GMU-WLV at TSAR-2022 Shared Task: Evaluating Lexical Simplification Models
Kai North | Alphaeus Dmonte | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

This paper describes team GMU-WLV submission to the TSAR shared-task on multilingual lexical simplification. The goal of the task is to automatically provide a set of candidate substitutions for complex words in context. The organizers provided participants with ALEXSIS a manually annotated dataset with instances split between a small trial set with a dozen instances in each of the three languages of the competition (English, Portuguese, Spanish) and a test set with over 300 instances in the three aforementioned languages. To cope with the lack of training data, participants had to either use alternative data sources or pre-trained language models. We experimented with monolingual models: BERTimbau, ELECTRA, and RoBERTA-largeBNE. Our best system achieved 1st place out of sixteen systems for Portuguese, 8th out of thirty-three systems for English, and 6th out of twelve systems for Spanish.

2021

An Exploratory Analysis of Multilingual Word-Level Quality Estimation with Cross-Lingual Transformers
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Most studies on word-level Quality Estimation (QE) of machine translation focus on language-specific models. The obvious disadvantages of these approaches are the need for labelled data for each language pair and the high cost required to maintain several language-specific models. To overcome these problems, we explore different approaches to multilingual, word-level QE. We show that multilingual QE models perform on par with the current language-specific models. In the cases of zero-shot and few-shot QE, we demonstrate that it is possible to accurately predict word-level quality for any given new language pair from models trained on other language pairs. Our findings suggest that the word-level QE models based on powerful pre-trained transformers that we propose in this paper generalise well across languages, making them more useful in real-world scenarios.

Evaluating the state-of-the-art event detection systems on determining spatio-temporal distribution of the events on the ground is performed unfrequently. But, the ability to both (1) extract events “in the wild” from text and (2) properly evaluate event detection systems has potential to support a wide variety of tasks such as monitoring the activity of socio-political movements, examining media coverage and public support of these movements, and informing policy decisions. Therefore, we study performance of the best event detection systems on detecting Black Lives Matter (BLM) events from tweets and news articles. The murder of George Floyd, an unarmed Black man, at the hands of police officers received global attention throughout the second half of 2020. Protests against police violence emerged worldwide and the BLM movement, which was once mostly regulated to the United States, was now seeing activity globally. This shared task asks participants to identify BLM related events from large unstructured data sources, using systems pretrained to extract socio-political events from text. We evaluate several metrics, accessing each system’s ability to identify protest events both temporally and spatially. Results show that identifying daily protest counts is an easier task than classifying spatial and temporal protest trends simultaneously, with maximum performance of 0.745 and 0.210 (Pearson r), respectively. Additionally, all baselines and participant systems suffered from low recall, with a maximum recall of 5.08.

fBERT: A Neural Transformer for Identifying Offensive Content
Diptanu Sarkar | Marcos Zampieri | Tharindu Ranasinghe | Alexander Ororbia
Findings of the Association for Computational Linguistics: EMNLP 2021

Transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance across various NLP tasks including the identification of offensive language and hate speech, an important problem in social media. In this paper, we present fBERT, a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive instances. We evaluate fBERT’s performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID. The fBERT model will be made freely available to the community.

WLV-RIT at GermEval 2021: Multitask Learning with Transformers to Detect Toxic, Engaging, and Fact-Claiming Comments
Skye Morgan | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments

This paper addresses the identification of toxic, engaging, and fact-claiming comments on social media. We used the dataset made available by the organizers of the GermEval2021 shared task containing over 3,000 manually annotated Facebook comments in German. Considering the relatedness of the three tasks, we approached the problem using large pre-trained transformer models and multitask learning. Our results indicate that multitask learning achieves performance superior to the more common single task learning approach in all three tasks. We submit our best systems to GermEval-2021 under the team name WLV-RIT.

MUDES: Multilingual Detection of Offensive Spans
Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

The interest in offensive content identification in social media has grown substantially in recent years. Previous work has dealt mostly with post level annotations. However, identifying offensive spans is useful in many ways. To help coping with this important challenge, we present MUDES, a multilingual system to detect offensive spans in texts. MUDES features pre-trained models, a Python API for developers, and a user-friendly web-based interface. A detailed description of MUDES’ components is presented in this paper.

Transformers to Fight the COVID-19 Infodemic
Lasitha Uyangodage | Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

The massive spread of false information on social media has become a global risk especially in a global pandemic situation like COVID-19. False information detection has thus become a surging research topic in recent months. NLP4IF-2021 shared task on fighting the COVID-19 infodemic has been organised to strengthen the research in false information detection where the participants are asked to predict seven different binary labels regarding false information in a tweet. The shared task has been organised in three languages; Arabic, Bulgarian and English. In this paper, we present our approach to tackle the task objective using transformers. Overall, our approach achieves a 0.707 mean F1 score in Arabic, 0.578 mean F1 score in Bulgarian and 0.864 mean F1 score in English ranking 4^th place in all the languages.

Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi
Saurabh Sampatrao Gaikwad | Tharindu Ranasinghe | Marcos Zampieri | Christopher Homan
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-short and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.

Can Multilingual Transformers Fight the COVID-19 Infodemic?
Lasitha Uyangodage | Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

The massive spread of false information on social media has become a global risk especially in a global pandemic situation like COVID-19. False information detection has thus become a surging research topic in recent months. In recent years, supervised machine learning models have been used to automatically identify false information in social media. However, most of these machine learning models focus only on the language they were trained on. Given the fact that social media platforms are being used in different languages, managing machine learning models for each and every language separately would be chaotic. In this research, we experiment with multilingual models to identify false information in social media by using two recently released multilingual false information detection datasets. We show that multilingual models perform on par with the monolingual models and sometimes even better than the monolingual models to detect false information in social media making them more useful in real-world scenarios.

TransWiC at SemEval-2021 Task 2: Transformer-based Multilingual and Cross-lingual Word-in-Context Disambiguation
Hansi Hettiarachchi | Tharindu Ranasinghe
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

Identifying whether a word carries the same meaning or different meaning in two contexts is an important research area in natural language processing which plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. Most of the previous work in this area rely on language-specific resources making it difficult to generalise across languages. Considering this limitation, our approach to SemEval-2021 Task 2 is based only on pretrained transformer models and does not use any language-specific processing and resources. Despite that, our best model achieves 0.90 accuracy for English-English subtask which is very compatible compared to the best result of the subtask; 0.93 accuracy. Our approach also achieves satisfactory results in other monolingual and cross-lingual language pairs as well.

WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans
Tharindu Ranasinghe | Diptanu Sarkar | Marcos Zampieri | Alexander Ororbia
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms. In response, social media platforms have worked on developing automatic detection methods and employing human moderators to cope with this deluge of offensive content. While various state-of-the-art statistical models have been applied to detect toxic posts, there are only a few studies that focus on detecting the words or expressions that make a post offensive. This motivates the organization of the SemEval-2021 Task 5: Toxic Spans Detection competition, which has provided participants with a dataset containing toxic spans annotation in English posts. In this paper, we present the WLV-RIT entry for the SemEval-2021 Task 5. Our best performing neural transformer model achieves an 0.68 F1-Score. Furthermore, we develop an open-source framework for multilingual detection of offensive spans, i.e., MUDES, based on neural transformers that detect toxic spans in texts.

Comparing Approaches to Dravidian Language Identification
Tommi Jauhiainen | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop. The DLI training set includes 16,674 YouTube comments written in Roman script containing code-mixed text with English and one of the three South Dravidian languages: Kannada, Malayalam, and Tamil. We submitted results generated using two models, a Naive Bayes classifier with adaptive language models, which has shown to obtain competitive performance in many language and dialect identification tasks, and a transformer-based model which is widely regarded as the state-of-the-art in a number of NLP tasks. Our first submission was sent in the closed submission track using only the training set provided by the shared task organisers, whereas the second submission is considered to be open as it used a pretrained model trained with external data. Our team attained shared second position in the shared task with the submission based on Naive Bayes. Our results reinforce the idea that deep learning methods are not as competitive in language identification related tasks as they are in many other text classification tasks.

Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation
Diptesh Kanojia | Marina Fomicheva | Tharindu Ranasinghe | Frédéric Blain | Constantin Orăsan | Lucia Specia
Proceedings of the Sixth Conference on Machine Translation

Current Machine Translation (MT) systems achieve very good results on a growing variety of language pairs and datasets. However, they are known to produce fluent translation outputs that can contain important meaning errors, thus undermining their reliability in practice. Quality Estimation (QE) is the task of automatically assessing the performance of MT systems at test time. Thus, in order to be useful, QE systems should be able to detect such errors. However, this ability is yet to be tested in the current evaluation practices, where QE systems are assessed only in terms of their correlation with human judgements. In this work, we bridge this gap by proposing a general methodology for adversarial testing of QE for MT. First, we show that despite a high correlation with human judgements achieved by the recent SOTA, certain types of meaning errors are still problematic for QE to detect. Second, we show that on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance, thus potentially allowing for comparing QE systems without relying on manual quality annotation.

2020

TransQuest: Translation Quality Estimation with Cross-lingual Transformers
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the 28th International Conference on Computational Linguistics

Recent years have seen big advances in the field of sentence-level quality estimation (QE), largely as a result of using neural-based architectures. However, the majority of these methods work only on the language pair they are trained on and need retraining for new language pairs. This process can prove difficult from a technical point of view and is usually computationally expensive. In this paper we propose a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. Our evaluation shows that the proposed methods achieve state-of-the-art results outperforming current open-source quality estimation frameworks when trained on datasets from WMT. In addition, the framework proves very useful in transfer learning settings, especially when dealing with low-resourced languages, allowing us to obtain very competitive results.

Intelligent Translation Memory Matching and Retrieval with Sentence Encoders
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Matching and retrieving previously translated segments from the Translation Memory is a key functionality in Translation Memories systems. However this matching and retrieving process is still limited to algorithms based on edit distance which we have identified as a major drawback in Translation Memories systems. In this paper, we introduce sentence encoders to improve matching and retrieving process in Translation Memories systems - an effective and efficient solution to replace edit distance-based algorithms.

Multilingual Offensive Language Identification with Cross-lingual Embeddings
Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g. hate speech, cyberbulling, and cyberaggression). The clear majority of these studies deal with English partially because most annotated datasets available contain English data. In this paper, we take advantage of English data available by applying cross-lingual contextual word embeddings and transfer learning to make predictions in languages with less resources. We project predictions on comparable data in Bengali, Hindi, and Spanish and we report results of 0.8415 F1 macro for Bengali, 0.8568 F1 macro for Hindi, and 0.7513 F1 macro for Spanish. Finally, we show that our approach compares favorably to the best systems submitted to recent shared tasks on these three languages, confirming the robustness of cross-lingual contextual embeddings and transfer learning for this task.

Offensive Language Identification in Greek
Zesis Pitenis | Marcos Zampieri | Tharindu Ranasinghe
Proceedings of the Twelfth Language Resources and Evaluation Conference

As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types: cyberbullying, hate speech, aggression, etc. With a few notable exceptions, most research on this topic so far has dealt with English. This is mostly due to the availability of language resources for English. To address this shortcoming, this paper presents the first Greek annotated dataset for offensive language identification: the Offensive Greek Tweet Dataset (OGTD). OGTD is a manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive. Along with a detailed description of the dataset, we evaluate several computational models trained and tested on this data.

BRUMS at SemEval-2020 Task 3: Contextualised Embeddings for Predicting the (Graded) Effect of Context in Word Similarity
Hansi Hettiarachchi | Tharindu Ranasinghe
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents the team BRUMS submission to SemEval-2020 Task 3: Graded Word Similarity in Context. The system utilises state-of-the-art contextualised word embeddings, which have some task-specific adaptations, including stacked embeddings and average embeddings. Overall, the approach achieves good evaluation scores across all the languages, while maintaining simplicity. Following the final rankings, our approach is ranked within the top 5 solutions of each language while preserving the 1st position of Finnish subtask 2.

RGCL at SemEval-2020 Task 6: Neural Approaches to DefinitionExtraction
Tharindu Ranasinghe | Alistair Plum | Constantin Orasan | Ruslan Mitkov
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents the RGCL team submission to SemEval 2020 Task 6: DeftEval, subtasks 1 and 2. The system classifies definitions at the sentence and token levels. It utilises state-of-the-art neural network architectures, which have some task-specific adaptations, including an automatically extended training set. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility in architecture selection.

BRUMS at SemEval-2020 Task 12: Transformer Based Multilingual Offensive Language Identification in Social Media
Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we describe the team BRUMS entry to OffensEval 2: Multilingual Offensive Language Identification in Social Media in SemEval-2020. The OffensEval organizers provided participants with annotated datasets containing posts from social media in Arabic, Danish, English, Greek and Turkish. We present a multilingual deep learning model to identify offensive language in social media. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility between languages.

TransQuest at WMT2020: Sentence-Level Direct Assessment
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the Fifth Conference on Machine Translation

This paper presents the team TransQuest’s participation in Sentence-Level Direct Assessment shared task in WMT 2020. We introduce a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. The proposed methods achieve state-of-the-art results surpassing the results obtained by OpenKiwi, the baseline used in the shared task. We further fine tune the QE framework by performing ensemble and data augmentation. Our approach is the winning solution in all of the language pairs according to the WMT 2020 official results.

InfoMiner at WNUT-2020 Task 2: Transformer-based Covid-19 Informative Tweet Extraction
Hansi Hettiarachchi | Tharindu Ranasinghe
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Identifying informative tweets is an important step when building information extraction systems based on social media. WNUT-2020 Task 2 was organised to recognise informative tweets from noise tweets. In this paper, we present our approach to tackle the task objective using transformers. Overall, our approach achieves 10th place in the final rankings scoring 0.9004 F1 score for the test set.

2019

Emoji Powered Capsule Network to Detect Type and Target of Offensive Posts in Social Media
Hansi Hettiarachchi | Tharindu Ranasinghe
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper describes a novel research approach to detect type and target of offensive posts in social media using a capsule network. The input to the network was character embeddings combined with emoji embeddings. The approach was evaluated on all three subtasks in Task 6 - SemEval 2019: OffensEval: Identifying and Categorizing Offensive Language in Social Media. The evaluation also showed that even though the capsule networks have not been used commonly in natural language processing tasks, they can outperform existing state of the art solutions for offensive language detection in social media.

Toponym Detection in the Bio-Medical Domain: A Hybrid Approach with Deep Learning
Alistair Plum | Tharindu Ranasinghe | Constantin Orasan
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.

Enhancing Unsupervised Sentence Similarity Methods with Deep Contextualised Word Representations
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Calculating Semantic Textual Similarity (STS) plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. All modern state of the art STS methods rely on word embeddings one way or another. The recently introduced contextualised word embeddings have proved more effective than standard word embeddings in many natural language processing tasks. This paper evaluates the impact of several contextualised word embeddings on unsupervised STS methods and compares it with the existing supervised/unsupervised STS methods for different datasets in different languages and different domains

Semantic Textual Similarity with Siamese Neural Networks
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Calculating the Semantic Textual Similarity (STS) is an important research area in natural language processing which plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. This paper evaluates Siamese recurrent architectures, a special type of neural networks, which are used here to measure STS. Several variants of the architecture are compared with existing methods

RGCL-WLV at SemEval-2019 Task 12: Toponym Detection
Alistair Plum | Tharindu Ranasinghe | Pablo Calleja | Constantin Orăsan | Ruslan Mitkov
Proceedings of the 13th International Workshop on Semantic Evaluation

This article describes the system submitted by the RGCL-WLV team to the SemEval 2019 Task 12: Toponym resolution in scientific papers. The system detects toponyms using a bootstrapped machine learning (ML) approach which classifies names identified using gazetteers extracted from the GeoNames geographical database. The paper evaluates the performance of several ML classifiers, as well as how the gazetteers influence the accuracy of the system. Several runs were submitted. The highest precision achieved for one of the submissions was 89%, albeit it at a relatively low recall of 49%.

Co-authors

Alistair Plum 7

Diptesh Kanojia 5

Frédéric Blain 4

Alphaeus Dmonte 4

Nishat Raihan 4

Matthew Shardlow 4

Lasitha Uyangodage 4

Maram I. Alharbi 3

Christoph Purschke 3

Fernando Alva-Manchego 2

Riza Theresa Batista-Navarro 2

Pushpak Bhattacharyya 2

Saul Calderon-Ramirez 2

Salmane Chafik 2

Sourabh Deoghare 2

Thomas François 2

Mohamed Gaber 2

Akio Hayakawa 2

Christopher Homan 2

Andrea Horbach 2

Anna Hülsing 2

Joseph Marvin Imperial 2

Laura Occhipinti 2

Alexander Ororbia 2

Nelson Peréz Rojas 2

Horacio Saggion 2

Martin Solis Salazar 2

Diptanu Sarkar 2

Archchana Sindhujan 2

Fiona Anting Tan 2

Lasitha Randunu Chandrakantha Uyangodage 2

Cengiz Acartürk 1

Maram Alharbi 1

Sultan Almujaiwel 1

Rehab Alsudais 1

Antonios Anastasopoulos 1

Martin Andrews 1

Isuri Nanomi Arachchige 1

Nora Aranberri 1

Dennis Atzenhofer 1

Matthew Bradbury 1

Judith Brenner 1

Ana-Maria Bucur 1

Pablo Calleja 1

Prasad Calyam 1

Paramveer Choudhary 1

Brenda Curtis 1

Anna Beatriz Dimas Furtado 1

Ernesto Luis Estevanell Valladares 1

Ignatius Ezeani 1

Shafkat Farabi 1

Marina Fomicheva 1

Mikel L. Forcada 1

Emma Franklin 1

Saurabh Sampatrao Gaikwad 1

Salvatore Giorgi 1

Daniil Gurgurov 1

Amal Haddad Haddad 1

Christopher M. Homan 1

Ali Hürriyetoğlu 1

Anika Binte Islam 1

Tommi Jauhiainen 1

Nouran Khallaf 1

Paridhi Khandelwal 1

Ashiqur R. KhudaBukhsh 1

Ashiqur Khudabukhsh 1

Maarit Koponen 1

Sirkku Latomaa 1

Sarah K. Luger 1

Beso Mikaberidze 1

Mikhail Mikhailov 1

Philipp Mueller 1

Preslav Nakov 1

Mara Nunziatini 1

Mary Nurminen 1

Roland R Oruche 1

Simon Ostermann 1

Deepak Pandita 1

Carla Parra Escartín 1

Nadeesha Chathurangi Naradde Vidana Pathirana 1

Alicia Picazo-Izquierdo 1

Zesis Pitenis 1

Maja Popović 1

Sadiya Sayara Chowdhury Puspo 1

Francesco Ignazio Re 1

Sara Rosenthal 1

Muhammad Irfan Fikri Sabri 1

Shrey Satapara 1

Carolina Scarton 1

Frederike Schierl 1

Austin Simmmons 1

Nicolas Stefanovitch 1

Niklas Stoehr 1

Eva Vanmassenhove 1

Sergi Àlvarez Vidal 1

Tharindu Weerasooriya 1

Tharindu Cyril Weerasooriya 1

Wajdi Zaghouani 1

Vanni Zavarella 1

Sanja Štajner 1

Venues