Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Ruslan Mitkov, Galia Angelova (Editors)

Varna, Bulgaria
INCOMA Ltd., Shoumen, Bulgaria

Bipol: Multi-Axes Evaluation of Bias with Explainability in Benchmark Datasets
Tosin Adewumi | Isabella Södergren | Lama Alkhaled | Sana Al-azzawi | Foteini Simistira Liwicki | Marcus Liwicki

We investigate five English NLP benchmark datasets (on the superGLUE leaderboard) and two Swedish datasets for bias, along multiple axes. The datasets are the following: Boolean Question (Boolq), CommitmentBank (CB), Winograd Schema Challenge (WSC), Winogender diagnostic (AXg), Recognising Textual Entailment (RTE), Swedish CB, and SWEDN. Bias can be harmful and it is known to be common in data, which ML models learn from. In order to mitigate bias in data, it is crucial to be able to estimate it objectively. We use bipol, a novel multi-axes bias metric with explainability, to estimate and explain how much bias exists in these datasets. Multilingual, multi-axes bias evaluation is not very common. Hence, we also contribute a new, large Swedish bias-labelled dataset (of 2 million samples), translated from the English version, and train the SotA mT5 model on it. In addition, we contribute new multi-axes lexica for bias detection in Swedish. We make the code, model, and new dataset publicly available.

Automatically Generating Hindi Wikipedia Pages Using Wikidata as a Knowledge Graph: A Domain-Specific Template Sentences Approach
Aditya Agarwal | Radhika Mamidi

This paper presents a method for automatically generating Wikipedia articles in the Hindi language, using Wikidata as a knowledge base. Our method extracts structured information from Wikidata, such as the names of entities, their properties, and their relationships, and then uses this information to generate natural language text that conforms to a set of templates designed for the domain of interest. We evaluate our method by generating articles about scientists, and we compare the resulting articles to machine-translated articles. Our results show that more than 70% of the generated articles using our method are better in terms of coherence, structure, and readability. Our approach has the potential to significantly reduce the time and effort required to create Wikipedia articles in Hindi and could be extended to other languages and domains as well.

Cross-lingual Classification of Crisis-related Tweets Using Machine Translation
Shareefa Al Amer | Mark Lee | Phillip Smith

Utilisation of multilingual language models such as mBERT and XLM-RoBERTa has increasingly gained attention in recent work by exploiting the multilingualism of such models in different downstream tasks across different languages. However, performance degradation is expected in transfer learning across languages compared to monolingual performance although it is an acceptable trade-off considering the sparsity of resources and lack of available training data in low-resource languages. In this work, we study the effect of machine translation on the cross-lingual transfer learning in a crisis event classification task. Our experiments include measuring the effect of machine-translating the target data into the source language and vice versa. We evaluated and compared the performance in terms of accuracy and F1-Score. The results show that translating the source data into the target language improves the prediction accuracy by 14.8% and the Weighted Average F1-Score by 19.2% when compared to zero-shot transfer to an unseen language.

Lexicon-Driven Automatic Sentence Generation for the Skills Section in a Job Posting
Vera Aleksic | Mona Brems | Anna Mathes | Theresa Bertele

This paper presents a sentence generation pipeline as implemented on the online job board Stepstone. The goal is to automatically create a set of sentences for the candidate profile and the task description sections in a job ad, related to a given input skill. They must cover two different “tone of voice” variants in German (Du, Sie), three experience levels (junior, mid, senior), and two optionality values (skill is mandatory or optional/nice to have). The generation process considers the difference between soft skills, natural language competencies and hard skills, as well as more specific sub-categories such as IT skills, programming languages and similar. To create grammatically consistent text, morphosyntactic features from the proprietary skill ontology and lexicon are consulted. The approach is a lexicon-driven generation process that compares all lexical features of the new input skills with the ones already added to the sentence database and creates new sentences according to the corresponding templates.

Multilingual Racial Hate Speech Detection Using Transfer Learning
Abinew Ali Ayele | Skadi Dinter | Seid Muhie Yimam | Chris Biemann

The rise of social media eases the spread of hateful content, especially racist content with severe consequences. In this paper, we analyze the tweets targeting the death of George Floyd in May 2020 as the event accelerated debates on racism globally. We focus on the tweets published in French for a period of one month since the death of Floyd. Using the Yandex Toloka platform, we annotate the tweets into categories as hate, offensive or normal. Tweets that are offensive or hateful are further annotated as racial or non-racial. We build French hate speech detection models based on the multilingual BERT and CamemBERT and apply transfer learning by fine-tuning the HateXplain model. We compare different approaches to resolve annotation ties and find that the detection model based on CamemBERT yields the best results in our experiments.

Exploring Amharic Hate Speech Data Collection and Classification Approaches
Abinew Ali Ayele | Seid Muhie Yimam | Tadesse Destaw Belay | Tesfa Asfaw | Chris Biemann

In this paper, we present a study of efficient data selection and annotation strategies for Amharic hate speech. We also build various classification models and investigate the challenges of hate speech data selection, annotation, and classification for the Amharic language. From a total of over 18 million tweets in our Twitter corpus, 15.1k tweets are annotated by two independent native speakers, and a Cohen’s kappa score of 0.48 is achieved. A third annotator, a curator, is also employed to decide on the final gold labels. We employ both classical machine learning and deep learning approaches, which include fine-tuning AmFLAIR and AmRoBERTa contextual embedding models. Among all the models, AmFLAIR achieves the best performance with an F1-score of 72%. We publicly release the annotation guidelines, keywords/lexicon entries, datasets, models, and associated scripts with a permissive license.

Bhojpuri WordNet: Problems in Translating Hindi Synsets into Bhojpuri
Imran Ali | Praveen Gatla

Today, artificial intelligence systems are incredibly intelligent; however, they lack the human-like capacity for understanding. In this context, sense-based lexical resources become a requirement for artificially intelligent machines. Lexical resources like Wordnets have received scholarly attention because they are considered crucial sense-based resources in the field of natural language understanding. They can help in knowing the intended meaning of the communicated texts, as they are focused on the concept rather than the words. Wordnets are available only for 18 Indian languages. Keeping this in mind, we have initiated the development of a comprehensive wordnet for Bhojpuri. The present paper describes the creation of the synsets of Bhojpuri and discusses the problems that we faced while translating Hindi synsets into Bhojpuri. These include lexical anomalies, lexical mismatch words, synthesized forms, and a lack of technical words. Nearly 4000 Hindi synsets were mapped to their equivalent synsets in Bhojpuri following the expansion approach. We have also worked on the language-specific synsets, which are unique to Bhojpuri. This resource is useful in machine translation, sentiment analysis, word sense disambiguation, cross-lingual references among Indian languages, and Bhojpuri language teaching and learning.

3D-EX: A Unified Dataset of Definitions and Dictionary Examples
Fatemah Almeman | Hadi Sheikhi | Luis Espinosa Anke

Definitions are a fundamental building block in lexicography, linguistics and computational semantics. In NLP, they have been used for retrofitting word embeddings or augmenting contextual representations in language models. However, lexical resources containing definitions exhibit a wide range of properties, which has implications in the behaviour of models trained and evaluated on them. In this paper, we introduce 3D-EX, a dataset that aims to fill this gap by combining well-known English resources into one centralized knowledge repository in the form of <term, definition, example> triples. 3D-EX is a unified evaluation framework with carefully pre-computed train/validation/test splits to prevent memorization. We report experimental results that suggest that this dataset could be effectively leveraged in downstream NLP tasks. Code and data are available at

Are You Not moved? Incorporating Sensorimotor Knowledge to Improve Metaphor Detection
Ghadi Alnafesah | Phillip Smith | Mark Lee

Metaphors use words from one domain of knowledge to describe another, which can make the meaning less clear and require human interpretation to understand. This makes it difficult for automated models to detect metaphorical usage. The objective of the experiments in the paper is to enhance the ability of deep learning models to detect metaphors automatically. This is achieved by using two elements of semantic richness, sensory experience and body-object interaction, as the main lexical features, combined with the contextual information present in the metaphorical sentences. The tests were conducted using classification and sequence labelling models for metaphor detection on the three metaphorical corpora VUAMC, MOH-X, and TroFi. Sensory experience led to significant improvements in the classification and sequence labelling models across all datasets. The highest gains were seen on the VUAMC dataset: recall increased by 20.9% and F1 by 7.5% for the classification model, and recall increased by 11.66% and F1 by 3.69% for the sequence labelling model. Body-object interaction also showed a positive impact on the three datasets.

HAQA and QUQA: Constructing Two Arabic Question-Answering Corpora for the Quran and Hadith
Sarah Alnefaie | Eric Atwell | Mohammad Ammar Alsalka

It is neither possible nor fair to compare the performance of question-answering systems for the Holy Quran and Hadith Sharif in Arabic due to both the absence of a golden test dataset on the Hadith Sharif and the small size and easy questions of the newly created golden test dataset on the Holy Quran. This article presents two question–answer datasets: Hadith Question–Answer pairs (HAQA) and Quran Question–Answer pairs (QUQA). HAQA is the first Arabic Hadith question–answer dataset available to the research community, while the QUQA dataset is regarded as the more challenging and the most extensive collection of Arabic question–answer pairs on the Quran. HAQA was designed and its data collected from several expert sources, while QUQA went through several steps in the construction phase; that is, it was designed and then integrated with existing datasets in different formats, after which the datasets were enlarged with the addition of new data from books by experts. The HAQA corpus consists of 1598 question–answer pairs, and that of QUQA contains 3382. They may be useful as gold–standard datasets for the evaluation process, as training datasets for language models with question-answering tasks and for other uses in artificial intelligence.

ConfliBERT-Arabic: A Pre-trained Arabic Language Model for Politics, Conflicts and Violence
Sultan Alsarra | Luay Abdeljaber | Wooseong Yang | Niamat Zawad | Latifur Khan | Patrick Brandt | Javier Osorio | Vito D’Orazio

This study investigates the use of Natural Language Processing (NLP) methods to analyze politics, conflicts and violence in the Middle East using domain-specific pre-trained language models. We introduce Arabic text and present ConfliBERT-Arabic, a pre-trained language model that can efficiently analyze political, conflict and violence-related texts. Our technique hones a pre-trained model using a corpus of Arabic texts about regional politics and conflicts. Performance of our models is compared to baseline BERT models. Our findings show that the performance of NLP models for Middle Eastern politics and conflict analysis is enhanced by the use of domain-specific pre-trained local language models. This study offers political and conflict analysts, including policymakers, scholars, and practitioners, new approaches and tools for deciphering the intricate dynamics of local politics and conflicts directly in Arabic.

A Review in Knowledge Extraction from Knowledge Bases
Fabio Yanez | Andrés Montoyo | Yoan Gutierrez | Rafael Muñoz | Armando Suarez

Generative language models achieve the state of the art in many tasks within natural language processing (NLP). Although these models correctly capture syntactic information, they fail to interpret knowledge (semantics). Moreover, the lack of interpretability of these models promotes the use of other technologies as a replacement or complement to generative language models. This is the case with research focused on incorporating knowledge by resorting to knowledge bases mainly in the form of graphs. The generation of large knowledge graphs is carried out with unsupervised or semi-supervised techniques, which promotes the validation of this knowledge with the same type of techniques due to the size of the generated databases. In this review, we will explain the different techniques used to test and infer knowledge from graph structures with machine learning algorithms. The motivation of validating and inferring knowledge is to use correct knowledge in subsequent tasks with improved embeddings.

Evaluating of Large Language Models in Relationship Extraction from Unstructured Data: Empirical Study from Holocaust Testimonies
Isuri Anuradha | Le An Ha | Ruslan Mitkov | Vinita Nahar

Relationship extraction from unstructured data remains one of the most challenging tasks in the field of Natural Language Processing (NLP). The complexity of relationship extraction arises from the need to comprehend the underlying semantics, syntactic structures, and contextual dependencies within the text. Unstructured data poses challenges with diverse linguistic patterns, implicit relationships, and contextual nuances, complicating accurate relationship identification and extraction. The emergence of Large Language Models (LLMs), such as GPT (Generative Pre-trained Transformer), has marked a significant advancement in the field of NLP. In this work, we assess and evaluate the effectiveness of LLMs in relationship extraction in the Holocaust testimonies within the context of the historical realm. By delving into this domain-specific context, we aim to gain deeper insights into the performance and capabilities of LLMs in accurately capturing and extracting relationships within the Holocaust domain by developing a novel knowledge graph to visualise the relationships of the Holocaust. To the best of our knowledge, there is no existing study which discusses relationship extraction in Holocaust testimonies. The majority of current approaches for Information Extraction (IE) in historic documents are either manual or OCR based. Moreover, in this study, we found that the Subject-Object-Verb extraction using GPT3-based relations produced more meaningful results compared to the Semantic Role labeling-based triple extraction.

Impact of Emojis on Automatic Analysis of Individual Emotion Categories
Ratchakrit Arreerard | Scott Piao

Automatic emotion analysis is a highly challenging task for Natural Language Processing, which has so far mainly relied on textual contents to determine the emotion of text. However, words are not the only media that carry emotional information. In social media, people also use emojis to convey their feelings. Recently, researchers have studied emotional aspects of emojis and used emoji information to improve emotion detection and classification, but many issues remain to be addressed. In this study, we examine the impact of emoji embedding on emotion classification and intensity prediction on four individual emotion categories, namely anger, fear, joy, and sadness, in order to investigate how emojis affect the automatic analysis of individual emotion categories and intensity. We conducted a comparative study by testing five machine learning models with and without emoji embeddings involved. Our experiment demonstrates that emojis have varying impact on different emotion categories, and there is potential that emojis can be used to enhance emotion information processing.

Was That a Question? Automatic Classification of Discourse Meaning in Spanish
Santiago Arróniz | Sandra Kübler

This paper examines the effectiveness of different feature representations of audio data in accurately classifying discourse meaning in Spanish. The task involves determining whether an utterance is a declarative sentence, an interrogative, an imperative, etc. We explore how pitch contour can be represented for a discourse-meaning classification task, employing three different audio features: MFCCs, Mel-scale spectrograms, and chromagrams. We also determine if utilizing means is more effective in representing the speech signal, given the large number of coefficients produced during the feature extraction process. Finally, we evaluate whether these feature representation techniques are sensitive to speaker information. Our results show that a recurrent neural network architecture in conjunction with all three feature sets yields the best results for the task.

Designing the LECOR Learner Corpus for Romanian
Ana Maria Barbu | Elena Irimia | Carmen Mîrzea Vasile | Vasile Păiș

This article presents a work-in-progress project, which aims to build and utilize a corpus of Romanian texts written or spoken by non-native students of different nationalities, who learn Romanian as a foreign language in the one-year, intensive academic program organized by the University of Bucharest. This corpus, called LECOR – Learner Corpus for Romanian – is made up of pairs of texts: the student’s version and the teacher’s corrected version. Each version is automatically annotated with lemma and POS-tag; the two versions are then compared, and, at this stage, the differences are marked as errors. The corpus also contains metadata file sets about students and their samples. In this article, the conceptual framework for building and utilizing the corpus is presented, including the acquisition and organization phases of the primary material, the annotation process, and the first attempts to adapt the NoSketch Engine query interface to the project’s objectives. The article concludes by outlining the next steps in the development of the corpus aimed at quantitative accumulation and the development of the error correction process and the complex error annotation.

Non-Parametric Memory Guidance for Multi-Document Summarization
Florian Baud | Alex Aussem

Multi-document summarization (MDS) is a difficult task in Natural Language Processing, aiming to summarize information from several documents. However, the source documents are often insufficient to obtain a high-quality summary. We propose a retriever-guided model combined with non-parametric memory for summary generation. This model retrieves relevant candidates from a database and then generates the summary considering the candidates with a copy mechanism and the source documents. The retriever is implemented with Approximate Nearest Neighbor Search (ANN) to search large databases. Our method is evaluated on the MultiXScience dataset which includes scientific articles. Finally, we discuss our results and possible directions for future work.

Beyond Information: Is ChatGPT Empathetic Enough?
Ahmed Belkhir | Fatiha Sadat

This paper aims to explore and enhance ChatGPT’s abilities to generate more human-like conversations by taking into account the emotional state of the user. To achieve this goal, a prompt-driven Emotional Intelligence is used through the empathetic dialogue dataset in order to propose a more empathetic conversational language model. We propose two altered versions of ChatGPT as follows: (1) an emotion-infused version which takes the user’s emotion as input before generating responses using an emotion classifier based on ELECTRA; and (2) an emotion-adapting version that tries to accommodate how the user feels without any external component. By analyzing responses of the two proposed altered versions and comparing them to the standard version of ChatGPT, we find that using the external emotion classifier leads to more frequent and pronounced use of positive emotions compared to the standard version. On the other hand, using simple prompt engineering to take the user’s emotion into consideration does the opposite. Finally, comparisons with state-of-the-art models highlight the potential of prompt engineering to enhance the emotional abilities of chatbots based on large language models.

Using Wikidata for Enhancing Compositionality in Pretrained Language Models
Meriem Beloucif | Mihir Bansal | Chris Biemann

One of the many advantages of pre-trained language models (PLMs) such as BERT and RoBERTa is their flexibility and contextual nature. These features give PLMs strong capabilities for representing lexical semantics. However, PLMs seem incapable of capturing high-level semantics in terms of compositionality. We show that when augmented with the relevant semantic knowledge, PLMs learn to capture a higher degree of lexical compositionality. We annotate a large dataset from Wikidata highlighting a type of semantic inference that is easy for humans to understand but difficult for PLMs, like the correlation between age and date of birth. We use this resource for finetuning DistilBERT, BERT large and RoBERTa. Our results show that the performance of PLMs against the test data continuously improves when augmented with such a rich resource. Our results are corroborated by a consistent improvement over most GLUE benchmark natural language understanding tasks.

Multimodal Learning for Accurate Visual Question Answering: An Attention-Based Approach
Jishnu Bhardwaj | Anurag Balakrishnan | Satyam Pathak | Ishan Unnarkar | Aniruddha Gawande | Benyamin Ahmadnia

This paper proposes an open-ended task for Visual Question Answering (VQA) that leverages the InceptionV3 Object Detection model and an attention-based Long Short-Term Memory (LSTM) network for question answering. Our proposed model provides accurate natural language answers to questions about an image, including those that require understanding contextual information and background details. Our findings demonstrate that the proposed approach can achieve high accuracy, even with complex and varied visual information. The proposed method can contribute to developing more advanced vision systems that can process and interpret visual information like humans.

Generative Models For Indic Languages: Evaluating Content Generation Capabilities
Savita Bhat | Vasudeva Varma | Niranjan Pedanekar

Large language models (LLMs) and generative AI have emerged as the most important areas in the field of natural language processing (NLP). LLMs are considered to be a key component in several NLP tasks, such as summarization, question-answering, sentiment classification, and translation. Newer LLMs, such as ChatGPT, BLOOMZ, and several such variants, are known to train on multilingual training data and hence are expected to process and generate text in multiple languages. Considering the widespread use of LLMs, evaluating their efficacy in multilingual settings is imperative. In this work, we evaluate the newest generative models (ChatGPT, mT0, and BLOOMZ) in the context of Indic languages. Specifically, we consider natural language generation (NLG) applications such as summarization and question-answering in monolingual and cross-lingual settings. We observe that current generative models have limited capability for generating text in Indic languages in a zero-shot setting. In contrast, generative models perform consistently better on manual quality-based evaluation in both Indic languages and English language generation. Considering the limited generation performance, we argue that these LLMs are not intended to be used in a zero-shot fashion in downstream applications.

Measuring Spurious Correlation in Classification: “Clever Hans” in Translationese
Angana Borah | Daria Pylypenko | Cristina España-Bonet | Josef van Genabith

Recent work has shown evidence of “Clever Hans” behavior in high-performance neural translationese classifiers, where BERT-based classifiers capitalize on spurious correlations, in particular topic information, between data and target classification labels, rather than genuine translationese signals. Translationese signals are subtle (especially for professional translation) and compete with many other signals in the data such as genre, style, author, and, in particular, topic. This raises the general question of how much of the performance of a classifier is really due to spurious correlations in the data versus the signals actually targeted by the classifier, especially for subtle target signals and in challenging (low resource) data settings. We focus on topic-based spurious correlation and approach the question from two directions: (i) where we have no knowledge about spurious topic information and its distribution in the data, (ii) where we have some indication about the nature of spurious topic correlations. For (i) we develop a measure from first principles capturing alignment of unsupervised topics with target classification labels as an indication of spurious topic information in the data. We show that our measure is the same as purity in clustering and propose a “topic floor” (as in a “noise floor”) for classification. For (ii) we investigate masking of known spurious topic carriers in classification. Both (i) and (ii) contribute to quantifying and (ii) to mitigating spurious correlations.

WIKITIDE: A Wikipedia-Based Timestamped Definition Pairs Dataset
Hsuvas Borkakoty | Luis Espinosa Anke

A fundamental challenge in the current NLP context, dominated by language models, comes from the inflexibility of current architectures to “learn” new information. While model-centric solutions like continual learning or parameter-efficient fine-tuning are available, the question still remains of how to reliably identify changes in language or in the world. In this paper, we propose WikiTiDe, a dataset derived from pairs of timestamped definitions extracted from Wikipedia. We argue that such resources can be helpful for accelerating diachronic NLP, specifically, for training models able to scan knowledge resources for core updates concerning a concept, an event, or a named entity. Our proposed end-to-end method is fully automatic and leverages a bootstrapping algorithm for gradually creating a high-quality dataset. Our results suggest that bootstrapping the seed version of WikiTiDe leads to better fine-tuned models. We also leverage fine-tuned models in a number of downstream tasks, showing promising results with respect to competitive baselines.

BERTabaporu: Assessing a Genre-Specific Language Model for Portuguese NLP
Pablo Botton Costa | Matheus Camasmie Pavan | Wesley Ramos Santos | Samuel Caetano Silva | Ivandré Paraboni

Transformer-based language models such as Bidirectional Encoder Representations from Transformers (BERT) are now mainstream in the NLP field, but extensions to languages other than English, to new domains and/or to more specific text genres are still in demand. In this paper we introduce BERTabaporu, a BERT language model that has been pre-trained on Twitter data in the Brazilian Portuguese language. The model is shown to outperform the best-known general-purpose model for this language in three Twitter-related NLP tasks, making it a potentially useful resource for Portuguese NLP in general.

Comparison of Multilingual Entity Linking Approaches
Ivelina Bozhinova | Andrey Tagarev

Despite rapid developments in the field of Natural Language Processing (NLP) in the past few years, the task of Multilingual Entity Linking (MEL) and especially its end-to-end formulation remains challenging. In this paper we aim to evaluate solutions for general end-to-end multilingual entity linking by conducting experiments using both existing complete approaches and novel combinations of pipelines for solving the task. The results identify the best performing current solutions and suggest some directions for further research.

Automatic Extraction of the Romanian Academic Word List: Data and Methods
Ana-Maria Bucur | Andreea Dincă | Madalina Chitez | Roxana Rogobete

This paper presents the methodology and data used for the automatic extraction of the Romanian Academic Word List (Ro-AWL). Academic Word Lists are useful in both L2 and L1 teaching contexts. For the Romanian language, no such resource exists so far. Ro-AWL has been generated by combining methods from corpus and computational linguistics with L2 academic writing approaches. We use two types of data: (a) existing data, such as the Romanian Frequency List based on the ROMBAC corpus, and (b) self-compiled data, such as the expert academic writing corpus EXPRES. For constructing the academic word list, we follow the methodology for building the Academic Vocabulary List for the English language. The distribution of Ro-AWL features (general distribution, POS distribution) into four disciplinary datasets is in line with previous research. Ro-AWL is freely available and can be used for teaching, research and NLP applications.

Stance Prediction from Multimodal Social Media Data
Lais Carraro Leme Cavalheiro | Matheus Camasmie Pavan | Ivandré Paraboni

Stance prediction - the computational task of inferring attitudes towards a given target topic of interest - relies heavily on text data provided by social media or similar sources, but it may also benefit from non-text information such as demographics (e.g., users’ gender, age, etc.), network structure (e.g., friends, followers, etc.), interactions (e.g., mentions, replies, etc.) and other non-text properties (e.g., time information, etc.). However, so-called hybrid (or in some cases multimodal) approaches to stance prediction have only been developed for a small set of target languages, and often make use of count-based text models (e.g., bag-of-words) and time-honoured classification methods (e.g., support vector machines). As a means to further research in the field, in this work we introduce a number of text- and non-text models for stance prediction in the Portuguese language, which make use of more recent methods based on BERT and an ensemble architecture, and ask whether a BERT stance classifier may be enhanced with different kinds of network-related information.

From Stigma to Support: A Parallel Monolingual Corpus and NLP Approach for Neutralizing Mental Illness Bias
Mason Choey

Negative attitudes and perceptions towards mental illness continue to be pervasive in our society. One of the factors contributing to and reinforcing this stigma is the usage of language that is biased against mental illness. Identifying biased language and replacing it with person-first, neutralized language is a first step towards eliminating harmful stereotypes and creating a supportive and inclusive environment for those living with mental illness. This paper presents a novel Natural Language Processing (NLP) system that aims to automatically identify biased text related to mental illness and suggest neutral language replacements without altering the original text’s meaning. Building on previous work in the field, this paper presents the Mental Illness Neutrality Corpus (MINC) comprising over 5500 mental illness-biased text and neutralized sentence pairs (in English), which is used to fine-tune a CONCURRENT model system developed by Pryzant et al. (2020). After evaluation, the model demonstrates high proficiency in neutralizing mental illness bias with an accuracy of 98.7%. This work contributes a valuable resource for reducing mental illness bias in text and has the potential for further research in tackling more complex nuances and multilingual biases.

pdf bib
BB25HLegalSum: Leveraging BM25 and BERT-Based Clustering for the Summarization of Legal Documents
Leonardo de Andrade | Karin Becker

Legal document summarization aims to provide a clear understanding of the main points and arguments in a legal document, contributing to the efficiency of the judicial system. In this paper, we propose BB25HLegalSum, a method that combines BERT clusters with the BM25 algorithm to summarize legal documents and present them to users with highlighted important information. The process involves selecting unique, relevant sentences from the original document, clustering them to find sentences about a similar subject, combining them to generate a summary according to three strategies, and highlighting them to the user in the original document. We outperformed baseline techniques using the BillSum dataset, a widely used benchmark in legal document summarization. Legal workers positively assessed the highlighted presentation.
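
As an illustration of the BM25 ranking component named in this abstract, the following is a minimal sketch of textbook Okapi BM25 scoring. It is not the authors' implementation: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document in `docs` against the tokenised `query`.

    Hypothetical helper for illustration; k1 and b are common defaults,
    not values taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (always positive), one common BM25 variant.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


sentences = [
    ["court", "ruled", "tax"],
    ["weather", "sunny"],
    ["tax", "tax", "bill", "court"],
]
scores = bm25_scores(["tax"], sentences)
```

In this toy run the sentence that never mentions the query term scores zero, and the sentence mentioning it twice ranks highest. How such scores are combined with the BERT-based clusters is described in the paper itself and is not reproduced here.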

pdf bib
SSSD: Leveraging Pre-trained Models and Semantic Search for Semi-supervised Stance Detection
André de Sousa | Karin Becker

Pre-trained models (PTMs) based on the Transformer architecture are trained on massive amounts of data and can capture nuances and complexities in linguistic expressions, making them a powerful tool for many natural language processing tasks. In this paper, we present SSSD (Semantic Similarity Stance Detection), a semi-supervised method for stance detection on Twitter that automatically labels a large, domain-related corpus for training a stance classification model. The method assumes as input a domain set of tweets about a given target and a labeled query set of tweets of representative arguments related to the stances. It scales the automatic labeling of a large number of tweets, and improves classification accuracy by leveraging the power of PTMs and semantic search to capture context and meaning. We largely outperformed all baselines in experiments using the SemEval benchmark.

pdf bib
Detecting Text Formality: A Study of Text Classification Approaches
Daryna Dementieva | Nikolay Babakov | Alexander Panchenko

Formality is one of the important characteristics of text documents. The automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks. Previously, two large-scale datasets featuring formality annotation were introduced for multiple languages: GYAFC and X-FORMAL. However, they were primarily used for the training of style transfer models. At the same time, the detection of text formality on its own may also be a useful application. This work presents what is, to our knowledge, the first systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods and delivers the best-performing models for public usage. We conducted three types of experiments: monolingual, multilingual, and cross-lingual. The study shows that the Char BiLSTM model outperforms Transformer-based ones on the monolingual and multilingual formality classification tasks, while Transformer-based classifiers are more stable under cross-lingual knowledge transfer.

pdf bib
Developing a Multilingual Corpus of Wikipedia Biographies
Hannah Devinney | Anton Eklund | Igor Ryazanov | Jingwen Cai

For many languages, Wikipedia is the most accessible source of biographical information. Studying how Wikipedia describes the lives of people can provide insights into societal biases, as well as cultural differences more generally. We present a method for extracting datasets of Wikipedia biographies. The accompanying codebase is adapted to English, Swedish, Russian, Chinese, and Farsi, and is extendable to other languages. We present an exploratory analysis of biographical topics and gendered patterns in four languages using topic modelling and embedding clustering. We find similarities across languages in the types of categories present, with the distribution of biographies concentrated in the language’s core regions. Masculine terms are over-represented and spread out over a wide variety of topics. Feminine terms are less frequent and linked to more constrained topics. Non-binary terms are nearly unrepresented.

pdf bib
A Computational Analysis of the Voices of Shakespeare’s Characters
Liviu P. Dinu | Ana Sabina Uban

In this paper we propose a study of a relatively novel problem in authorship attribution research: that of classifying the stylome of characters in a literary work. We choose as a case study the plays of William Shakespeare, presumably the most renowned and respected dramatist in the history of literature. Previous research in the field of authorship attribution has shown that the writing style of an author can be characterized and distinguished from that of other authors automatically. The question we propose to answer is a related but different one: can the styles of different characters be distinguished? We aim to verify in this way if an author managed to create believable characters with individual styles, and focus on Shakespeare’s iconic characters. Our experiments, using various features and models, including an SVM and a neural network, show that characters in Shakespeare’s plays can be classified with up to 50% accuracy.

pdf bib
Source Code Plagiarism Detection with Pre-Trained Model Embeddings and Automated Machine Learning
Fahad Ebrahim | Mike Joy

Source code plagiarism is a critical ethical issue in computer science education where students use someone else’s work as their own. It can be treated as a binary classification problem where the output can be either: yes (plagiarism found) or no (plagiarism not found). In this research, we have taken the open-source dataset ‘SOCO’, which contains two programming languages (PLs), namely Java and C/C++ (although our method could be applied to any PL). Source codes should be converted to vector representations that capture both the syntax and semantics of the text, known as contextual embeddings. These embeddings would be generated using source code pre-trained models (CodePTMs). The cosine similarity scores of three different CodePTMs were selected as features. The classifier selection and parameter tuning were conducted with the assistance of Automated Machine Learning (AutoML). The selected classifiers were tested, initially on Java, and the proposed approach produced average to high results compared to other published research, and surpassed the baseline (the JPlag plagiarism detection tool). For C/C++, the approach outperformed other research work and produced the highest ranking score.

pdf bib
Identifying Semantic Argument Types in Predication and Copredication Contexts: A Zero-Shot Cross-Lingual Approach
Deniz Ekin Yavas | Laura Kallmeyer | Rainer Osswald | Elisabetta Jezek | Marta Ricchiardi | Long Chen

Identifying semantic argument types in predication contexts is not a straightforward task for several reasons, such as inherent polysemy, coercion, and copredication phenomena. In this paper, we train monolingual and multilingual classifiers with a zero-shot cross-lingual approach to identify semantic argument types in predications using pre-trained language models as feature extractors. We train classifiers for different semantic argument types and for both verbal and adjectival predications. Furthermore, we propose a method to detect copredication using these classifiers through identifying the argument semantic type targeted in different predications over the same noun in a sentence. We evaluate the performance of the method on copredication test data with Food•Event nouns for 5 languages.

pdf bib
A Review of Research-Based Automatic Text Simplification Tools
Isabel Espinosa-Zaragoza | José Abreu-Salas | Elena Lloret | Paloma Moreda | Manuel Palomar

In the age of knowledge, the democratisation of information facilitated through the Internet may not be as pervasive if written language poses challenges to particular sectors of the population. The objective of this paper is to present an overview of research-based automatic text simplification tools. Consequently, we describe aspects such as the language, language phenomena, language levels simplified, approaches, specific target populations these tools are created for (e.g. individuals with cognitive impairment, attention deficit, elderly people, children, language learners), and accessibility and availability considerations. The review of existing studies covering automatic text simplification tools is conducted by searching two databases: Web of Science and Scopus. The eligibility criteria involve text simplification tools with a scientific background in order to ascertain how they operate. This methodology yielded 27 text simplification tools that are further analysed. Some of the main conclusions reached with this review are the lack of resources accessible to the public; the need for customisation to foster the individual’s independence, by allowing the user to select what s/he finds challenging to understand while not limiting the user’s capabilities; and the need for more simplification tools in languages other than English, to mention a few.

pdf bib
Vocab-Expander: A System for Creating Domain-Specific Vocabularies Based on Word Embeddings
Michael Faerber | Nicholas Popovic

In this paper, we propose Vocab-Expander, an online tool that enables end-users (e.g., technology scouts) to create and expand a vocabulary of their domain of interest. It utilizes an ensemble of state-of-the-art word embedding techniques based on web text and ConceptNet, a common-sense knowledge base, to suggest related terms for already given terms. The system has an easy-to-use interface that allows users to quickly confirm or reject term suggestions. Vocab-Expander offers a variety of potential use cases, such as improving concept-based information retrieval in technology and innovation management, enhancing communication and collaboration within organizations or interdisciplinary projects, and creating vocabularies for specific courses in education.

pdf bib
On the Generalization of Projection-Based Gender Debiasing in Word Embedding
Elisabetta Fersini | Antonio Candelieri | Lorenzo Pastore

Gender bias estimation and mitigation techniques in word embeddings lack an understanding of their generalization capabilities. In this work, we complement prior research by systematically comparing four gender bias metrics (Word Embedding Association Test, Relative Negative Sentiment Bias, Embedding Coherence Test and Bias Analogy Test) and two types of projection-based gender mitigation strategies (hard- and soft-debiasing) on three well-known word embedding representations (Word2Vec, FastText and GloVe). The experiments have shown that the considered word embeddings are consistent with one another, but the debiasing techniques are inconsistent across the different metrics, also highlighting the potential risk of unintended bias after the mitigation strategies.

pdf bib
Mapping Explicit and Implicit Discourse Relations between the RST-DT and the PDTB 3.0
Nelson Filipe Costa | Nadia Sheikh | Leila Kosseim

In this paper we propose a first empirical mapping between the RST-DT and the PDTB 3.0. We provide an original algorithm which allowed the mapping of 6,510 (80.0%) explicit and implicit discourse relations between the overlapping articles of the RST-DT and PDTB 3.0 discourse annotated corpora. Results of the mapping show that while it is easier to align segments of implicit discourse relations, the mapping obtained between the aligned explicit discourse relations is less ambiguous.

pdf bib
Bigfoot in Big Tech: Detecting Out of Domain Conspiracy Theories
Matthew Fort | Zuoyu Tian | Elizabeth Gabel | Nina Georgiades | Noah Sauer | Daniel Dakota | Sandra Kübler

We investigate approaches to classifying texts into either conspiracy theory or mainstream using the Language Of Conspiracy (LOCO) corpus. Since conspiracy theories are not monolithic constructs, we need to identify approaches that robustly work in an out-of-domain setting (i.e., across conspiracy topics). We investigate whether optimal in-domain settings can be transferred to out-of-domain settings, and we investigate different methods for bleaching to steer classifiers away from words typical for an individual conspiracy theory. We find that BART works better than an SVM, that we can successfully classify out-of-domain, but there are no clear trends in how to choose the best source training domains. Additionally, bleaching only topic words works better than bleaching all content words or completely delexicalizing texts.

pdf bib
Deep Learning Approaches to Detecting Safeguarding Concerns in Schoolchildren’s Online Conversations
Emma Franklin | Tharindu Ranasinghe

For school teachers and Designated Safeguarding Leads (DSLs), computers and other school-owned communication devices are both indispensable and deeply worrisome. For their education, children require access to the Internet, as well as a standard institutional ICT infrastructure, including e-mail and other forms of online communication technology. Given the sheer volume of data being generated and shared on a daily basis within schools, most teachers and DSLs can no longer monitor the safety and wellbeing of their students without the use of specialist safeguarding software. In this paper, we experiment with the use of state-of-the-art neural network models on the modelling of a dataset of almost 9,000 anonymised child-generated chat messages on the Microsoft Teams platform. The data was manually classified into eight fine-grained classes of safeguarding concerns (or false alarms) that a monitoring program would be interested in, and these were further split into two binary classes: true positives (real safeguarding concerns) and false positives (false alarms). For the fine-grained classification, our models achieved a macro F1 score of 73.56, while for the binary classification, we achieved a macro F1 score of 87.32. This first experiment into the use of Deep Learning for detecting safeguarding concerns represents an important step towards achieving high-accuracy and reliable monitoring information for busy teachers and safeguarding leads.

pdf bib
On the Identification and Forecasting of Hate Speech in Inceldom
Paolo Gajo | Arianna Muti | Katerina Korre | Silvia Bernardini | Alberto Barrón-Cedeño

Spotting hate speech in social media posts is crucial to increase the civility of the Web and has been thoroughly explored in the NLP community. For the first time, we introduce a multilingual corpus for the analysis and identification of hate speech in the domain of inceldom, built from incel Web forums in English and Italian, including expert annotation at the post level for two kinds of hate speech: misogyny and racism. This resource paves the way for the development of mono- and cross-lingual models for (a) the identification of hateful (misogynous and racist) posts and (b) the forecasting of the amount of hateful responses that a post is likely to trigger. Our experiments aim at improving the performance of Transformer-based models using masked language modeling pre-training and dataset merging. The results show that these strategies boost the models’ performance in all settings (binary classification, multi-label classification and forecasting), especially in the cross-lingual scenarios.

pdf bib
T2KG: Transforming Multimodal Document to Knowledge Graph
Santiago Galiano | Rafael Muñoz | Yoan Gutiérrez | Andrés Montoyo | Jose Ignacio Abreu | Luis Alfonso Ureña

The large amount of information in digital format that exists today makes it unfeasible to use manual means to acquire the knowledge contained in these documents. Therefore, it is necessary to develop tools that allow us to incorporate this knowledge into a structure that is easy to use by both machines and humans. This paper presents a system that can incorporate the relevant information from a document in any format, structured or unstructured, into a semantic network that represents the existing knowledge in the document. The system independently processes both structured documents, based on their annotation scheme, and unstructured documents written in natural language. For the latter, it uses a set of sensors that identify the relevant information and subsequently incorporate it to enrich the semantic network, which is created by linking all the information based on the knowledge discovered.

pdf bib
!Translate: When You Cannot Cook Up a Translation, Explain
Federico Garcea | Margherita Martinelli | Maja Milicević Petrović | Alberto Barrón-Cedeño

In the domain of cuisine, both dishes and ingredients tend to be heavily rooted in the local context they belong to. As a result, the associated terms are often realia tied to specific cultures and languages. This causes difficulties for non-speakers of the local language and machine translation (MT) systems alike, as it implies a lack of the concept and/or of a plausible translation. MT typically opts for one of two alternatives: keeping the source language terms untranslated or relying on a hyperonym/near-synonym in the target language, provided one exists. !Translate proposes a better alternative: explaining. Given a cuisine entry such as a restaurant menu item, we identify culture-specific terms and enrich the output of the MT system with automatically retrieved definitions of the non-translatable terms in the target language, making the translation more actionable for the final user.

pdf bib
An Evaluation of Source Factors in Concatenation-Based Context-Aware Neural Machine Translation
Harritxu Gete | Thierry Etchegoyhen

We explore the use of source factors in context-aware neural machine translation, specifically concatenation-based models, to improve the translation quality of inter-sentential phenomena. Context sentences are typically concatenated to the sentence to be translated, with string-based markers to separate the latter from the former. Although previous studies have measured the impact of prefixes to identify and mark context information, the use of learnable factors has only been marginally explored. In this study, we evaluate the impact of single and multiple source context factors in English-German and Basque-Spanish contextual translation. We show that factors of this type can significantly enhance translation accuracy for phenomena such as gender and register coherence in Basque-Spanish, while also improving BLEU results in some scenarios. These results demonstrate the potential of factor-based context identification to improve context-aware machine translation in future research.

pdf bib
Lessons Learnt from Linear Text Segmentation: a Fair Comparison of Architectural and Sentence Encoding Strategies for Successful Segmentation
Iacopo Ghinassi | Lin Wang | Chris Newell | Matthew Purver

Recent works on linear text segmentation have shown new state-of-the-art results nearly every year. Most times, however, these recent advances include a variety of different elements, which makes it difficult to evaluate which individual components of the proposed methods bring about improvements for the task and, more generally, what actually works for linear text segmentation. Moreover, evaluating text segmentation is notoriously difficult and the use of a metric such as Pk, which is widely used in existing literature, presents specific problems that complicate a fair comparison between segmentation models. In this work, then, we draw from a number of existing works to assess what the state of the art in linear text segmentation is, investigating what architectures and features work best for the task. To do so, we present three models representative of a variety of approaches, we compare them to existing methods and we inspect elements composing them, so as to give a more complete picture of which technique is more successful and why that might be the case. At the same time, we highlight a specific feature of Pk which can bias the results and we report our results using different settings, so as to give future literature a more comprehensive set of baseline results for future developments. We then hope that this work can serve as a solid foundation to foster research in the area, overcoming task-specific difficulties such as evaluation setting and providing new state-of-the-art results.
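
For readers unfamiliar with the Pk metric mentioned in this abstract, the following is a minimal sketch of its standard sliding-window definition. It is not the authors' evaluation code: the function name `pk`, the per-sentence segment-ID input format, and the default window (half the mean reference segment length, a common convention) are assumptions made for the example.

```python
def pk(reference, hypothesis, k=None):
    """Pk segmentation error (lower is better).

    `reference` and `hypothesis` are equal-length lists giving, for each
    sentence, the ID of the segment it belongs to, e.g. [0, 0, 0, 1, 1].
    Hypothetical helper for illustration only.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same sentences")
    n = len(reference)
    if k is None:
        # Common convention: half the mean reference segment length.
        # Assumes each segment has a distinct ID.
        k = max(1, round(n / len(set(reference)) / 2))
    if k >= n:
        raise ValueError("window size k must be smaller than the text length")
    errors = 0
    for i in range(n - k):
        # Do the two ends of the window fall in the same segment?
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        errors += same_ref != same_hyp
    return errors / (n - k)


ref = [0, 0, 0, 1, 1, 1]
perfect = pk(ref, ref, k=2)
no_boundary = pk(ref, [0, 0, 0, 0, 0, 0], k=2)
```

Because the score depends on the window size `k`, results computed under different window settings are not directly comparable, which is one reason to state the setting explicitly when reporting Pk.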

pdf bib
Student’s t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce
Serge Gladkoff | Lifeng Han | Goran Nenadic

In natural language processing (NLP) we always rely on human judgement as the gold-standard quality evaluation method. However, there has been an ongoing debate on how to better evaluate inter-rater reliability (IRR) levels for certain evaluation tasks, such as translation quality evaluation (TQE), especially when the data samples (observations) are very scarce. In this work, we first introduce the study on how to estimate the confidence interval for the measurement value when only one data (evaluation) point is available. Then, this leads to our example with two human-generated observational scores, for which we introduce the “Student’s t-Distribution” method and explain how to use it to measure the IRR score using only these two data points, as well as the confidence intervals (CIs) of the quality evaluation. We give a quantitative analysis of how the evaluation confidence can be greatly improved by introducing more observations, even if only one extra observation. We encourage researchers to report their IRR scores by all possible means, e.g. using the Student’s t-Distribution method whenever possible, thus making NLP evaluation more meaningful, transparent, and trustworthy. This t-Distribution method can also be used outside of NLP fields to measure the IRR level for trustworthy evaluation of experimental investigations, whenever the observational data is scarce.
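
To make the effect of one extra observation concrete, the following is a minimal sketch of the textbook Student's t confidence interval for a mean, applied to two and then three scores. It is not the authors' procedure for deriving an IRR score: the function names, the example scores, and the restriction to 1 or 2 degrees of freedom (where the t quantile has a closed form, so no external library is needed) are assumptions made for the example.

```python
import math
from statistics import mean, stdev


def t_quantile(p, df):
    """Closed-form Student's t quantile; only df = 1 and df = 2 are handled."""
    if df == 1:
        # With one degree of freedom the t-distribution is the Cauchy distribution.
        return math.tan(math.pi * (p - 0.5))
    if df == 2:
        return (2 * p - 1) / math.sqrt(2 * p * (1 - p))
    raise ValueError("closed form implemented only for df in {1, 2}")


def confidence_interval(scores, level=0.95):
    """Two-sided t confidence interval for the mean of 2 or 3 scores."""
    n = len(scores)
    centre = mean(scores)
    sem = stdev(scores) / math.sqrt(n)  # standard error of the mean
    half_width = t_quantile(0.5 + level / 2, n - 1) * sem
    return centre - half_width, centre + half_width


# Hypothetical quality scores from two raters, then with a third rater added.
low2, high2 = confidence_interval([80, 90])
low3, high3 = confidence_interval([80, 90, 85])
```

At the 95% level the critical value drops from about 12.71 with two observations to about 4.30 with three, so the interval around the same mean narrows sharply. For larger samples a statistics library would normally supply the quantile.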

pdf bib
Data Augmentation for Fake News Detection by Combining Seq2seq and NLI
Anna Glazkova

State-of-the-art data augmentation methods help improve the generalization of deep learning models. However, these methods often generate examples that do not preserve the original class labels. Preserving labels is crucial for some natural language processing tasks, such as fake news detection. In this work, we combine sequence-to-sequence and natural language inference models for data augmentation in the fake news detection domain using short news texts, such as tweets and news titles. This approach allows us to generate new training examples that do not contradict facts from the original texts. We use the non-entailment probability for the pair of the original and generated texts as a loss function for a transformer-based sequence-to-sequence model. The proposed approach has demonstrated its effectiveness on three classification benchmarks in fake news detection in terms of macro F1-score and ROC AUC. Moreover, we showed that our approach retains the class label of the original text more accurately than other transformer-based methods.

pdf bib
Exploring Unsupervised Semantic Similarity Methods for Claim Verification in Health Care News Articles
Vishwani Gupta | Astrid Viciano | Holger Wormer | Najmehsadat Mousavinezhad

In the 21st century, the proliferation of fake information has emerged as a significant threat to society. Particularly, healthcare medical reporters face challenges when verifying claims related to treatment effects, side effects, and risks mentioned in news articles, relying on scientific publications for accuracy. The accurate communication of scientific information in news articles has long been a crucial concern in the scientific community, as the dissemination of misinformation can have dire consequences in the healthcare domain. Healthcare medical reporters would greatly benefit from efficient methods to retrieve evidence from scientific publications supporting specific claims. This paper delves into the application of unsupervised semantic similarity models to facilitate claim verification for medical reporters, thereby expediting the process. We explore unsupervised multilingual evidence retrieval techniques aimed at reducing the time required to obtain evidence from scientific studies. Instead of employing content classification, we propose an approach that retrieves relevant evidence from scientific publications for claim verification within the healthcare domain. Given a claim and a set of scientific publications, our system generates a list of the most similar paragraphs containing supporting evidence. Furthermore, we evaluate the performance of state-of-the-art unsupervised semantic similarity methods in this task. As the claim and evidence are present in a cross-lingual space, we find that the XLM-RoBERTa model exhibits high accuracy in achieving our objective. Through this research, we contribute to enhancing the efficiency and reliability of claim verification for healthcare medical reporters, enabling them to accurately source evidence from scientific publications in a timely manner.

pdf bib
AlphaMWE-Arabic: Arabic Edition of Multilingual Parallel Corpora with Multiword Expression Annotations
Najet Hadj Mohamed | Malak Rassem | Lifeng Han | Goran Nenadic

Multiword Expressions (MWEs) have been a bottleneck for Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks due to their idiomaticity, ambiguity, and non-compositionality. Bilingual parallel corpora with MWE annotations are very scarce, which poses another challenge for current Natural Language Processing (NLP) systems, especially in a multilingual setting. This work presents AlphaMWE-Arabic, an Arabic edition of the AlphaMWE parallel corpus with MWE annotations. We describe how we created this corpus, including machine translation (MT), post-editing, and annotations for both standard and dialectal varieties, i.e. Tunisian and Egyptian Arabic. We analyse the MT errors on MWE-related content, both quantitatively using the human-in-the-loop metric HOPE and qualitatively. We report that the current state-of-the-art MT systems are far from reaching human parity. We expect our bilingual English-Arabic corpus will be an asset for multilingual research on MWEs such as translation and localisation, as well as for monolingual settings including the study of Arabic-specific lexicography and phrasal verbs on MWEs. Our corpus and experimental data are available at

pdf bib
Performance Analysis of Arabic Pre-trained Models on Named Entity Recognition Task
Abdelhalim Hafedh Dahou | Mohamed Amine Cheragui | Ahmed Abdelali

Named Entity Recognition (NER) is a crucial task within natural language processing (NLP) that entails the identification and classification of entities, such as person, organization and location. This study delves into NER specifically in the Arabic language, focusing on the Algerian dialect. While previous research in NER has primarily concentrated on Modern Standard Arabic (MSA), the advent of social media has prompted a need to address the variations found in different Arabic dialects. Moreover, given the notable achievements of Large-scale pre-trained models (PTMs) based on the BERT architecture, this paper aims to evaluate Arabic pre-trained models using an Algerian dataset that covers different domains and writing styles. Additionally, an error analysis is conducted to identify PTMs’ limitations, and an investigation is carried out to assess the performance of trained MSA models on the Algerian dialect. The experimental results and subsequent analysis shed light on the complexities of NER in Arabic, offering valuable insights for future research endeavors.

pdf bib
Discourse Analysis of Argumentative Essays of English Learners Based on CEFR Level
Blaise Hanel | Leila Kosseim

In this paper, we investigate the relationship between the use of discourse relations and the CEFR-level of argumentative English learner essays. Using both the Rhetorical Structure Theory (RST) and the Penn Discourse TreeBank (PDTB) frameworks, we analyze essays from The International Corpus Network of Asian Learners (ICNALE), and the Corpus and Repository of Writing (CROW). Results show that the use of the RST relations of Explanation and Background, as well as the first-level PDTB sense of Contingency, are influenced by the English proficiency level of the writer.

pdf bib
Improving Translation Quality for Low-Resource Inuktitut with Various Preprocessing Techniques
Mathias Hans Erik Stenlund | Mathilde Nanni | Micaella Bruton | Meriem Beloucif

Neural machine translation has been shown to outperform all other machine translation paradigms when trained in a high-resource setting. However, it still performs poorly when dealing with low-resource languages, for which parallel data for training is scarce. This is especially the case for morphologically complex languages such as Turkish, Tamil, Uyghur, etc. In this paper, we investigate various preprocessing methods for Inuktitut, a low-resource indigenous language from North America, without a morphological analyzer. On both the original and romanized scripts, we test various preprocessing techniques such as Byte-Pair Encoding, random stemming, and data augmentation using Hungarian for the Inuktitut-to-English translation task. We found that there are benefits to retaining the original script as it helps to achieve higher BLEU scores than the romanized models.

pdf bib
Enriched Pre-trained Transformers for Joint Slot Filling and Intent Detection
Momchil Hardalov | Ivan Koychev | Preslav Nakov

Detecting the user’s intent and finding the corresponding slots among the utterance’s words are important tasks in natural language understanding. Their interconnected nature makes their joint modeling a standard part of training such models. Moreover, data scarceness and specialized vocabularies pose additional challenges. Recently, the advances in pre-trained language models, namely contextualized models such as ELMo and BERT, have revolutionized the field by tapping the potential of training very large models with just a few steps of fine-tuning on a task-specific dataset. Here, we leverage such models, and we design a novel architecture on top of them. Moreover, we propose an intent pooling attention mechanism, and we reinforce the slot filling task by fusing intent distributions, word features, and token representations. The experimental results on standard datasets show that our model outperforms both the current non-BERT state of the art as well as stronger BERT-based baselines.

pdf bib
Unimodal Intermediate Training for Multimodal Meme Sentiment Classification
Muzhaffar Hazman | Susan McKeever | Josephine Griffith

Internet Memes remain a challenging form of user-generated content for automated sentiment classification. The limited availability of labelled memes is a barrier to developing sentiment classifiers for multimodal memes. To address the shortage of labelled memes, we propose to supplement the training of a multimodal meme classifier with unimodal (image-only and text-only) data. In this work, we present a novel variant of supervised intermediate training that uses relatively abundant sentiment-labelled unimodal data. Our results show a statistically significant performance improvement from the incorporation of unimodal text data. Furthermore, we show that the training set of labelled memes can be reduced by 40% without reducing the performance of the downstream model.

pdf bib
Explainable Event Detection with Event Trigger Identification as Rationale Extraction
Hansi Hettiarachchi | Tharindu Ranasinghe

Most event detection methods act at the sentence-level and focus on identifying sentences related to a particular event. However, identifying certain parts of a sentence that act as event triggers is also important and more challenging, especially when dealing with limited training data. Previous event detection attempts have considered these two tasks separately and have developed different methods. We hypothesise that similar to humans, successful sentence-level event detection models rely on event triggers to predict sentence-level labels. By exploring feature attribution methods that assign relevance scores to the inputs to explain model predictions, we study the behaviour of state-of-the-art sentence-level event detection models and show that explanations (i.e. rationales) extracted from these models can indeed be used to detect event triggers. We, therefore, (i) introduce a novel weakly-supervised method for event trigger detection; and (ii) propose to use event triggers as an explainable measure in sentence-level event detection. To the best of our knowledge, this is the first explainable machine learning approach to event trigger identification.

pdf bib
Clinical Text Classification to SNOMED CT Codes Using Transformers Trained on Linked Open Medical Ontologies
Anton Hristov | Petar Ivanov | Anna Aksenova | Tsvetan Asamov | Pavlin Gyurov | Todor Primov | Svetla Boytcheva

We present an approach for medical text coding with SNOMED CT. Our approach uses publicly available linked open data from terminologies and ontologies as training data for the algorithms. We claim that even small training corpora made of short text snippets can be used to train models for the given task. We propose a method based on transformers enhanced with clustering and filtering of the candidates. Further, we adopt a classical machine learning approach, support vector classification (SVC), using transformer embeddings. The resulting approach proves to be more accurate than the predictions given by Large Language Models. We evaluate on a dataset generated from linked open data for SNOMED codes related to morphology and topography for four use cases. Our transformers-based approach achieves an F1-score of 0.82 for morphology and 0.99 for topography codes. Further, we validate the applicability of our approach in a clinical context using labelled real clinical data that are not used for model training.

pdf bib
Towards a Consensus Taxonomy for Annotating Errors in Automatically Generated Text
Rudali Huidrom | Anya Belz

Error analysis aims to provide insights into system errors at different levels of granularity. NLP as a field has a long-standing tradition of analysing and reporting errors, which is generally considered good practice. There are existing error taxonomies tailored for different types of NLP task. In this paper, we report our work reviewing existing research on meaning/content error types in generated text, attempt to identify emerging consensus among existing meaning/content error taxonomies, and propose a standardised error taxonomy on this basis. We find that there is virtually complete agreement at the highest taxonomic level where errors of meaning/content divide into (1) Content Omission, (2) Content Addition, and (3) Content Substitution. Consensus in the lower levels is less pronounced, but a compact standardised consensus taxonomy can nevertheless be derived that works across generation tasks and application domains.

pdf bib
Uncertainty Quantification of Text Classification in a Multi-Label Setting for Risk-Sensitive Systems
Jinha Hwang | Carol Gudumotu | Benyamin Ahmadnia

This paper addresses the challenge of uncertainty quantification in text classification for medical purposes and provides a three-fold approach to support robust and trustworthy decision-making by medical practitioners. Also, we address the challenge of imbalanced datasets in the medical domain by utilizing the Mondrian Conformal Predictor with a Naïve Bayes classifier.
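
The label-conditional idea behind a Mondrian conformal predictor can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy scores, and the choice of nonconformity score (for example, one minus the probability a Naïve Bayes classifier assigns to a label) are assumptions made for illustration. Each candidate label is compared only against calibration examples of that same class, which is what keeps rare classes from being swamped by frequent ones.

```python
from collections import defaultdict


def mondrian_p_values(cal_scores, cal_labels, test_scores):
    """Compute one p-value per candidate label (label-conditional calibration).

    cal_scores[i] is the nonconformity of calibration example i with respect
    to its true label cal_labels[i]. test_scores[label] is the nonconformity
    of the test example if it were assigned that label.
    """
    by_label = defaultdict(list)
    for score, label in zip(cal_scores, cal_labels):
        by_label[label].append(score)

    p_values = {}
    for label, score in test_scores.items():
        reference = by_label[label]  # only calibration examples of this class
        at_least_as_strange = sum(1 for s in reference if s >= score)
        p_values[label] = (at_least_as_strange + 1) / (len(reference) + 1)
    return p_values


def prediction_set(p_values, epsilon):
    """Keep every label whose p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > epsilon}


# Hypothetical, imbalanced calibration data: four "common" and two "rare" cases.
cal_scores = [0.1, 0.2, 0.3, 0.9, 0.4, 0.6]
cal_labels = ["common", "common", "common", "common", "rare", "rare"]
test_scores = {"common": 0.25, "rare": 0.5}

p = mondrian_p_values(cal_scores, cal_labels, test_scores)
print(p, prediction_set(p, epsilon=0.2))
```

A prediction set containing more than one label signals that the model cannot confidently separate the classes for that input at the chosen significance level.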

pdf bib
Pretraining Language- and Domain-Specific BERT on Automatically Translated Text
Tatsuya Ishigaki | Yui Uehara | Goran Topić | Hiroya Takamura

Domain-specific pretrained language models such as SciBERT are effective for various tasks involving text in specific domains. However, pretraining BERT requires a large-scale language resource, which is not necessarily available in fine-grained domains, especially in non-English languages. In this study, we focus on a setting with no available domain-specific text for pretraining. To this end, we propose a simple framework that trains a BERT on text in the target language automatically translated from a resource-rich language, e.g., English. In this paper, we particularly focus on the materials science domain in Japanese. Our experiments pertain to the task of entity and relation extraction for this domain and language. The experiments demonstrate that the various models pretrained on translated texts consistently perform better than the general BERT in terms of F1 scores although the domain-specific BERTs do not use any human-authored domain-specific text. These results imply that BERTs for various low-resource domains can be successfully trained on texts automatically translated from resource-rich languages.

pdf bib
Categorising Fine-to-Coarse Grained Misinformation: An Empirical Study of the COVID-19 Infodemic
Ye Jiang | Xingyi Song | Carolina Scarton | Iknoor Singh | Ahmet Aker | Kalina Bontcheva

The spread of COVID-19 misinformation on social media became a major challenge for citizens, with negative real-life consequences. Prior research focused on detection and/or analysis of COVID-19 misinformation. However, fine-grained classification of misinformation claims has been largely overlooked. The novel contribution of this paper is in introducing a new dataset which makes fine-grained distinctions between statements that assert, comment on, or question false COVID-19 claims. This new dataset not only enables social behaviour analysis but also enables us to address both evidence-based and non-evidence-based misinformation classification tasks. Lastly, through leave-claim-out cross-validation, we demonstrate that classifier performance on unseen COVID-19 misinformation claims is significantly different, as compared to performance on topics present in the training data.
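
Leave-claim-out cross-validation can be pictured as grouped cross-validation in which every statement about one claim is held out together. The sketch below is an assumption about how such splits might be built, not the paper's code; the tuple layout and the function name are hypothetical.

```python
def leave_claim_out_folds(examples):
    """Yield (held_out_claim, train, test) splits, one per distinct claim.

    examples is a list of (text, claim_id, label) tuples. All statements
    about the held-out claim go to the test split, so the classifier is
    always evaluated on a claim it never saw during training.
    """
    claims = sorted({claim_id for _, claim_id, _ in examples})
    for held_out in claims:
        train = [ex for ex in examples if ex[1] != held_out]
        test = [ex for ex in examples if ex[1] == held_out]
        yield held_out, train, test


# Hypothetical toy data: statements about three claims.
examples = [
    ("statement 1", "claim_a", "assert"),
    ("statement 2", "claim_a", "question"),
    ("statement 3", "claim_b", "comment"),
    ("statement 4", "claim_c", "assert"),
]

for held_out, train, test in leave_claim_out_folds(examples):
    print(held_out, len(train), len(test))
```

Comparing scores from these folds with scores from a random split is one way to expose the gap between seen and unseen claims.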

pdf bib
Bridging the Gap between Subword and Character Segmentation in Pretrained Language Models
Shun Kiyono | Sho Takase | Shengzhe Li | Toshinori Sato

Pretrained language models require the use of consistent segmentation (e.g., subword- or character-level segmentation) in pretraining and finetuning. In NLP, many tasks are modeled by subword-level segmentation better than by character-level segmentation. However, because of their format, several tasks require the use of character-level segmentation. Thus, in order to tackle both types of NLP tasks, language models must be independently pretrained for both subword and character-level segmentation. However, this is an inefficient and costly procedure. Instead, this paper proposes a method for training a language model with unified segmentation. This means that the trained model can be finetuned on both subword- and character-level segmentation. The principle of the method is to apply the subword regularization technique to generate a mixture of subword- and character-level segmentation. Through experiments on BERT models, we demonstrate that our method can halve the computational cost of pretraining.

pdf bib
Evaluating Data Augmentation for Medication Identification in Clinical Notes
Jordan Koontz | Maite Oronoz | Alicia Pérez

We evaluate the effectiveness of using data augmentation to improve the generalizability of a Named Entity Recognition model for the task of medication identification in clinical notes. We compare disparate data augmentation methods, namely mention-replacement and a generative model, for creating synthetic training examples. Through experiments on the n2c2 2022 Track 1 Contextualized Medication Event Extraction data set, we show that data augmentation with supplemental examples created with GPT-3 can boost the performance of a transformer-based model for small training sets.

pdf bib
Advancing Topical Text Classification: A Novel Distance-Based Method with Contextual Embeddings
Andriy Kosar | Guy De Pauw | Walter Daelemans

This study introduces a new method for distance-based unsupervised topical text classification using contextual embeddings. The method applies and tailors sentence embeddings for distance-based topical text classification. This is achieved by leveraging the semantic similarity between topic labels and text content, and reinforcing the relationship between them in a shared semantic space. The proposed method outperforms a wide range of existing sentence embeddings on average by 35%. Presenting an alternative to the commonly used transformer-based zero-shot general-purpose classifiers for multiclass text classification, the method demonstrates significant advantages in terms of computational efficiency and flexibility, while maintaining comparable or improved classification results.
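
The basic mechanism of distance-based classification can be sketched as follows: embed the topic labels and the text in the same space, then pick the label whose embedding is closest to the text. The three-dimensional vectors below are hand-made placeholders standing in for sentence embeddings, and the sketch omits the tailoring step the abstract describes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify(text_vec, label_vecs):
    """Return the label whose embedding is most similar to the text embedding."""
    return max(label_vecs, key=lambda label: cosine(text_vec, label_vecs[label]))


# Placeholder embeddings; a real system would obtain these from an encoder.
label_vecs = {
    "sports": [1.0, 0.1, 0.0],
    "politics": [0.0, 1.0, 0.2],
}
text_vec = [0.9, 0.2, 0.1]

print(classify(text_vec, label_vecs))
```

Because no classifier is trained, adding a new topic only requires embedding its label.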

pdf bib
Taxonomy-Based Automation of Prior Approval Using Clinical Guidelines
Saranya Krishnamoorthy | Ayush Singh

Performing prior authorization on patients in a medical facility is a time-consuming and challenging task for insurance companies. Automating the clinical decisions that lead to authorization can reduce the time that staff spend executing such procedures. To better facilitate such critical decision making, we present an automated approach to predict one of the challenging tasks in the process called primary clinical indicator prediction, which is the outcome of this procedure. The proposed solution is to create a taxonomy to capture the main categories in primary clinical indicators. Our approach involves an important step of selecting what is known as the “primary indicator” – one of the several heuristics based on clinical guidelines that are published and publicly available. A taxonomy-based PI classification system was created to help in the recognition of PIs from free text in electronic health records (EHRs). This taxonomy includes comprehensive explanations of each PI, as well as examples of free text that could be used to detect each PI. The major contribution of this work is to introduce a taxonomy created by three professional nurses with many years of experience. We experiment with several state-of-the-art supervised and unsupervised techniques with a focus on prior approval for spinal imaging. The results indicate that the proposed taxonomy is capable of increasing the performance of unsupervised approaches by up to 10 F1 points. Further, in the supervised setting, we achieve an F1 score of 0.61 using a conventional technique based on term frequency–inverse document frequency that outperforms other deep-learning approaches.

pdf bib
Simultaneous Interpreting as a Noisy Channel: How Much Information Gets Through
Maria Kunilovskaya | Heike Przybyl | Ekaterina Lapshinova-Koltunski | Elke Teich

We explore the relationship between information density/surprisal of source and target texts in translation and interpreting in the language pair English-German, looking at the specific properties of translation (“translationese”). Our data comes from two bidirectional English-German subcorpora representing written and spoken mediation modes collected from European Parliament proceedings. Within each language, we (a) compare original speeches to their translated or interpreted counterparts, and (b) explore the association between segment-aligned sources and targets in each translation direction. As additional variables, we consider source delivery mode (read-out, impromptu) and source speech rate in interpreting. We use language modelling to measure the information rendered by words in a segment and to characterise the cross-lingual transfer of information under various conditions. Our approach is based on statistical analyses of surprisal values, extracted from n-gram models of our dataset. The analysis reveals that while there is a considerable positive correlation between the average surprisal of source and target segments in both modes, information output in interpreting is lower than in translation, given the same amount of input. Significantly lower information density in spoken mediated production compared to non-mediated speech in the same language can indicate a possible simplification effect in interpreting.
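
As an illustration of how average surprisal per segment can be read off an n-gram model, here is a minimal bigram sketch with add-one smoothing. The toy corpus, the smoothing choice, and the function names are assumptions for illustration; the paper's models and data differ.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over tokenised sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, len(vocab)


def avg_surprisal(sent, contexts, bigrams, vocab_size):
    """Mean surprisal in bits per word: -log2 P(word | previous word)."""
    tokens = ["<s>"] + sent
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from having zero probability.
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total += -math.log2(prob)
    return total / len(sent)


corpus = [["the", "house", "votes"], ["the", "house", "adjourns"]]
model = train_bigram(corpus)

print(avg_surprisal(["the", "house", "votes"], *model))
print(avg_surprisal(["votes", "the", "house"], *model))
```

A segment the model has seen receives lower average surprisal than the same words in an unfamiliar order, which is the sense in which surprisal serves as a proxy for information content.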

pdf bib
Challenges of GPT-3-Based Conversational Agents for Healthcare
Fabian Lechner | Allison Lahnala | Charles Welch | Lucie Flek

The potential of medical domain dialogue agents lies in their ability to provide patients with faster information access while enabling medical specialists to concentrate on critical tasks. However, the integration of large language models (LLMs) into these agents presents certain limitations that may result in serious consequences. This paper investigates the challenges and risks of using GPT-3-based models for medical question-answering (MedQA). We perform several evaluations contextualized in terms of standard medical principles. We provide a procedure for manually designing patient queries to stress-test high-risk limitations of LLMs in MedQA systems. Our analysis reveals that LLMs fail to respond adequately to these queries, generating erroneous medical information, unsafe recommendations, and content that may be considered offensive.

pdf bib
Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks
João Leite | Carolina Scarton | Diego Silva

Online social media is rife with offensive and hateful comments, prompting the need for their automatic detection given the sheer amount of posts created every second. Creating high-quality human-labelled datasets for this task is difficult and costly, especially because non-offensive posts are significantly more frequent than offensive ones. However, unlabelled data is abundant, easier, and cheaper to obtain. In this scenario, self-training methods, using weakly-labelled examples to increase the amount of training data, can be employed. Recent “noisy” self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against noisy data and adversarial attacks. In this paper, we experiment with default and noisy self-training using three different textual data augmentation techniques across five different pre-trained BERT architectures varying in size. We evaluate our experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, resulting in up to +1.5% F1-macro on both datasets, and (ii) noisy self-training with textual data augmentations, despite being successfully applied in similar settings, decreases performance on offensive and hate-speech domains when compared to the default method, even with state-of-the-art augmentations such as backtranslation.

pdf bib
A Practical Survey on Zero-Shot Prompt Design for In-Context Learning
Yinheng Li

The remarkable advancements in large language models (LLMs) have brought about significant improvements in Natural Language Processing (NLP) tasks. This paper presents a comprehensive review of in-context learning techniques, focusing on different types of prompts, including discrete, continuous, few-shot, and zero-shot, and their impact on LLM performance. We explore various approaches to prompt design, such as manual design, optimization algorithms, and evaluation methods, to optimize LLM performance across diverse tasks. Our review covers key research studies in prompt engineering, discussing their methodologies and contributions to the field. We also delve into the challenges faced in evaluating prompt performance, given the absence of a single “best” prompt and the importance of considering multiple metrics. In conclusion, the paper highlights the critical role of prompt design in harnessing the full potential of LLMs and provides insights into the combination of manual design, optimization techniques, and rigorous evaluation for more effective and efficient use of LLMs in various NLP tasks.

pdf bib
Classifying COVID-19 Vaccine Narratives
Yue Li | Carolina Scarton | Xingyi Song | Kalina Bontcheva

Vaccine hesitancy is widespread, despite the government’s information campaigns and the efforts of the World Health Organisation (WHO). Categorising the topics within vaccine-related narratives is crucial to understand the concerns expressed in discussions and identify the specific issues that contribute to vaccine hesitancy. This paper addresses the need for monitoring and analysing vaccine narratives online by introducing a novel vaccine narrative classification task, which categorises COVID-19 vaccine claims into one of seven categories. Following a data augmentation approach, we first construct a novel dataset for this new classification task, focusing on the minority classes. We also make use of fact-checker annotated data. The paper also presents a neural vaccine narrative classifier that achieves an accuracy of 84% under cross-validation. The classifier is publicly available for researchers and journalists.

pdf bib
Sign Language Recognition and Translation: A Multi-Modal Approach Using Computer Vision and Natural Language Processing
Jacky Li | Jaren Gerdes | James Gojit | Austin Tao | Samyak Katke | Kate Nguyen | Benyamin Ahmadnia

Sign-to-Text (S2T) is a hand gesture recognition program in the American Sign Language (ASL) domain. The primary objective of S2T is to classify standard ASL alphabets and custom signs and convert the classifications into a stream of text using neural networks. This paper addresses the shortcomings of pure Computer Vision techniques and applies Natural Language Processing (NLP) as an additional layer of complexity to increase S2T’s robustness.

pdf bib
Classification-Aware Neural Topic Model Combined with Interpretable Analysis - for Conflict Classification
Tianyu Liang | Yida Mu | Soonho Kim | Darline Kuate | Julie Lang | Rob Vos | Xingyi Song

A large number of conflict events are affecting the world all the time. In order to analyse such conflict events effectively, this paper presents a Classification-Aware Neural Topic Model (CANTM-IA) for Conflict Information Classification and Topic Discovery. The model provides a reliable interpretation of classification results and discovered topics by introducing interpretability analysis. At the same time, interpretation is introduced into the model architecture to improve the classification performance of the model and to allow interpretation to focus further on the details of the data. Finally, the model architecture is optimised to reduce the complexity of the model.

pdf bib
Data Augmentation for Fake Reviews Detection
Ming Liu | Massimo Poesio

In this research, we studied the relationship between data augmentation and model accuracy for the task of fake review detection. We used data generation methods to augment two different fake review datasets and compared the performance of models trained with the original data and with the augmented data. Our results show that the accuracy of our fake review detection model can be improved by 0.31 percentage points on DeRev Test and by 7.65 percentage points on Amazon Test by using the augmented datasets.

pdf bib
Coherent Story Generation with Structured Knowledge
Congda Ma | Kotaro Funakoshi | Kiyoaki Shirai | Manabu Okumura

The emergence of pre-trained language models has taken story generation, which is the task of automatically generating a comprehensible story from limited information, to a new stage. Although generated stories from the language models are fluent and grammatically correct, the lack of coherence affects their quality. We propose a knowledge-based multi-stage model that incorporates the schema, a kind of structured knowledge, to guide coherent story generation. Our framework includes a schema acquisition module, a plot generation module, and a surface realization module. In the schema acquisition module, highly relevant structured knowledge pieces are selected as a schema. In the plot generation module, a coherent plot plan is navigated by the schema. In the surface realization module, conditioned by the generated plot, a story is generated. Evaluations show that our methods can generate more comprehensible stories than strong baselines, especially with higher global coherence and less repetition.

pdf bib
Studying Common Ground Instantiation Using Audio, Video and Brain Behaviours: The BrainKT Corpus
Eliot Maës | Thierry Legou | Leonor Becerra | Philippe Blache

An increasing amount of multimodal recordings has been paving the way for the development of a more automatic way to study language and conversational interactions. However, this data largely comprises audio and video recordings, leaving aside other modalities that might complement this external view of the conversation but might be more difficult to collect in naturalistic setups, such as participants’ brain activity. In this context, we present BrainKT, a natural conversational corpus with audio, video and neuro-physiological signals, collected with the aim of studying information exchanges and common ground instantiation in conversation in a new, more in-depth way. We recorded conversations from 28 dyads (56 participants) during 30-minute experiments where subjects were first tasked to collaborate on a joint information game, then freely drifted to the topic of their choice. During each session, audio and video were captured, along with the participants’ neural signal (EEG with Biosemi 64) and their electro-physiological activity (with Empatica-E4). The paper situates this new type of resource in the literature, presents the experimental setup and describes the different kinds of annotations considered for the corpus.

pdf bib
Reading between the Lines: Information Extraction from Industry Requirements
Ole Magnus Holter | Basil Ell

Industry requirements describe the qualities that a project or a service must provide. Most requirements are, however, only available in natural language format and are embedded in textual documents. To be machine-understandable, a requirement needs to be represented in a logical format. We consider that a requirement consists of a scope, which is the requirement’s subject matter, a condition, which is any condition that must be fulfilled for the requirement to be relevant, and a demand, which is what is required. We introduce a novel task, the identification of the semantic components scope, condition, and demand in a requirement sentence, and establish baselines using sequence labelling and few-shot learning. One major challenge with this task is the implicit nature of the scope, often not stated in the sentence. By including document context information, we improved the average performance for scope detection. Our study provides insights into the difficulty of machine understanding of industry requirements and suggests strategies for addressing this challenge.

pdf bib
Transformer-Based Language Models for Bulgarian
Iva Marinova | Kiril Simov | Petya Osenova

This paper presents an approach for training lightweight and robust language models for Bulgarian that mitigate gender, political, racial, and other biases in the data. Our method involves scraping content from major Bulgarian online media providers using a specialized procedure for source filtering, topic selection, and lexicon-based removal of inappropriate language during the pre-training phase. We continuously improve the models by incorporating new data from various domains, including social media, books, scientific literature, and linguistically modified corpora. Our motivation is to provide a solution that is sufficient for all natural language processing tasks in Bulgarian, and to address the lack of existing procedures for guaranteeing the robustness of such models.

pdf bib
Multi-task Ensemble Learning for Fake Reviews Detection and Helpfulness Prediction: A Novel Approach
Alimuddin Melleng | Anna Jurek-Loughrey | Deepak P

Research on fake reviews detection and review helpfulness prediction is prevalent, yet most studies tend to focus solely on either fake reviews detection or review helpfulness prediction, considering them separate research tasks. In contrast to this prevailing pattern, we address both challenges concurrently by employing a multi-task learning approach. We posit that undertaking these tasks simultaneously can enhance the performance of each task through shared information among features. We utilize pre-trained RoBERTa embeddings with a document-level data representation. This is coupled with an array of deep learning and neural network models, including Bi-LSTM, LSTM, GRU, and CNN. Additionally, we employ ensemble learning techniques to integrate these models, with the objective of enhancing overall prediction accuracy and mitigating the risk of overfitting. The findings of this study offer valuable insights to the fields of natural language processing and machine learning and present a novel perspective on leveraging multi-task learning for the twin challenges of fake reviews detection and review helpfulness prediction.

pdf bib
Data Fusion for Better Fake Reviews Detection
Alimuddin Melleng | Anna Jurek-Loughrey | Deepak P

Online reviews have become critical in informing purchasing decisions, making the detection of fake reviews a crucial challenge to tackle. Many different Machine Learning based solutions have been proposed, using various data representations such as n-grams or document embeddings. In this paper, we first explore the effectiveness of different data representations, including emotion, document embedding, n-grams, and noun phrases in embedding format, for fake reviews detection. We evaluate these representations with various state-of-the-art deep learning models, such as BILSTM, LSTM, GRU, CNN, and MLP. Following this, we propose to incorporate different data representations and classification models using early and late data fusion techniques in order to improve the prediction performance. The experiments are conducted on four datasets: Hotel, Restaurant, Amazon, and Yelp. The results demonstrate that combinations of different data representations significantly outperform any of the single data representations.
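
The difference between early and late fusion can be shown in a few lines. This is a generic sketch under simple assumptions (concatenation for early fusion, probability averaging for late fusion), not the specific fusion schemes evaluated in the paper; the function names and numbers are illustrative.

```python
def early_fusion(feature_sets):
    """Concatenate per-representation feature vectors into one model input."""
    return [value for features in feature_sets for value in features]


def late_fusion(prob_lists):
    """Average the class probabilities predicted by separately trained models."""
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]


# Hypothetical features for one review from two representations.
emotion_features = [0.2, 0.7]
ngram_features = [1.0, 0.0, 3.0]
print(early_fusion([emotion_features, ngram_features]))

# Hypothetical [fake, genuine] probabilities from two single-representation models.
print(late_fusion([[0.8, 0.2], [0.4, 0.6]]))
```

Early fusion lets a single model learn interactions between representations, whereas late fusion keeps the models independent and combines only their decisions.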

pdf bib
Dimensions of Quality: Contrasting Stylistic vs. Semantic Features for Modelling Literary Quality in 9,000 Novels
Pascale Moreira | Yuri Bizzoni

In computational literary studies, the challenging task of predicting quality or reader-appreciation of narrative texts is confounded by volatile definitions of quality and the vast feature space that may be considered in modeling. In this paper, we explore two different types of feature sets: stylistic features on one hand, and semantic features on the other. We conduct experiments on a corpus of 9,089 English language literary novels published in the 19th and 20th century, using GoodReads’ ratings as a proxy for reader-appreciation. Examining the potential of both approaches, we find that some types of books are more predictable in one model than in the other, which may indicate that texts have different prominent characteristics (stylistic complexity, a certain narrative progression at the sentiment-level).

pdf bib
BanglaBait: Semi-Supervised Adversarial Approach for Clickbait Detection on Bangla Clickbait Dataset
Md. Motahar Mahtab | Monirul Haque | Mehedi Hasan | Farig Sadeque

Intentionally luring readers to click on a particular content by exploiting their curiosity defines a title as clickbait. Although several studies focused on detecting clickbait titles in English articles, low-resource languages like Bangla have not been given adequate attention. To tackle clickbait titles in Bangla, we have constructed the first Bangla clickbait detection dataset containing 15,056 labeled news articles and 65,406 unlabelled news articles extracted from clickbait-dense news sites. Each article has been labeled by three expert linguists and includes an article’s title, body, and other metadata. By incorporating labeled and unlabelled data, we finetune a pre-trained Bangla transformer model in an adversarial fashion using Semi-Supervised Generative Adversarial Networks (SS-GANs). The proposed model acts as a good baseline for this dataset, outperforming traditional neural network models (LSTM, GRU, CNN) and linguistic feature-based models. We expect that this dataset and the detailed analysis and comparison of these clickbait detection models will provide a fundamental basis for future research into detecting clickbait titles in Bengali articles.

pdf bib
TreeSwap: Data Augmentation for Machine Translation via Dependency Subtree Swapping
Attila Nagy | Dorina Lakatos | Botond Barta | Judit Ács

Data augmentation methods for neural machine translation are particularly useful when a limited amount of training data is available, which is often the case when dealing with low-resource languages. We introduce a novel augmentation method, which generates new sentences by swapping objects and subjects across bisentences. This is performed simultaneously based on the dependency parse trees of the source and target sentences. We name this method TreeSwap. Our results show that TreeSwap achieves consistent improvements over baseline models in 4 language pairs in both directions on resource-constrained datasets. We also explore domain-specific corpora, but find that our method does not make significant improvements on law, medical and IT data. We report the scores of similar augmentation methods and find that TreeSwap performs comparably. We also analyze the generated sentences qualitatively and find that the augmentation produces a correct translation in most cases. Our code is available on GitHub.
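
The swapping step can be sketched on pre-parsed data. The example below is not the released TreeSwap code: it assumes the subject subtree of each sentence has already been located by a dependency parser and is given as a contiguous token span, and the dictionary keys and the English-German sentences are hypothetical.

```python
def swap_spans(tokens_a, span_a, tokens_b, span_b):
    """Exchange the token spans [start, end) between two sentences."""
    (a0, a1), (b0, b1) = span_a, span_b
    new_a = tokens_a[:a0] + tokens_b[b0:b1] + tokens_a[a1:]
    new_b = tokens_b[:b0] + tokens_a[a0:a1] + tokens_b[b1:]
    return new_a, new_b


def tree_swap(pair_a, pair_b):
    """Swap subject subtrees on the source and target side simultaneously.

    Each pair is a dict with token lists "src" and "tgt" and the subject
    spans "src_subj" and "tgt_subj" (assumed to come from a parser).
    """
    src_a, src_b = swap_spans(pair_a["src"], pair_a["src_subj"],
                              pair_b["src"], pair_b["src_subj"])
    tgt_a, tgt_b = swap_spans(pair_a["tgt"], pair_a["tgt_subj"],
                              pair_b["tgt"], pair_b["tgt_subj"])
    return (src_a, tgt_a), (src_b, tgt_b)


pair_a = {"src": ["The", "dog", "sleeps"], "src_subj": (0, 2),
          "tgt": ["Der", "Hund", "schläft"], "tgt_subj": (0, 2)}
pair_b = {"src": ["My", "sister", "reads", "a", "book"], "src_subj": (0, 2),
          "tgt": ["Meine", "Schwester", "liest", "ein", "Buch"], "tgt_subj": (0, 2)}

for src, tgt in tree_swap(pair_a, pair_b):
    print(" ".join(src), "|", " ".join(tgt))
```

Applying the same swap to both sides is what keeps the new source and target sentences aligned; agreement errors can still arise when the swapped subjects differ in number or person.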

pdf bib
Automatic Assessment Of Spoken English Proficiency Based on Multimodal and Multitask Transformers
Kamel Nebhi | György Szaszák

This paper describes technology developed to automatically grade students on their spontaneous spoken English proficiency according to Common European Framework of Reference for Languages (CEFR) levels. Our automated assessment system contains two tasks: elicited imitation and spontaneous speech assessment. Spontaneous speech assessment is a challenging task that requires evaluating various aspects of speech quality, content, and coherence. In this paper, we propose a multimodal and multitask transformer model that leverages both audio and text features to perform three tasks: scoring, coherence modeling, and prompt relevancy scoring. Our model uses a fusion of multiple features and multiple modality attention to capture the interactions between audio and text modalities and learn from different sources of information.

pdf bib
Medical Concept Mention Identification in Social Media Posts Using a Small Number of Sample References
Vasudevan Nedumpozhimana | Sneha Rautmare | Meegan Gower | Nishtha Jain | Maja Popović | Patricia Buffini | John Kelleher

Identification of mentions of medical concepts in social media text can provide useful information for caseload prediction of diseases like Covid-19 and measles. We propose a simple model for the automatic identification of medical concept mentions in social media text. We validate the effectiveness of the proposed model on Twitter, Reddit, and News/Media datasets.

pdf bib
Context-Aware Module Selection in Modular Dialog Systems
Jan Nehring | René Marcel Berk | Stefan Hillmann

In modular dialog systems, a dialog system consists of multiple conversational agents. The “module selection” task selects the appropriate sub-dialog system for an incoming user utterance. Current models for module selection use features derived from the current user turn only, such as the utterance’s text or confidence values of the natural language understanding systems of the individual conversational agents, or they perform text classification on the user utterance. However, dialogs often span multiple turns, and turns are embedded into a context. Therefore, looking at the current user turn only is a source of error in certain situations. This work proposes four models for module selection that incorporate both the dialog history and the current user turn. We show that these models surpass the current state of the art in module selection.

pdf bib
Human Value Detection from Bilingual Sensory Product Reviews
Boyu Niu | Céline Manetta | Frédérique Segond

We applied text classification methods to a corpus of product reviews that we created with the help of a questionnaire. We found that for certain values, “traditional” deep neural networks like CNN can give promising results compared to the baseline. We propose some ideas to improve the results in the future. The bilingual corpus we created, which contains more than 16 000 consumer reviews associated with the human value profile of the authors, can be used for different marketing purposes.

pdf bib
Word Sense Disambiguation for Automatic Translation of Medical Dialogues into Pictographs
Magali Norré | Rémi Cardon | Vincent Vandeghinste | Thomas François

Word sense disambiguation is an NLP task embedded in different applications. We propose to evaluate its contribution to the automatic translation of French texts into pictographs, in the context of communication between doctors and patients with an intellectual disability. Different general and/or medical language models (Word2Vec, fastText, CamemBERT, FlauBERT, DrBERT, and CamemBERT-bio) are tested in order to choose semantically correct pictographs leveraging the synsets in the French WordNets (WOLF and WoNeF). The results of our automatic evaluations show that our method based on Word2Vec and fastText significantly improves the precision of medical translations into pictographs. We also present an evaluation corpus adapted to this task.

pdf bib
A Research-Based Guide for the Creation and Deployment of a Low-Resource Machine Translation System
John E. Ortega | Kenneth Church

The machine translation (MT) field seems to focus heavily on English and other high-resource languages. However, low-resource MT (LRMT) is receiving more attention than in the past. Successful LRMT systems (LRMTS) should make a compelling business case in terms of demand, cost and quality in order to be viable for end users. When used by communities where low-resource languages are spoken, LRMT quality should not only be determined by the use of traditional metrics like BLEU, but it should also take into account other factors in order to be inclusive and not risk overall rejection by the community. MT systems based on neural methods tend to perform better with high volumes of training data, but they may be unrealistic and even harmful for LRMT. It is obvious that for research purposes, the development and creation of LRMTS is necessary. However, in this article, we argue that two main workarounds could be considered by companies that are considering deployment of LRMTS in the wild: human-in-the-loop and sub-domains.

pdf bib
MQDD: Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain
Jan Pasek | Jakub Sido | Miloslav Konopik | Ondrej Prazak

This work proposes a new pipeline for leveraging data collected on the Stack Overflow website for pre-training a multimodal model for searching duplicates on question answering websites. Our multimodal model is trained on question descriptions and source code in multiple programming languages. We design two new learning objectives to improve duplicate detection capabilities. The result of this work is a mature, fine-tuned Multimodal Question Duplicity Detection (MQDD) model, ready to be integrated into a Stack Overflow search system, where it can help users find answers for already answered questions. Alongside the MQDD model, we release two datasets related to the software engineering domain. The first, the Stack Overflow Dataset (SOD), is a massive corpus of paired questions and answers. The second, the Stack Overflow Duplicity Dataset (SODD), contains data for training duplicate detection models.

pdf bib
Forming Trees with Treeformers
Nilay Patel | Jeffrey Flanigan

Human language is known to exhibit a nested, hierarchical structure, allowing us to form complex sentences out of smaller pieces. However, many state-of-the-art neural network models such as Transformers have no explicit hierarchical structure in their architecture; that is, they don’t have an inductive bias toward hierarchical structure. Additionally, Transformers are known to perform poorly on compositional generalization tasks which require such structures. In this paper, we introduce Treeformer, a general-purpose encoder module inspired by the CKY algorithm which learns a composition operator and pooling function to construct hierarchical encodings for phrases and sentences. Our extensive experiments demonstrate the benefits of incorporating hierarchical structure into the Transformer and show significant improvements in compositional generalization as well as in downstream tasks such as machine translation, abstractive summarization, and various natural language understanding tasks.

pdf bib
Evaluating Unsupervised Hierarchical Topic Models Using a Labeled Dataset
Judicael Poumay | Ashwin Ittoo

Topic modeling is a commonly used method for identifying and extracting topics from a corpus of documents. While several evaluation techniques, such as perplexity and topic coherence, have been developed to assess the quality of extracted topics, they fail to determine whether all topics have been identified and to what extent they have been represented. Additionally, hierarchical topic models have been proposed, but the quality of the hierarchy produced has not been adequately evaluated. This study proposes a novel approach to evaluating topic models that supplements existing methods. Using a labeled dataset, we trained hierarchical topic models in an unsupervised manner and used the known labels to evaluate the accuracy of the results. Our findings indicate that labels encompassing a substantial number of documents achieve high accuracy of over 70%. Although there are 90 labels in the dataset, labels that cover only 1% of the data still achieve an average accuracy of 37.9%, demonstrating the effectiveness of hierarchical topic models even on smaller subsets. Furthermore, we demonstrate that these labels can be used to assess the quality of the topic tree and confirm that hierarchical topic models produce coherent taxonomies for the labels.

pdf bib
HTMOT: Hierarchical Topic Modelling over Time
Judicael Poumay | Ashwin Ittoo

Topic models provide an efficient way of extracting insights from text and supporting decision-making. Recently, novel methods have been proposed to model topic hierarchy or temporality. Modeling temporality provides more precise topics by separating topics that are characterized by similar words but located over distinct time periods. Conversely, modeling hierarchy provides a more detailed view of the content of a corpus by providing topics and sub-topics. However, no models have been proposed that incorporate both hierarchy and temporality, which could be beneficial for applications such as environment scanning. Therefore, we propose a novel method to perform Hierarchical Topic Modelling Over Time (HTMOT). We evaluate the performance of our approach on a corpus of news articles using the Word Intrusion task. Results demonstrate that our model produces topics that elegantly combine a hierarchical structure and a temporal aspect. Furthermore, our proposed Gibbs sampling implementation shows competitive performance compared to previous state-of-the-art methods.

pdf bib
Multilingual Continual Learning Approaches for Text Classification
Karan Praharaj | Irina Matveeva

Multilingual continual learning is important for models that are designed to be deployed over long periods of time and are required to be updated when new data becomes available. Such models are continually applied to new unseen data that can be in any of the supported languages. One challenge in this scenario is to ensure consistent performance of the model throughout the deployment lifecycle, beginning from the moment of first deployment. We empirically assess the strengths and shortcomings of some continual learning methods in a multilingual setting across two tasks.

pdf bib
Can Model Fusing Help Transformers in Long Document Classification? An Empirical Study
Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov

Text classification is an area of research which has been studied over the years in Natural Language Processing (NLP). Adapting NLP to multiple domains has introduced many new challenges for text classification and one of them is long document classification. While state-of-the-art transformer models provide excellent results in text classification, most of them have limitations in the maximum sequence length of the input sequence. The majority of the transformer models are limited to 512 tokens, and therefore, they struggle with long document classification problems. In this research, we explore employing Model Fusing for long document classification while comparing the results with well-known BERT and Longformer architectures.
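
To make the 512-token limit concrete, the sketch below splits a long token sequence into overlapping windows that each fit within that limit. This is a common generic workaround and is not the Model Fusing approach studied in the paper; the function name `chunk_tokens` and the `stride` value are assumptions chosen for illustration.

```python
def chunk_tokens(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so neighbouring windows
    overlap by max_len - stride tokens when stride < max_len.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the document.
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks


# A hypothetical document of 1200 token ids.
document = list(range(1200))
windows = chunk_tokens(document)
print(len(windows), [len(w) for w in windows])
```

Each window could then be classified separately and the per-window predictions combined, for example by averaging; how to combine them is a design choice outside this sketch.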

pdf bib
Deep Learning Methods for Identification of Multiword Flower and Plant Names
Damith Premasiri | Amal Haddad Haddad | Tharindu Ranasinghe | Ruslan Mitkov

Multiword Terms (MWTs) are domain-specific Multiword Expressions (MWE) where two or more lexemes converge to form a new unit of meaning. The task of processing MWTs is crucial in many Natural Language Processing (NLP) applications, including Machine Translation (MT) and terminology extraction. However, the automatic detection of those terms is a difficult task and more research is still required to give more insightful and useful results in this field. In this study, we seek to fill this gap using state-of-the-art transformer models. We evaluate both BERT-like discriminative transformer models and generative pre-trained transformer (GPT) models on this task, and we show that discriminative models perform better than current GPT models on the multiword term identification task for flower and plant names in English and Spanish. The best discriminative models achieve F1 scores of 94.3127% and 82.1733% on English and Spanish data, respectively, while ChatGPT achieves only 63.3183% and 47.7925%, respectively.

pdf bib
Improving Aspect-Based Sentiment with End-to-End Semantic Role Labeling Model
Pavel Přibáň | Ondrej Prazak

This paper presents a series of approaches aimed at enhancing the performance of Aspect-Based Sentiment Analysis (ABSA) by utilizing extracted semantic information from a Semantic Role Labeling (SRL) model. We propose a novel end-to-end Semantic Role Labeling model that effectively captures most of the structured semantic information within the Transformer hidden state. We believe that this end-to-end model is well-suited for our newly proposed models that incorporate semantic information. We evaluate the proposed models in two languages, English and Czech, employing ELECTRA-small models. Our combined models improve ABSA performance in both languages. Moreover, we achieved new state-of-the-art results on the Czech ABSA.

pdf bib
huPWKP: A Hungarian Text Simplification Corpus
Noémi Prótár | Dávid Márk Nemeskey

In this article we introduce huPWKP, the first parallel corpus consisting of Hungarian standard language-simplified sentence pairs. As Hungarian is a quite low-resource language with regard to text simplification, we opted for translating an already existing corpus, PWKP (Zhu et al., 2010), on which we performed some cleaning in order to improve its quality. We evaluated the corpus both with the help of human evaluators and by training a seq2seq model on both the Hungarian corpus and the original (cleaned) English corpus. The Hungarian model performed slightly worse in terms of automatic metrics; however, the English model attains a SARI score close to the state of the art on the official PWKP set. According to the human evaluation, the corpus performs at around 3 on a scale ranging from 1 to 5 in terms of information retention and increase in simplification and around 3.7 in terms of grammaticality.

pdf bib
Topic Modeling Using Community Detection on a Word Association Graph
Mahfuzur Rahman Chowdhury | Intesur Ahmed | Farig Sadeque | Muhammad Yanhaona

Topic modeling of a text corpus is one of the most well-studied areas of information retrieval and knowledge discovery. Despite several decades of research in the area that has produced an array of modeling tools, some common problems still obstruct automated topic modeling from matching users’ expectations. In particular, existing topic modeling solutions suffer when the distribution of words among the underlying topics is uneven or the topics overlap. Furthermore, many solutions ask the user to provide a topic count estimate as input, which limits their usefulness in modeling a corpus where such information is unavailable. We propose a new topic modeling approach that overcomes these shortcomings by formulating the topic modeling problem as a community detection problem in a word association graph/network that we generate from the text corpus. Experimental evaluation using multiple data sets of three different types of text corpora shows that our approach is superior to prominent topic modeling alternatives in most cases. This paper describes our approach and discusses the experimental findings.
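
The general idea of treating topics as communities in a word graph can be sketched as follows. This is only a toy illustration and not the authors' method: the co-occurrence threshold is a hypothetical parameter, and connected components are used here as a deliberately simple stand-in for a real community detection algorithm. Note that no topic count is supplied as input; the number of groups falls out of the graph structure.

```python
from collections import defaultdict
from itertools import combinations


def build_word_graph(docs, min_cooccurrence=2):
    """Link two words if they co-occur in at least min_cooccurrence documents."""
    counts = defaultdict(int)
    for doc in docs:
        # Count each unordered word pair once per document.
        for a, b in combinations(sorted(set(doc)), 2):
            counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_cooccurrence:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group words into components (a simple stand-in for community detection)."""
    seen = set()
    components = []
    for node in sorted(graph):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.add(current)
            stack.extend(graph[current] - seen)
        components.append(component)
    return components


# Hypothetical tokenized mini-corpus.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["vote", "election", "party"],
    ["party", "vote", "poll"],
]
topics = connected_components(build_word_graph(docs))
print(topics)
```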

pdf bib
Exploring Techniques to Detect and Mitigate Non-Inclusive Language Bias in Marketing Communications Using a Dictionary-Based Approach
Bharathi Raja Chakravarthi | Prasanna Kumar Kumaresan | Rahul Ponnusamy | John P. McCrae | Michaela Comerford | Jay Megaro | Deniz Keles | Last Feremenga

We propose a new dataset for detecting non-inclusive language in sentences in English. These sentences were gathered from public sites, explaining what is inclusive and what is non-inclusive. We also extracted potentially non-inclusive keywords/phrases from the guidelines from business websites. A phrase dictionary was created by using an automatic extension with a word embedding trained on a massive corpus of general English text. In the end, a phrase dictionary was constructed by hand-editing the previous one to exclude inappropriate expansions and add the keywords from the guidelines. In a business context, the words individuals use can significantly impact the culture of inclusion and the quality of interactions with clients and prospects. Knowing the right words to avoid helps customers of different backgrounds and historically excluded groups feel included. They can make it easier to have productive, engaging, and positive communications. You can find the dictionaries, the code, and the method for making requests for the corpus at (we will release the link for data and code once the paper is accepted).

pdf bib
Does the “Most Sinfully Decadent Cake Ever” Taste Good? Answering Yes/No Questions from Figurative Contexts
Geetanjali Rakshit | Jeffrey Flanigan

Figurative language is commonplace in natural language, and while making communication memorable and creative, can be difficult to understand. In this work, we investigate the robustness of Question Answering (QA) models on figurative text. Yes/no questions, in particular, are a useful probe of figurative language understanding capabilities of large language models. We propose FigurativeQA, a set of 1000 yes/no questions with figurative and non-figurative contexts, extracted from the domains of restaurant and product reviews. We show that state-of-the-art BERT-based QA models exhibit an average performance drop of up to 15 percentage points when answering questions from figurative contexts, as compared to non-figurative ones. While models like GPT-3 and ChatGPT are better at handling figurative texts, we show that further performance gains can be achieved by automatically simplifying the figurative contexts into their non-figurative (literal) counterparts. We find that the best overall model is ChatGPT with chain-of-thought prompting to generate non-figurative contexts. Our work provides a promising direction for building more robust QA models with figurative language understanding capabilities.

pdf bib
Modeling Easiness for Training Transformers with Curriculum Learning
Leonardo Ranaldi | Giulia Pucci | Fabio Massimo Zanzotto

Directly learning from complex examples is generally problematic for humans and machines. Indeed, a better strategy is exposing learners to examples in a reasonable, pedagogically-motivated order. Curriculum Learning (CL) has been proposed to import this strategy when training machine learning models. In this paper, building on Curriculum Learning, we propose a novel, linguistically motivated measure to determine example complexity for organizing examples during learning. Our complexity measure, LRC, is based on length, rarity, and comprehensibility. Our resulting learning model is CL-LRC, that is, CL with LRC. Experiments on downstream tasks show that CL-LRC outperforms existing CL and non-CL methods for training BERT and RoBERTa from scratch. Furthermore, we analyzed different measures, including perplexity, loss, and learning curve of different models pre-trained from scratch, showing that CL-LRC performs better than the state-of-the-art.
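
The ordering step of curriculum learning can be sketched as below. The scoring function is a hypothetical stand-in and not the LRC measure from the paper: it uses only sentence length and a frequency-based rarity term, omits comprehensibility entirely, and simply adds the two components. All names are assumptions made for illustration.

```python
import math
from collections import Counter


def toy_complexity(tokens, freq, total):
    """Hypothetical score: sentence length plus mean token rarity.

    Rarity is the negative log of the add-one smoothed relative frequency,
    so rarer tokens contribute larger values.
    """
    length = len(tokens)
    rarity = sum(-math.log((freq[t] + 1) / (total + 1)) for t in tokens)
    return length + rarity / max(length, 1)


def curriculum_order(sentences):
    """Return sentences sorted from lowest to highest toy complexity."""
    tokenized = [s.split() for s in sentences]
    freq = Counter(tok for toks in tokenized for tok in toks)
    total = sum(freq.values())
    scored = sorted(zip(sentences, tokenized),
                    key=lambda pair: toy_complexity(pair[1], freq, total))
    return [sentence for sentence, _ in scored]


corpus = ["the cat", "the cat sat on the mat", "the"]
ordered = curriculum_order(corpus)
print(ordered)
```

A training loop would then present batches in this order, starting with the lowest-scoring examples.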

pdf bib
The Dark Side of the Language: Pre-trained Transformers in the DarkNet
Leonardo Ranaldi | Aria Nourbakhsh | Elena Sofia Ruzzetti | Arianna Patrizi | Dario Onorati | Michele Mastromattei | Francesca Fallucchi | Fabio Massimo Zanzotto

Pre-trained Transformers are challenging human performance in many Natural Language Processing tasks. The massive datasets used for pre-training seem to be the key to their success on existing tasks. In this paper, we explore how a range of pre-trained natural language understanding models performs on definitely unseen sentences provided by classification tasks over a DarkNet corpus. Surprisingly, results show that syntactic and lexical neural networks perform on par with pre-trained Transformers even after fine-tuning. Only after what we call extreme domain adaptation, that is, retraining with the masked language model task on the entire novel corpus, do pre-trained Transformers reach their standard high results. This suggests that huge pre-training corpora may give Transformers unexpected help since they are exposed to many of the possible sentences.

pdf bib
PreCog: Exploring the Relation between Memorization and Performance in Pre-trained Language Models
Leonardo Ranaldi | Elena Sofia Ruzzetti | Fabio Massimo Zanzotto

Large Language Models (LLMs) are impressive machines with the ability to memorize, possibly generalized learning examples. We present here a small, focused contribution to the analysis of the interplay between memorization and performance of BERT in downstream tasks. We propose PreCog, a measure for evaluating memorization from pre-training, and we analyze its correlation with BERT’s performance. Our experiments show that highly memorized examples are better classified, suggesting memorization is an essential key to success for BERT.

pdf bib
Publish or Hold? Automatic Comment Moderation in Luxembourgish News Articles
Tharindu Ranasinghe | Alistair Plum | Christoph Purschke | Marcos Zampieri

Recently, the internet has emerged as the primary platform for accessing news. In the majority of these news platforms, the users now have the ability to post comments on news articles and engage in discussions on various social media. While these features promote healthy conversations among users, they also serve as a breeding ground for spreading fake news, toxic discussions and hate speech. Moderating or removing such content is paramount to avoid unwanted consequences for the readers. However, apart from a few notable exceptions, most research on automatic moderation of news article comments has dealt with English and other high resource languages. This leaves under-represented or low-resource languages at a loss. Addressing this gap, we perform the first large-scale qualitative analysis of more than one million Luxembourgish comments posted over the course of 14 years. We evaluate the performance of state-of-the-art transformer models in Luxembourgish news article comment moderation. Furthermore, we analyse how the language of Luxembourgish news article comments has changed over time. We observe that machine learning models trained on old comments do not perform well on recent data. The findings in this work will be beneficial in building news comment moderation systems for many low-resource languages.

pdf bib
Cross-Lingual Speaker Identification for Indian Languages
Amaan Rizvi | Anupam Jamatia | Dwijen Rudrapal | Kunal Chakma | Björn Gambäck

The paper introduces a cross-lingual speaker identification system for Indian languages, utilising a Long Short-Term Memory dense neural network (LSTM-DNN). The system was trained on audio recordings in English and evaluated on data from Hindi, Kannada, Malayalam, Tamil, and Telugu, with a view to how factors such as phonetic similarity and native accent affect performance. The model was fed with MFCC (mel-frequency cepstral coefficient) features extracted from the audio file. For comparison, the corresponding mel-spectrogram images were also used as input to a ResNet-50 model, while the raw audio was used to train a Siamese network. The LSTM-DNN model outperformed the other two models as well as two more traditional baseline speaker identification models, showing that deep learning models are superior to probabilistic models for capturing low-level speech features and learning speaker characteristics.

pdf bib
ChemXtract: A System for Extraction of Chemical Events from Patent Documents
Pattabhi RK Rao | Sobha Lalitha Devi

The main goal of ChemXtract is to extract chemical events from patent documents. Event extraction requires that we first identify the names of chemical compounds involved in the events. Thus, in this work two extractions are performed: (a) names of chemical compounds and (b) events that identify the specific involvement of the chemical compounds in a chemical reaction. Extraction of the essential elements of a chemical reaction, generally known as Named Entity Recognition (NER), extracts the compounds, conditions and yields, together with their specific role in the reaction, and assigns each a label according to the role it plays within the chemical reaction. Event extraction, in turn, identifies the chemical event relations between the identified chemical compounds. In this work we use Neural Conditional Random Fields (NCRF), which combine the power of artificial neural networks (ANN) and CRFs. Different levels of features, including linguistic, orthographical and lexical clues, are used. The results obtained are encouraging.

pdf bib
Mind the User! Measures to More Accurately Evaluate the Practical Value of Active Learning Strategies
Julia Romberg

One solution to limited annotation budgets is active learning (AL), a collaborative process of human and machine to strategically select a small but informative set of examples. While current measures optimize AL from a pure machine learning perspective, we argue that for a successful transfer into practice, additional criteria must target the second pillar of AL, the human annotator. In text classification, e.g., where practitioners regularly encounter datasets with an increased number of imbalanced classes, measures like F1 fall short when finding all classes or identifying rare cases is required. We therefore introduce four measures that reflect class-related demands that users place on data acquisition. In a comprehensive comparison of uncertainty-based, diversity-based, and hybrid query strategies on six different datasets, we find that strong F1 performance is not necessarily associated with full class coverage. Uncertainty sampling outperforms diversity sampling in selecting minority classes and covering classes more efficiently, while diversity sampling excels in selecting less monotonous batches. Our empirical findings emphasize that a holistic view is essential when evaluating AL approaches to ensure their usefulness in practice - the actual, but often overlooked, goal of development. To this end, standard measures for assessing the performance of text classification need to be complemented by measures that more appropriately reflect user needs.

pdf bib
Event Annotation and Detection in Kannada-English Code-Mixed Social Media Data
Sumukh S | Abhinav Appidi | Manish Shrivastava

Code-mixing (CM) is a frequently observed phenomenon on social media platforms in multilingual societies such as India. While the increase in code-mixed content on these platforms provides a good amount of data for studying various aspects of code-mixing, the lack of automated text analysis tools makes such studies difficult. To overcome this, tools such as language identifiers, Parts-of-Speech (POS) taggers and Named Entity Recognition (NER) systems for analysing code-mixed data have been developed. One such important tool is Event Detection, an information retrieval task which can be used to identify critical facts occurring in the vast streams of unstructured text data available. While event detection from text is a hard problem on its own, social media data adds to it with its informal nature, and code-mixed (Kannada-English) data further complicates the problem due to its word-level mixing, lack of structure and incomplete information. In this work, we address this problem. We propose guidelines for the annotation of events in Kannada-English CM data and provide some baselines for event detection with careful feature selection.

pdf bib
Three Approaches to Client Email Topic Classification
Branislava Šandrih Todorović | Katarina Josipović | Jurij Kodre

This paper describes a use case that was implemented and is currently running in production at Nova Ljubljanska Banka, which involves classifying incoming client emails in the Slovenian language according to their topics and priorities. Since the proposed approach relies only on a Named Entity Recogniser (NER) of personal names as a language-dependent resource (for the purpose of anonymisation), this is the only prerequisite for applying the approach to any other language.

pdf bib
Exploring Abstractive Text Summarisation for Podcasts: A Comparative Study of BART and T5 Models
Parth Saxena | Mo El-Haj

Podcasts have become increasingly popular in recent years, resulting in a massive amount of audio content being produced every day. Efficient summarisation of podcast episodes can enable better content management and discovery for users. In this paper, we explore the use of abstractive text summarisation methods to generate high-quality summaries of podcast episodes. We fine-tune two pre-trained models, BART and T5, on a dataset of Spotify’s 100K podcasts. We evaluate our models using automated metrics and human evaluation, and find that the BART model fine-tuned on the podcast dataset achieved higher ROUGE-1 and ROUGE-L scores compared to other models, while the T5 model performed better in terms of semantic meaning. The human evaluation indicates that both models produced high-quality summaries that were well received by participants. Our study demonstrates the effectiveness of abstractive summarisation methods for podcast episodes and offers insights for improving the summarisation of audio content.

pdf bib
Exploring the Landscape of Natural Language Processing Research
Tim Schopf | Karim Arabi | Florian Matthes

As an efficient approach to understand, generate, and process natural language texts, research in natural language processing (NLP) has exhibited a rapid spread and wide adoption in recent years. Given the increasing research work in this area, several NLP-related approaches have been surveyed in the research community. However, a comprehensive study that categorizes established topics, identifies trends, and outlines areas for future research remains absent. Contributing to closing this gap, we have systematically classified and analyzed research papers in the ACL Anthology. As a result, we present a structured overview of the research landscape, provide a taxonomy of fields of study in NLP, analyze recent developments in NLP, summarize our findings, and highlight directions for future work.

pdf bib
Efficient Domain Adaptation of Sentence Embeddings Using Adapters
Tim Schopf | Dennis N. Schneider | Florian Matthes

Sentence embeddings enable us to capture the semantic similarity of short texts. Most sentence embedding models are trained for general semantic textual similarity tasks. Therefore, to use sentence embeddings in a particular domain, the model must be adapted to it in order to achieve good results. Usually, this is done by fine-tuning the entire sentence embedding model for the domain of interest. While this approach yields state-of-the-art results, all of the model’s weights are updated during fine-tuning, making this method resource-intensive. Therefore, instead of fine-tuning entire sentence embedding models for each target domain individually, we propose to train lightweight adapters. These domain-specific adapters do not require fine-tuning all underlying sentence embedding model parameters. Instead, we only train a small number of additional parameters while keeping the weights of the underlying sentence embedding model fixed. Training domain-specific adapters allows always using the same base model and only exchanging the domain-specific adapters to adapt sentence embeddings to a specific domain. We show that using adapters for parameter-efficient domain adaptation of sentence embeddings yields competitive performance within 1% of a domain-adapted, entirely fine-tuned sentence embedding model while only training approximately 3.6% of the parameters.
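
The parameter savings described in this abstract can be illustrated with simple arithmetic. The sketch below assumes a bottleneck adapter (a down-projection followed by an up-projection, each with a bias); the abstract does not specify the adapter architecture, so this layout, the function names, and all numbers in the example are assumptions rather than the paper's configuration. The fraction is computed relative to base plus adapter parameters, which is one of several possible conventions.

```python
def bottleneck_adapter_params(hidden_size, bottleneck_size):
    """Parameter count of one assumed bottleneck adapter.

    Down-projection: hidden_size x bottleneck_size weights plus bias.
    Up-projection:   bottleneck_size x hidden_size weights plus bias.
    """
    down = hidden_size * bottleneck_size + bottleneck_size
    up = bottleneck_size * hidden_size + hidden_size
    return down + up


def trainable_fraction(base_params, hidden_size, bottleneck_size, num_adapters):
    """Share of all parameters that are trained when the base model is frozen."""
    adapter_params = num_adapters * bottleneck_adapter_params(
        hidden_size, bottleneck_size)
    return adapter_params / (base_params + adapter_params)


# Hypothetical numbers: a 110M-parameter base model with hidden size 768
# and one adapter with bottleneck size 48 in each of 12 layers.
fraction = trainable_fraction(110_000_000, 768, 48, 12)
print(f"{fraction:.2%} of parameters are trainable")
```

Because the base weights stay fixed, only the adapter parameters need to be stored per domain, which is what makes exchanging adapters on a shared base model inexpensive.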

pdf bib
AspectCSE: Sentence Embeddings for Aspect-Based Semantic Textual Similarity Using Contrastive Learning and Structured Knowledge
Tim Schopf | Emanuel Gerber | Malte Ostendorff | Florian Matthes

Generic sentence embeddings provide a coarse-grained approximation of semantic textual similarity, but ignore specific aspects that make texts similar. Conversely, aspect-based sentence embeddings provide similarities between texts based on certain predefined aspects. Thus, similarity predictions of texts are more targeted to specific requirements and more easily explainable. In this paper, we present AspectCSE, an approach for aspect-based contrastive learning of sentence embeddings. Results indicate that AspectCSE achieves an average improvement of 3.97% on information retrieval tasks across multiple aspects compared to the previous best results. We also propose the use of Wikidata knowledge graph properties to train models of multi-aspect sentence embeddings in which multiple specific aspects are simultaneously considered during similarity predictions. We demonstrate that multi-aspect embeddings outperform even single-aspect embeddings on aspect-specific information retrieval tasks. Finally, we examine the aspect-based sentence embedding space and demonstrate that embeddings of semantically similar aspect labels are often close, even without explicit similarity training between different aspect labels.

pdf bib
Tackling the Myriads of Collusion Scams on YouTube Comments of Cryptocurrency Videos
Sadat Shahriar | Arjun Mukherjee

Despite repeated countermeasures, YouTube’s comment section has been a fertile ground for scammers. With the growth of the cryptocurrency market and the obscurity around it, a new form of scam, namely the “Collusion Scam”, has emerged as a dominant force within YouTube’s comment space. Unlike typical scams and spam, collusion scams employ a cunning persuasion strategy, using the facade of genuine social interactions within comment threads to create an aura of trust and success to entrap innocent users. In this research, we collect 1,174 such collusion scam threads and perform a detailed analysis, which is tailored towards the successful detection of these scams. We find that utilization of the collusion dynamics can provide an accuracy of 96.67% and an F1-score of 93.04%. Furthermore, we demonstrate the robust predictive power of metadata associated with these threads and user channels, which act as compelling indicators of collusion scams. Finally, we show that modern LLMs, like ChatGPT, can effectively detect collusion scams without the need for any training.

pdf bib
Exploring Deceptive Domain Transfer Strategies: Mitigating the Differences among Deceptive Domains
Sadat Shahriar | Arjun Mukherjee | Omprakash Gnawali

Deceptive text poses a significant threat to users, resulting in widespread misinformation and disorder. While researchers have created numerous cutting-edge techniques for detecting deception in domain-specific settings, whether there is a generic deception pattern so that deception-related knowledge in one domain can be transferred to another remains mostly unexplored. Moreover, the disparities in textual expression across these many media pose an additional obstacle for generalization. To this end, we present a Multi-Task Learning (MTL)-based deception generalization strategy to reduce the domain-specific noise and facilitate a better understanding of deception via generalized training. As deceptive domains, we use News (fake news), Tweets (rumors), and Reviews (fake reviews) and employ LSTM and BERT models to incorporate domain transfer techniques. Our proposed architecture for the combined approach of domain-independent and domain-specific training improves the deception detection performance by up to 5.28% in F1-score.

pdf bib
Party Extraction from Legal Contract Using Contextualized Span Representations of Parties
Sanjeepan Sivapiran | Charangan Vasantharajan | Uthayasanker Thayasivam

Extracting legal entities from legal documents, particularly legal parties in contract documents, poses a significant challenge for legal assistive software. Many existing party extraction systems tend to generate numerous false positives due to the complex structure of the legal text. In this study, we present a novel and accurate method for extracting parties from legal contract documents by leveraging contextual span representations. To facilitate our approach, we have curated a large-scale dataset comprising 1000 contract documents with party annotations. Our method incorporates several enhancements to the SQuAD 2.0 question-answering system, specifically tailored to handle the intricate nature of the legal text. These enhancements include modifications to the activation function, an increased number of encoder layers, and the addition of normalization and dropout layers stacked on top of the output encoder layer. Baseline experiments reveal that our model, fine-tuned on our dataset, outperforms the current state-of-the-art model. Furthermore, we explore various combinations of the aforementioned techniques to further enhance the accuracy of our method. By employing a hybrid approach that combines 24 encoder layers with normalization and dropout layers, we achieve the best results, exhibiting an exact match score of 0.942 (+6.2% improvement).

pdf bib
From Fake to Hyperpartisan News Detection Using Domain Adaptation
Răzvan-Alexandru Smădu | Sebastian-Vasile Echim | Dumitru-Clementin Cercel | Iuliana Marin | Florin Pop

Unsupervised Domain Adaptation (UDA) is a popular technique that aims to reduce the domain shift between two data distributions. It was successfully applied in computer vision and natural language processing. In the current work, we explore the effects of various unsupervised domain adaptation techniques between two text classification tasks: fake and hyperpartisan news detection. We investigate the knowledge transfer from fake to hyperpartisan news detection without involving target labels during training. Thus, we evaluate UDA, cluster alignment with a teacher, and cross-domain contrastive learning. Extensive experiments show that these techniques improve performance, while including data augmentation further enhances the results. In addition, we combine clustering and topic modeling algorithms with UDA, resulting in improved performances compared to the initial UDA setup.

pdf bib
Prompt-Based Approach for Czech Sentiment Analysis
Jakub Šmíd | Pavel Přibáň

This paper introduces the first prompt-based methods for aspect-based sentiment analysis and sentiment classification in Czech. We employ the sequence-to-sequence models to solve the aspect-based tasks simultaneously and demonstrate the superiority of our prompt-based approach over traditional fine-tuning. In addition, we conduct zero-shot and few-shot learning experiments for sentiment classification and show that prompting yields significantly better results with limited training examples compared to traditional fine-tuning. We also demonstrate that pre-training on data from the target domain can lead to significant improvements in a zero-shot scenario.

pdf bib
Measuring Gender Bias in Natural Language Processing: Incorporating Gender-Neutral Linguistic Forms for Non-Binary Gender Identities in Abusive Speech Detection
Nasim Sobhani | Kinshuk Sengupta | Sarah Jane Delany

Predictions from machine learning models can reflect bias in the data on which they are trained. Gender bias has been shown to be prevalent in natural language processing models. The research into identifying and mitigating gender bias in these models predominantly considers gender as binary, male and female, neglecting the fluidity and continuity of gender as a variable. In this paper, we present an approach to evaluate gender bias in a prediction task, which recognises the non-binary nature of gender. We gender-neutralise a random subset of existing real-world hate speech data. We extend the existing template approach for measuring gender bias to include test examples that are gender-neutral. Measuring the bias across a selection of hate speech datasets we show that the bias for the gender-neutral data is closer to that seen for test instances that identify as male than those that identify as female.

pdf bib
LeSS: A Computationally-Light Lexical Simplifier for Spanish
Sanja Stajner | Daniel Ibanez | Horacio Saggion

Because they know only basic vocabulary, many people cannot understand up-to-date written information and thus cannot make informed decisions or fully participate in society. We propose LeSS, a modular lexical simplification architecture that outperforms state-of-the-art lexical simplification systems for Spanish. In addition to its state-of-the-art performance, LeSS is computationally light, using much less disk space, CPU and GPU, and having faster loading and execution time than the transformer-based lexical simplification models which are predominant in the field.

pdf bib
Hindi to Dravidian Language Neural Machine Translation Systems
Vijay Sundar Ram | Sobha Lalitha Devi

Neural machine translation (NMT) has achieved state-of-the-art performance in high-resource language pairs, but the performance of NMT drops in low-resource conditions. Morphologically rich languages are yet another challenge in NMT. The common strategy to handle this issue is to apply sub-word segmentation. In this work, we compare morphologically inspired segmentation methods against Byte Pair Encoding (BPE) in processing the input for building NMT systems for Hindi to Malayalam and Hindi to Tamil, where Hindi is an Indo-Aryan language and Malayalam and Tamil are south Dravidian languages. These two languages are low-resource, morphologically rich and agglutinative. Malayalam is more agglutinative than Tamil. We show that for both language pairs, the morphological segmentation algorithm outperforms BPE. We also present an elaborate analysis of translation outputs from both NMT systems.
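
For readers unfamiliar with the BPE baseline mentioned above, the following is a minimal sketch of the commonly described BPE merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. The toy word frequencies and the function names are hypothetical, and the sketch does not reproduce the morphological segmenters or the exact BPE setup used in the paper.

```python
from collections import Counter


def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a word-frequency dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Hypothetical toy corpus (word -> frequency).
toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy, 3)
print(merges)
print(vocab)
```

Because the merges are driven purely by frequency, the resulting sub-words need not align with morpheme boundaries, which is the contrast with morphologically inspired segmentation.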

pdf bib
Looking for Traces of Textual Deepfakes in Bulgarian on Social Media
Irina Temnikova | Iva Marinova | Silvia Gargova | Ruslana Margova | Ivan Koychev

Textual deepfakes can cause harm, especially on social media. At the moment, there are models trained to detect deepfake messages mainly for the English language, but no research or datasets currently exist for detecting them in most low-resource languages, such as Bulgarian. To address this gap, we explore three approaches. First, we machine translate an English-language social media dataset with bot messages into Bulgarian. However, the translation quality is unsatisfactory, leading us to create a new Bulgarian-language dataset with real social media messages and those generated by two language models (a new Bulgarian GPT-2 model – GPT-WEB-BG, and ChatGPT). We machine translate it into English and test existing English GPT-2 and ChatGPT detectors on it, achieving only 0.44-0.51 accuracy. Next, we train our own classifiers on the Bulgarian dataset, obtaining an accuracy of 0.97. Additionally, we apply the classifier with the highest results to a recently released Bulgarian social media dataset with manually fact-checked messages, which successfully identifies some of the messages as generated by Language Models (LM). Our results show that the use of machine translation is not suitable for textual deepfakes detection. We conclude that combining LM text detection with fact-checking is the most appropriate method for this task, and that identifying Bulgarian textual deepfakes is indeed possible.

pdf bib
Propaganda Detection in Russian Telegram Posts in the Scope of the Russian Invasion of Ukraine
Natalia Vanetik | Marina Litvak | Egor Reviakin | Margarita Tiamanova

The emergence of social media has made it more difficult to recognize and analyze misinformation efforts. The popular messaging software Telegram has developed into a medium for disseminating political messages and misinformation, particularly in light of the conflict in Ukraine. In this paper, we introduce a sizable corpus of Telegram posts containing pro-Russian propaganda and benign political texts. We evaluate the corpus by applying natural language processing (NLP) techniques to the task of text classification in this corpus. Our findings indicate that, with an overall accuracy of over 96% for sources confirmed as propagandist or opposition and 92% for unconfirmed sources, our method can successfully identify and categorize pro-Russian propaganda posts. We highlight the consequences of our research for comprehending political communications and propaganda on social media.

pdf bib
Auto-Encoding Questions with Retrieval Augmented Decoding for Unsupervised Passage Retrieval and Zero-Shot Question Generation
Stalin Varanasi | Muhammad Umer Tariq Butt | Guenter Neumann

Dense passage retrieval models have become state-of-the-art for information retrieval on many Open-domain Question Answering (ODQA) datasets. However, most of these models rely on supervision obtained from the ODQA datasets, which hinders their performance in a low-resource setting. Recently, retrieval-augmented language models have been proposed to improve both zero-shot and supervised information retrieval. However, these models have pre-training tasks that are agnostic to the target task of passage retrieval. In this work, we propose Retrieval Augmented Auto-encoding of Questions for zero-shot dense information retrieval. Unlike other pre-training methods, our pre-training method is built for target information retrieval, thereby making the pre-training more efficient. Our method consists of a dense IR model for encoding questions and retrieving documents during training and a conditional language model that maximizes the question’s likelihood by marginalizing over retrieved documents. As a by-product, we can use this conditional language model for zero-shot question generation from documents. We show that the IR model obtained through our method improves the current state-of-the-art of zero-shot dense information retrieval, and we improve the results even further by training on a synthetic corpus created by zero-shot question generation.
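
The phrase "marginalizing over retrieved documents" can be made concrete with a small numerical sketch. It assumes, as one common formulation in retrieval-augmented models, that the retriever's scores are turned into a distribution p(z | q) with a softmax over the retrieved documents and that the decoder supplies log p(q | z) for each document; whether the paper uses exactly this form is an assumption, and all names and numbers below are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of retrieval scores."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def marginal_log_likelihood(retrieval_scores, question_log_probs):
    """Compute log sum_z p(z | q) * p(q | z) over the retrieved documents z."""
    log_prior = log_softmax(retrieval_scores)
    joint = [lp + lq for lp, lq in zip(log_prior, question_log_probs)]
    peak = max(joint)
    # Log-sum-exp keeps the sum stable for very negative log-probabilities.
    return peak + math.log(sum(math.exp(j - peak) for j in joint))


# Hypothetical values for three retrieved documents.
scores = [2.0, 1.0, 0.1]               # retriever similarity scores
log_p_q_given_z = [-4.0, -6.5, -9.0]   # decoder log-likelihood of the question
loss = -marginal_log_likelihood(scores, log_p_q_given_z)
print(loss)
```

Minimizing this loss rewards the retriever for scoring highly those documents from which the question is easy to reconstruct.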

pdf bib
NoHateBrazil: A Brazilian Portuguese Text Offensiveness Analysis System
Francielle Vargas | Isabelle Carvalho | Wolfgang Schmeisser-Nieto | Fabrício Benevenuto | Thiago Pardo

Hate speech is a highly relevant problem in Brazil. Nevertheless, its regulation is not effective due to the difficulty of identifying, quantifying, and classifying offensive comments. Here, we introduce a novel system for offensive comment analysis in Brazilian Portuguese. The system, titled “NoHateBrazil”, recognizes explicit and implicit offensiveness in context at a fine-grained level. Specifically, we propose a framework for data collection, human annotation and machine learning models that were used to build the system. In addition, we assess the potential of our system to reflect stereotypical beliefs against marginalized groups by contrasting them with counter-stereotypes. As a result, a friendly web application was implemented, which, besides presenting relevant performance, showed promising results towards mitigating the risk of reinforcing social stereotypes. Lastly, new measures were proposed to improve the explainability of offensiveness classification and the reliability of the model’s predictions.

pdf bib
Socially Responsible Hate Speech Detection: Can Classifiers Reflect Social Stereotypes?
Francielle Vargas | Isabelle Carvalho | Ali Hürriyetoğlu | Thiago Pardo | Fabrício Benevenuto

Recent studies have shown that hate speech technologies may propagate social stereotypes against marginalized groups. Nevertheless, there has been a lack of realistic approaches to assess and mitigate biased technologies. In this paper, we introduce a new approach to analyze the potential of hate-speech classifiers to reflect social stereotypes through the investigation of stereotypical beliefs by contrasting them with counter-stereotypes. We empirically measure the distribution of stereotypical beliefs by analyzing the distinctive classification of tuples containing stereotypes versus counter-stereotypes in machine learning models and datasets. Experiment results show that hate speech classifiers attribute unreal or negligent offensiveness to social identity groups by reflecting and reinforcing stereotypical beliefs regarding minorities. Furthermore, we also found that models that embed expert and context information from offensiveness markers present promising results to mitigate social stereotype bias towards socially responsible hate speech detection.

pdf bib
Predicting Sentence-Level Factuality of News and Bias of Media Outlets
Francielle Vargas | Kokil Jaidka | Thiago Pardo | Fabrício Benevenuto

Automated news credibility and fact-checking at scale require accurate prediction of news factuality and media bias. This paper introduces a large sentence-level dataset, titled “FactNews”, composed of 6,191 sentences expertly annotated according to factuality and media bias definitions proposed by AllSides. We use FactNews to assess the overall reliability of news sources by formulating two text classification problems for predicting sentence-level factuality of news reporting and bias of media outlets. Our experiments demonstrate that biased sentences present a higher number of words compared to factual sentences, besides having a predominance of emotions. Hence, the fine-grained analysis of subjectivity and impartiality of news articles showed promising results for predicting the reliability of entire media outlets. Finally, due to the severity of fake news and political polarization in Brazil, and the lack of research for Portuguese, both dataset and baseline were proposed for Brazilian Portuguese.

pdf bib
Classification of US Supreme Court Cases Using BERT-Based Techniques
Shubham Vatsal | Adam Meyers | John E. Ortega

Models based on bidirectional encoder representations from transformers (BERT) produce state-of-the-art (SOTA) results on many natural language processing (NLP) tasks such as named entity recognition (NER) and part-of-speech (POS) tagging. An interesting phenomenon occurs when classifying long documents such as those from the US Supreme Court, where BERT-based models can be considered difficult to use on a first-pass or out-of-the-box basis. In this paper, we experiment with several BERT-based classification techniques for US Supreme Court decisions from the Supreme Court Database (SCDB) and compare them with the previous SOTA results. We then compare our results specifically with SOTA models for long documents. We compare our results for two classification tasks: (1) a broad classification task with 15 categories and (2) a fine-grained classification task with 279 categories. Our best result produces an accuracy of 80% on the 15 broad categories and 60% on the fine-grained 279 categories, which marks an improvement of 8% and 28% respectively over previously reported SOTA results.

pdf bib
Kāraka-Based Answer Retrieval for Question Answering in Indic Languages
Devika Verma | Ramprasad S. Joshi | Aiman A. Shivani | Rohan D. Gupta

Kārakas from ancient Paninian grammar form a concise set of semantic roles that capture crucial aspects of sentence meaning pivoted on the action verb. In this paper, we propose employing a kāraka-based approach for retrieving answers in Indic question-answering systems. To study and evaluate this novel approach, empirical experiments are conducted over large benchmark corpora in Hindi and Marathi. The results obtained demonstrate the effectiveness of the proposed method. Additionally, we explore the varying impact of two approaches for extracting kārakas. The literature surveyed and experiments conducted encourage hope that kāraka annotation can improve communication with machines using natural languages, particularly in low-resource languages.

pdf bib
Comparative Analysis of Named Entity Recognition in the Dungeons and Dragons Domain
Gayashan Weerasundara | Nisansa de Silva

Some Natural Language Processing (NLP) tasks that are considered sufficiently solved for general-domain English still struggle to attain the same level of performance in specific domains. Named Entity Recognition (NER), which aims to find and categorize entities in text, is one such task that has difficulty adapting to domain specificity. This paper compares the performance of 10 NER models on 7 adventure books from the Dungeons and Dragons (D&D) domain, which is a subdomain of fantasy literature. Fantasy literature, being rich and diverse in vocabulary, poses considerable challenges for conventional NER. In this study, we use open-source Large Language Models (LLMs) to annotate the named entities and character names in a number of official D&D books and evaluate the precision and distribution of each model. The paper aims to identify the challenges and opportunities for improving NER in fantasy literature. Our results show that even in the off-the-shelf configuration, Flair, Trankit, and spaCy achieve better results for identifying named entities in the D&D domain compared to their peers.

pdf bib
Comparative Analysis of Anomaly Detection Algorithms in Text Data
Yizhou Xu | Kata Gábor | Jérôme Milleret | Frédérique Segond

Text anomaly detection (TAD) is a crucial task that aims to identify texts that deviate significantly from the norm within a corpus. Despite its importance in various domains, TAD remains relatively underexplored in natural language processing. This article presents a systematic evaluation of 22 TAD algorithms on 17 corpora using multiple text representations, including monolingual and multilingual SBERT. The performance of the algorithms is compared based on three criteria: degree of supervision, theoretical basis, and architecture used. The results demonstrate that semi-supervised methods utilizing weak labels outperform both unsupervised methods and semi-supervised methods using only negative samples for training. Additionally, we explore the application of TAD techniques in hate speech detection. The results provide valuable insights for future TAD research and guide the selection of suitable algorithms for detecting text anomalies in different contexts.

pdf bib
Poetry Generation Combining Poetry Theme Labels Representations
Yingyu Yan | Dongzhen Wen | Liang Yang | Dongyu Zhang | Hongfei Lin

Ancient Chinese poetry is the earliest literary genre that took shape in Chinese literature and has a dissemination effect, showing China’s profound cultural heritage. At the same time, the generation of ancient poetry is an important task in the field of digital humanities, which is of great significance to the inheritance of national culture and the education of ancient poetry. The current work in the field of poetry generation is mainly aimed at improving the fluency and structural accuracy of words and sentences, ignoring the theme unity of poetry generation results. In order to solve this problem, this paper proposes a graph neural network poetry theme representation model based on label embedding. On the basis of the network representation of poetry, the topic feature representation of poetry is constructed and learned from the granularity of words. Then, the features of the poetry theme representation model are combined with the autoregressive language model to construct a theme-oriented ancient Chinese poetry generation model TLPG (Poetry Generation with Theme Label). Through machine evaluation and evaluation by experts in related fields, the model proposed in this paper has significantly improved the topic consistency of poetry generation compared with existing work on the premise of ensuring the fluency and format accuracy of poetry.

pdf bib
Evaluating Generative Models for Graph-to-Text Generation
Shuzhou Yuan | Michael Faerber

Large language models (LLMs) have been widely employed for graph-to-text generation tasks. However, the process of finetuning LLMs requires significant training resources and annotation work. In this paper, we explore the capability of generative models to generate descriptive text from graph data in a zero-shot setting. Specifically, we evaluate GPT-3 and ChatGPT on two graph-to-text datasets and compare their performance with that of finetuned LLM models such as T5 and BART. Our results demonstrate that generative models are capable of generating fluent and coherent text, achieving BLEU scores of 10.57 and 11.08 for the AGENDA and WebNLG datasets, respectively. However, our error analysis reveals that generative models still struggle with understanding the semantic relations between entities, and they also tend to generate text with hallucinations or irrelevant information. As a part of error analysis, we utilize BERT to detect machine-generated text and achieve high macro-F1 scores. We have made the text generated by generative models publicly available.

pdf bib
Microsyntactic Unit Detection Using Word Embedding Models: Experiments on Slavic Languages
Iuliia Zaitova | Irina Stenger | Tania Avgustinova

Microsyntactic units have been defined as language-specific transitional entities between lexicon and grammar, whose idiomatic properties are closely tied to syntax. These units are typically described based on individual constructions, making it difficult to understand them comprehensively as a class. This study proposes a novel approach to detect microsyntactic units using Word Embedding Models (WEMs) trained on six Slavic languages, namely Belarusian, Bulgarian, Czech, Polish, Russian, and Ukrainian, and evaluates how well these models capture the nuances of syntactic non-compositionality. To evaluate the models, we develop a cross-lingual inventory of microsyntactic units using the lists of microsyntactic units available at the Russian National Corpus. Our results demonstrate the effectiveness of WEMs in capturing microsyntactic units across all six Slavic languages under analysis. Additionally, we find that WEMs tailored for syntax-based tasks consistently outperform other WEMs at the task. Our findings contribute to the theory of microsyntax by providing insights into the detection of microsyntactic units and their cross-linguistic properties.

pdf bib
Systematic TextRank Optimization in Extractive Summarization
Morris Zieve | Anthony Gregor | Frederik Juul Stokbaek | Hunter Lewis | Ellis Marie Mendoza | Benyamin Ahmadnia

With the ever-growing amount of textual data, extractive summarization has become increasingly crucial for efficiently processing information. The TextRank algorithm, a popular unsupervised method, offers excellent potential for this task. In this paper, we aim to optimize the performance of TextRank by systematically exploring and verifying the best preprocessing and fine-tuning techniques. We extensively evaluate text preprocessing methods, such as tokenization, stemming, and stopword removal, to identify the most effective combination with TextRank. Additionally, we examine fine-tuning strategies, including parameter optimization and incorporation of domain-specific knowledge, to achieve superior summarization quality.
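
To make the setting concrete, here is a minimal sketch of TextRank-style extractive summarization: sentences are nodes, edge weights come from normalized word overlap, and a damped PageRank-like iteration scores each sentence. The tokenizer, the optional stopword set, the damping factor, and the iteration count are illustrative assumptions; they are exactly the kind of preprocessing and parameter choices the abstract describes tuning, and nothing here reflects the paper's final configuration.

```python
import math
import re


def tokenize(sentence, stopwords=frozenset()):
    """Lowercase, split on non-alphanumerics, and drop optional stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower())
            if w not in stopwords]


def overlap_similarity(a, b):
    """Shared words divided by the sum of log sentence lengths."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0:  # both sentences consist of a single token
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences, damping=0.85, iterations=50, stopwords=frozenset()):
    """Score sentences with a weighted, damped PageRank-style iteration."""
    tokens = [tokenize(s, stopwords) for s in sentences]
    n = len(sentences)
    weights = [[overlap_similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=1, **kwargs):
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank(sentences, **kwargs)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Hypothetical toy document.
sents = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "The mouse hid under the mat.",
    "Stocks fell sharply today.",
]
scores = textrank(sents)
print(summarize(sents, k=2))
```

Passing a stopword set such as `frozenset({"the", "on"})` changes the overlap graph and therefore the ranking, which is why preprocessing choices matter for this algorithm.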