Pranaydeep Singh - ACL Anthology

Pranaydeep Singh

2025

EnerGIZAr: Leveraging GIZA++ for Effective Tokenizer Initialization
Pranaydeep Singh | Eneko Agirre | Gorka Azkune | Orphee De Clercq | Els Lefever
Findings of the Association for Computational Linguistics: ACL 2025

Continual pre-training has long been considered the default strategy for adapting models to non-English languages, but struggles with initializing new embeddings, particularly for non-Latin scripts. In this work, we propose EnerGIZAr, a novel methodology that improves continual pre-training by leveraging statistical word alignment techniques. Our approach utilizes GIZA++ to construct a subword-level alignment matrix between source (English) and target language tokens. This matrix enables informed initialization of target tokenizer embeddings, which provides a more effective starting point for adaptation. We evaluate EnerGIZAr against state-of-the-art initialization strategies such as OFA and FOCUS across four typologically diverse languages: Hindi, Basque, Arabic and Korean. Experimental results on key NLP tasks – including POS tagging, Sentiment Analysis, NLI, and NER – demonstrate that EnerGIZAr achieves superior monolingual performance while also out-performing all methods for cross-lingual transfer when tested on XNLI. With EnerGIZAr, we propose an intuitive, explainable as well as state-of-the-art initialisation technique for continual pre-training of English models.

2024

Lemmatisation of Medieval Greek: Against the Limits of Transformer’s Capabilities?
Colin Swaelens | Pranaydeep Singh | Ilse de Vos | Els Lefever
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper presents preliminary experiments for the lemmatisation of unedited, Byzantine Greek epigrams. This type of Greek is quite different from its classical ancestor, mostly because of its orthographic inconsistencies. Existing lemmatisation algorithms display an accuracy drop of around 30pp when tested on these Byzantine book epigrams. We conducted seven different lemmatisation experiments, which were either transformer-based or based on neural edit-trees. The best performing lemmatiser was a hybrid method combining transformer-based embeddings with a dictionary look-up. We compare our results with existing lemmatisers, and provide a detailed error analysis revealing why unedited, Byzantine Greek is so challenging for lemmatisation.

Exploring Aspect-Based Sentiment Analysis Methodologies for Literary-Historical Research Purposes
Tess Dejaeghere | Pranaydeep Singh | Els Lefever | Julie Birkholz
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024

This study explores aspect-based sentiment analysis (ABSA) methodologies for literary-historical research, aiming to address the limitations of traditional sentiment analysis in understanding nuanced aspects of literature. It evaluates three ABSA toolchains: rule-based, machine learning-based (utilizing BERT and MacBERTh embeddings), and a prompt-based workflow with Mixtral 8x7B. Findings highlight challenges and potentials of ABSA for literary-historical analysis, emphasizing the need for context-aware annotation strategies and technical skills. The research contributes by curating a multilingual corpus of travelogues, publishing an annotated dataset for ABSA, creating openly available Jupyter Notebooks with Python code for each modeling approach, conducting pilot experiments on literary-historical texts, and proposing future endeavors to advance ABSA methodologies in this domain.

Findings of the WASSA 2024 EXALT shared task on Explainability for Cross-Lingual Emotion in Tweets
Aaron Maladry | Pranaydeep Singh | Els Lefever
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

This paper presents a detailed description and results of the first shared task on explainability for cross-lingual emotion in tweets. Given a tweet in one of the five target languages (Dutch, Russian, Spanish, English, and French), systems should predict the correct emotion label (Task 1), as well as the words triggering the predicted emotion label (Task 2). The tweets were collected based on a list of stop words to prevent topical or emotional bias and were subsequently manually annotated. For both tasks, only a training corpus for English was provided, obliging participating systems to design cross-lingual approaches. Our shared task received submissions from 14 teams for the emotion detection task and from 6 teams for the trigger word detection task. The highest macro F1-scores obtained for both tasks are respectively 0.629 and 0.616, demonstrating that cross-lingual emotion detection is still a challenging task.

2023

Too Many Cooks Spoil the Model: Are Bilingual Models for Slovene Better than a Large Multilingual Model?
Pranaydeep Singh | Aaron Maladry | Els Lefever
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)

This paper investigates whether adding data of typologically closer languages improves the performance of transformer-based models for three different downstream tasks, namely Part-of-Speech tagging, Named Entity Recognition, and Sentiment Analysis, compared to a monolingual and plain multilingual language model. For the presented pilot study, we performed experiments for the use case of Slovene, a low(er)-resourced language belonging to the Slavic language family. The experiments were carried out in a controlled setting, where a monolingual model for Slovene was compared to combined language models containing Slovene, trained with the same amount of Slovene data. The experimental results show that adding typologically closer languages indeed improves the performance of the Slovene language model, and even succeeds in outperforming the large multilingual XLM-RoBERTa model for NER and PoS-tagging. We also reveal that, contrary to intuition, distantly or unrelated languages also combine admirably with Slovene, often out-performing XLM-R as well. All the bilingual models used in the experiments are publicly available at https://github.com/pranaydeeps/BLAIR

Misery Loves Complexity: Exploring Linguistic Complexity in the Context of Emotion Detection
Pranaydeep Singh | Luna De Bruyne | Orphée De Clercq | Els Lefever
Findings of the Association for Computational Linguistics: EMNLP 2023

Given the omnipresence of social media in our society, thoughts and opinions are being shared online in an unprecedented manner. This means that both positive and negative emotions can be equally and freely expressed. However, the negativity bias posits that human beings are inherently drawn to and more moved by negativity and, as a consequence, negative emotions get more traffic. Correspondingly, when writing about emotions this negativity bias could lead to expressions of negative emotions that are linguistically more complex. In this paper, we attempt to use readability and linguistic complexity metrics to better understand the manifestation of emotions on social media platforms like Reddit based on the widely-used GoEmotions dataset. We demonstrate that according to most metrics, negative emotions indeed tend to generate more complex text than positive emotions. In addition, we examine whether a higher complexity hampers the automatic identification of emotions. To answer this question, we fine-tuned three state-of-the-art transformers (BERT, RoBERTa, and SpanBERT) on the same emotion detection dataset. We demonstrate that these models often fail to predict emotions for the more complex texts. More advanced LLMs like RoBERTa and SpanBERT also fail to improve by significant margins on complex samples. This calls for a more nuanced interpretation of the emotion detection performance of transformer models. We make the automatically annotated data available for further research at: https://huggingface.co/datasets/pranaydeeps/CAMEO

2022

When the Student Becomes the Master: Learning Better and Smaller Monolingual Models from mBERT
Pranaydeep Singh | Els Lefever
Proceedings of the 29th International Conference on Computational Linguistics

In this research, we present pilot experiments to distil monolingual models from a jointly trained model for 102 languages (mBERT). We demonstrate that it is possible for the target language to outperform the original model, even with a basic distillation setup. We evaluate our methodology for 6 languages with varying amounts of resources and language families.

Combining Language Models and Linguistic Information to Label Entities in Memes
Pranaydeep Singh | Aaron Maladry | Els Lefever
Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations

This paper describes the system we developed for the shared task ‘Hero, Villain and Victim: Dissecting harmful memes for Semantic role labelling of entities’ organised in the framework of the Second Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation (Constraint 2022). We present an ensemble approach combining transformer-based models and linguistic information, such as the presence of irony and implicit sentiment associated to the target named entities. The ensemble system obtains promising classification scores, resulting in a third place finish in the competition.

How Language-Dependent is Emotion Detection? Evidence from Multilingual BERT
Luna De Bruyne | Pranaydeep Singh | Orphee De Clercq | Els Lefever | Veronique Hoste
Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL)

As emotion analysis in text has gained a lot of attention in the field of natural language processing, differences in emotion expression across languages could have consequences for how emotion detection models work. We evaluate the language-dependence of an mBERT-based emotion detection model by comparing language identification performance before and after fine-tuning on emotion detection, and performing (adjusted) zero-shot experiments to assess whether emotion detection models rely on language-specific information. When dealing with typologically dissimilar languages, we found evidence for the language-dependence of emotion detection.

Investigating the Quality of Static Anchor Embeddings from Transformers for Under-Resourced Languages
Pranaydeep Singh | Orphee De Clercq | Els Lefever
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

This paper reports on experiments for cross-lingual transfer using the anchor-based approach of Schuster et al. (2019) for English and a low-resourced language, namely Hindi. For the sake of comparison, we also evaluate the approach on three very different higher-resourced languages, viz. Dutch, Russian and Chinese. Initially designed for ELMo embeddings, we analyze the approach for the more recent BERT family of transformers for a variety of tasks, both mono and cross-lingual. The results largely prove that like most other cross-lingual transfer approaches, the static anchor approach is underwhelming for the low-resource language, while performing adequately for the higher resourced ones. We attempt to provide insights into both the quality of the anchors, and the performance for low-shot cross-lingual transfer to better understand this performance gap. We make the extracted anchors and the modified train and test sets available for future research at https://github.com/pranaydeeps/Vyaapak

SentEMO: A Multilingual Adaptive Platform for Aspect-based Sentiment and Emotion Analysis
Ellen De Geyndt | Orphee De Clercq | Cynthia Van Hee | Els Lefever | Pranaydeep Singh | Olivier Parent | Veronique Hoste
Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis

In this paper, we present the SentEMO platform, a tool that provides aspect-based sentiment analysis and emotion detection of unstructured text data such as reviews, emails and customer care conversations. Currently, models have been trained for five domains and one general domain and are implemented in a pipeline approach, where the output of one model serves as the input for the next. The results are presented in three dashboards, allowing companies to gain more insights into what stakeholders think of their products and services. The SentEMO platform is available at https://sentemo.ugent.be

2021

A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek
Pranaydeep Singh | Gorik Rutten | Els Lefever
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper presents a pilot study to automatic linguistic preprocessing of Ancient and Byzantine Greek, and morphological analysis more specifically. To this end, a novel subword-based BERT language model was trained on the basis of a varied corpus of Modern, Ancient and Post-classical Greek texts. Consequently, the obtained BERT embeddings were incorporated to train a fine-grained Part-of-Speech tagger for Ancient and Byzantine Greek. In addition, a corpus of Greek Epigrams was manually annotated and the resulting gold standard was used to evaluate the performance of the morphological analyser on Byzantine Greek. The experimental results show very good perplexity scores (4.9) for the BERT language model and state-of-the-art performance for the fine-grained Part-of-Speech tagger for in-domain data (treebanks containing a mixture of Classical and Medieval Greek), as well as for the newly created Byzantine Greek gold standard data set. The language models and associated code are made available for use at https://github.com/pranaydeeps/Ancient-Greek-BERT

LT3 at SemEval-2021 Task 6: Using Multi-Modal Compact Bilinear Pooling to Combine Visual and Textual Understanding in Memes
Pranaydeep Singh | Els Lefever
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

Internet memes have become ubiquitous in social media networks today. Due to their popularity, they are also a widely used mode of expression to spread disinformation online. As memes consist of a mixture of text and image, they require a multi-modal approach for automatic analysis. In this paper, we describe our contribution to the SemEval-2021 Detection of Persuasian Techniques in Texts and Images Task. We propose a Multi-Modal learning system, which incorporates “memebeddings”, viz. joint text and vision features by combining them with compact bilinear pooling, to automatically identify rhetorical and psychological disinformation techniques. The experimental results show that the proposed system constantly outperforms the competition’s baseline, and achieves the 2nd best Macro F1-score and 14th best Micro F1-score out of all participants.

2020

Sentiment Analysis for Hinglish Code-mixed Tweets by means of Cross-lingual Word Embeddings
Pranaydeep Singh | Els Lefever
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

This paper investigates the use of unsupervised cross-lingual embeddings for solving the problem of code-mixed social media text understanding. We specifically investigate the use of these embeddings for a sentiment analysis task for Hinglish Tweets, viz. English combined with (transliterated) Hindi. In a first step, baseline models, initialized with monolingual embeddings obtained from large collections of tweets in English and code-mixed Hinglish, were trained. In a second step, two systems using cross-lingual embeddings were researched, being (1) a supervised classifier and (2) a transfer learning approach trained on English sentiment data and evaluated on code-mixed data. We demonstrate that incorporating cross-lingual embeddings improves the results (F1-score of 0.635 versus a monolingual baseline of 0.616), without any parallel data required to train the cross-lingual embeddings. In addition, the results show that the cross-lingual embeddings not only improve the results in a fully supervised setting, but they can also be used as a base for distant supervision, by training a sentiment model in one of the source languages and evaluating on the other language projected in the same space. The transfer learning experiments result in an F1-score of 0.556, which is almost on par with the supervised settings and speak to the robustness of the cross-lingual embeddings approach.

Identifying Cognates in English-Dutch and French-Dutch by means of Orthographic Information and Cross-lingual Word Embeddings
Els Lefever | Sofie Labat | Pranaydeep Singh
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper investigates the validity of combining more traditional orthographic information with cross-lingual word embeddings to identify cognate pairs in English-Dutch and French-Dutch. In a first step, lists of potential cognate pairs in English-Dutch and French-Dutch are manually labelled. The resulting gold standard is used to train and evaluate a multi-layer perceptron that can distinguish cognates from non-cognates. Fifteen orthographic features capture string similarities between source and target words, while the cosine similarity between their word embeddings represents the semantic relation between these words. By adding domain-specific information to pretrained fastText embeddings, we are able to obtain good embeddings for words that did not yet have a pretrained embedding (e.g. Dutch compound nouns). These embeddings are then aligned in a cross-lingual vector space by exploiting their structural similarity (cf. adversarial learning). Our results indicate that although the classifier already achieves good results on the basis of orthographic information, the performance further improves by including semantic information in the form of cross-lingual word embeddings.

LT3 at SemEval-2020 Task 8: Multi-Modal Multi-Task Learning for Memotion Analysis
Pranaydeep Singh | Nina Bauwelinck | Els Lefever
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Internet memes have become a very popular mode of expression on social media networks today. Their multi-modal nature, caused by a mixture of text and image, makes them a very challenging research object for automatic analysis. In this paper, we describe our contribution to the SemEval-2020 Memotion Analysis Task. We propose a Multi-Modal Multi-Task learning system, which incorporates “memebeddings”, viz. joint text and vision features, to learn and optimize for all three Memotion subtasks simultaneously. The experimental results show that the proposed system constantly outperforms the competition’s baseline, and the system setup with continual learning (where tasks are trained sequentially) obtains the best classification F1-scores.

LT3 at SemEval-2020 Task 9: Cross-lingual Embeddings for Sentiment Analysis of Hinglish Social Media Text
Pranaydeep Singh | Els Lefever
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes our contribution to the SemEval-2020 Task 9 on Sentiment Analysis for Code-mixed Social Media Text. We investigated two approaches to solve the task of Hinglish sentiment analysis. The first approach uses cross-lingual embeddings resulting from projecting Hinglish and pre-trained English FastText word embeddings in the same space. The second approach incorporates pre-trained English embeddings that are incrementally retrained with a set of Hinglish tweets. The results show that the second approach performs best, with an F1-score of 70.52% on the held-out test data.

Co-authors

Nina Bauwelinck 1

Julie Birkholz 1

Ellen De Geyndt 1

Tess Dejaeghere 1

Olivier Parent 1

Colin Swaelens 1

Cynthia Van Hee 1

Venues