Urja Khurana


pdf bib
Confidently Wrong: Exploring the Calibration and Expression of (Un)Certainty of Large Language Models in a Multilingual Setting
Lea Krause | Wondimagegnhue Tufa | Selene Baez Santamaria | Angel Daza | Urja Khurana | Piek Vossen
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)

While the fluency and coherence of Large Language Models (LLMs) in text generation have seen significant improvements, their competency in generating appropriate expressions of uncertainty remains limited.Using a multilingual closed-book QA task and GPT-3.5, we explore how well LLMs are calibrated and express certainty across a diverse set of languages, including low-resource settings. Our results reveal strong performance in high-resource languages but a marked decline in performance in lower-resource languages. Across all, we observe an exaggerated expression of confidence in the model, which does not align with the correctness or likelihood of its responses. Our findings highlight the need for further research into accurate calibration of LLMs especially in a multilingual setting.

pdf bib
Leveraging Few-Shot Data Augmentation and Waterfall Prompting for Response Generation
Lea Krause | Selene Báez Santamaría | Michiel van der Meer | Urja Khurana
Proceedings of The Eleventh Dialog System Technology Challenge

This paper discusses our approaches for task-oriented conversational modelling using subjective knowledge, with a particular emphasis on response generation. Our methodology was shaped by an extensive data analysis that evaluated key factors such as response length, sentiment, and dialogue acts present in the provided dataset. We used few-shot learning to augment the data with newly generated subjective knowledge items and present three approaches for DSTC11: (1) task-specific model exploration, (2) incorporation of the most frequent question into all generated responses, and (3) a waterfall prompting technique using a combination of both GPT-3 and ChatGPT.


pdf bib
Will It Blend? Mixing Training Paradigms & Prompting for Argument Quality Prediction
Michiel van der Meer | Myrthe Reuver | Urja Khurana | Lea Krause | Selene Baez Santamaria
Proceedings of the 9th Workshop on Argument Mining

This paper describes our contributions to the Shared Task of the 9th Workshop on Argument Mining (2022). Our approach uses Large Language Models for the task of Argument Quality Prediction. We perform prompt engineering using GPT-3, and also investigate the training paradigms multi-task learning, contrastive learning, and intermediate-task training. We find that a mixed prediction setup outperforms single models. Prompting GPT-3 works best for predicting argument validity, and argument novelty is best estimated by a model trained using all three training paradigms.

pdf bib
Hate Speech Criteria: A Modular Approach to Task-Specific Hate Speech Definitions
Urja Khurana | Ivar Vermeulen | Eric Nalisnick | Marloes Van Noorloos | Antske Fokkens
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

The subjectivity of automatic hate speech detection makes it a complex task, reflected in different and incomplete definitions in NLP. We present hate speech criteria, developed with insights from a law and social science expert, that help researchers create more explicit definitions and annotation guidelines on five aspects: (1) target groups and (2) dominance, (3) perpetrator characteristics, (4) explicit presence of negative interactions, and the (5) type of consequences/effects. Definitions can be structured so that they cover a more broad or more narrow phenomenon and conscious choices can be made on specifying criteria or leaving them open. We argue that the goal and exact task developers have in mind should determine how the scope of hate speech is defined. We provide an overview of the properties of datasets from hatespeechdata.com that may help select the most suitable dataset for a specific scenario.


pdf bib
How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task
Urja Khurana | Eric Nalisnick | Antske Fokkens
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

Despite their success, modern language models are fragile. Even small changes in their training pipeline can lead to unexpected results. We study this phenomenon by examining the robustness of ALBERT (Lan et al., 2020) in combination with Stochastic Weight Averaging (SWA)—a cheap way of ensembling—on a sentiment analysis task (SST-2). In particular, we analyze SWA’s stability via CheckList criteria (Ribeiro et al., 2020), examining the agreement on errors made by models differing only in their random seed. We hypothesize that SWA is more stable because it ensembles model snapshots taken along the gradient descent trajectory. We quantify stability by comparing the models’ mistakes with Fleiss’ Kappa (Fleiss, 1971) and overlap ratio scores. We find that SWA reduces error rates in general; yet the models still suffer from their own distinct biases (according to CheckList).