Tanvi Dinkar


2024

Re-examining Sexism and Misogyny Classification with Annotator Attitudes
Aiqi Jiang | Nikolas Vitsakis | Tanvi Dinkar | Gavin Abercrombie | Ioannis Konstas
Findings of the Association for Computational Linguistics: EMNLP 2024

Gender-Based Violence (GBV) is an increasing problem online, but existing datasets fail to capture the plurality of possible annotator perspectives or to ensure the representation of affected groups. We revisit two important stages in the moderation pipeline for GBV: (1) manual data labelling; and (2) automated classification. For (1), we examine two datasets to investigate the relationship between annotator identities and attitudes and the responses they give to two GBV labelling tasks. To this end, we collect demographic and attitudinal information from crowd-sourced annotators using three validated surveys from Social Psychology. We find that higher Right Wing Authoritarianism scores are associated with a higher propensity to label text as sexist, while higher Social Dominance Orientation and Neosexist Attitudes scores are associated with a lower propensity to do so. For (2), we conduct classification experiments using Large Language Models and five prompting strategies, including infusing prompts with annotator information. We find: (i) annotator attitudes affect the ability of classifiers to predict their labels; (ii) including attitudinal information can boost performance when we use brief, well-structured annotator descriptions; and (iii) models struggle to reflect the increased complexity and imbalanced classes of the new label sets.

Exploring Reproducibility of Human-Labelled Data for Code-Mixed Sentiment Analysis
Sachin Sasidharan Nair | Tanvi Dinkar | Gavin Abercrombie
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

Growing awareness of a ‘Reproducibility Crisis’ in natural language processing (NLP) has focused on human evaluations of generative systems. While labelling for supervised classification tasks makes up a large part of human input to systems, the reproduction of such efforts has thus far not been explored. In this paper, we re-implement a human data collection study for sentiment analysis of code-mixed Malayalam movie reviews, as well as automated classification experiments. We find that missing and under-specified information makes reproduction challenging, and we observe potentially consequential differences between the original labels and those we collect. Classification results indicate that the reliability of the labels is important for stable performance.

ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text
Tanvi Dinkar | Gavin Abercrombie | Verena Rieser
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

ReproHum is a large multi-institution project designed to examine the reproducibility of human evaluations of natural language processing. As part of the second phase of the project, we attempt to reproduce an evaluation of the fluency of continuations generated by a pre-trained language model compared to a range of baselines. Working within the constraints of the project, with limited information about the original study, and without access to their participant pool or the responses of individual participants, we find that we are not able to reproduce the original results. Our participants display a greater tendency to prefer one of the system responses, avoiding a judgement of ‘equal fluency’ more than in the original study. We also conduct further evaluations: we elicit ratings (1) from a broader range of participants; (2) from the same participants at different times; and (3) with an altered definition of fluency. Results of these experiments suggest that the original evaluation collected too few ratings, and that the task formulation may be quite ambiguous. Overall, although we were able to conduct a re-evaluation study, we conclude that the original evaluation was not comprehensive enough to make truly meaningful comparisons.

Proceedings of Safety4ConvAI: The Third Workshop on Safety for Conversational AI @ LREC-COLING 2024
Tanvi Dinkar | Giuseppe Attanasio | Amanda Cercas Curry | Ioannis Konstas | Dirk Hovy | Verena Rieser
Proceedings of Safety4ConvAI: The Third Workshop on Safety for Conversational AI @ LREC-COLING 2024

2023

iLab at SemEval-2023 Task 11 Le-Wi-Di: Modelling Disagreement or Modelling Perspectives?
Nikolas Vitsakis | Amit Parekh | Tanvi Dinkar | Gavin Abercrombie | Ioannis Konstas | Verena Rieser
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

There are two competing approaches for modelling annotator disagreement: distributional soft-labelling approaches (which aim to capture the level of disagreement) and approaches that model the perspectives of individual annotators or groups thereof. We adapt a multi-task architecture that has previously shown success in modelling perspectives to evaluate its performance on SemEval Task 11. We do so by combining both approaches, i.e. predicting individual annotator perspectives as an interim step towards predicting annotator disagreement. Despite its previous success, we find that the multi-task approach performs poorly on datasets which contain distinct annotator opinions, suggesting that this approach may not always be suitable when modelling perspectives. Furthermore, our results show that while strongly perspectivist approaches might not achieve state-of-the-art performance according to the evaluation metrics used by distributional approaches, our approach allows for a more nuanced understanding of the individual perspectives present in the data. We argue that perspectivist approaches are preferable because they enable decision makers to amplify minority views, and that it is important to re-evaluate metrics to reflect this goal.

FurChat: An Embodied Conversational Agent using LLMs, Combining Open and Closed-Domain Dialogue with Facial Expressions
Neeraj Cherakara | Finny Varghese | Sheena Shabana | Nivan Nelson | Abhiram Karukayil | Rohith Kulothungan | Mohammed Afil Farhan | Birthe Nesset | Meriam Moujahid | Tanvi Dinkar | Verena Rieser | Oliver Lemon
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

We demonstrate an embodied conversational agent that can function as a receptionist and generate a mixture of open and closed-domain dialogue along with facial expressions, using a large language model (LLM) to develop an engaging conversation. We deployed the system on a Furhat robot, which is highly expressive and capable of using both verbal and nonverbal cues during interaction. The system was designed specifically for the National Robotarium to interact with visitors through natural conversations, providing them with information about the facilities, research, news, upcoming events, etc. The system utilises the state-of-the-art GPT-3.5 model to generate such information along with domain-general conversations and facial expressions based on prompt engineering.

Mirages. On Anthropomorphism in Dialogue Systems
Gavin Abercrombie | Amanda Cercas Curry | Tanvi Dinkar | Verena Rieser | Zeerak Talat
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Automated dialogue or conversational systems are anthropomorphised by developers and personified by users. While a degree of anthropomorphism is inevitable, conscious and unconscious design choices can guide users to personify them to varying degrees. Encouraging users to relate to automated systems as if they were human can lead to transparency and trust issues, and to high-risk scenarios caused by over-reliance on their outputs. As a result, natural language processing researchers have investigated the factors that induce personification and developed resources to mitigate such effects. However, these efforts are fragmented, and many aspects of anthropomorphism have yet to be explored. In this paper, we discuss the linguistic factors that contribute to the anthropomorphism of dialogue systems and the harms that can arise from it, including the reinforcement of gender stereotypes and conceptions of acceptable language. We recommend that future efforts towards developing dialogue systems take particular care in their design, development, release, and description; and attend to the many linguistic cues that can elicit personification by users.

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Gavin Abercrombie | Jose M. Alonso-Moral | Mohammad Arvan | Anouck Braggaar | Mark Cieliebak | Elizabeth Clark | Kees van Deemter | Tanvi Dinkar | Ondřej Dušek | Steffen Eger | Qixiang Fang | Mingqi Gao | Albert Gatt | Dimitra Gkatzia | Javier González-Corbelle | Dirk Hovy | Manuela Hürlimann | Takumi Ito | John D. Kelleher | Filip Klubicka | Emiel Krahmer | Huiyuan Lai | Chris van der Lee | Yiru Li | Saad Mahamood | Margot Mieskes | Emiel van Miltenburg | Pablo Mosteiro | Malvina Nissim | Natalie Parde | Ondřej Plátek | Verena Rieser | Jie Ruan | Joel Tetreault | Antonio Toral | Xiaojun Wan | Leo Wanner | Lewis Watson | Diyi Yang
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction and (ii) enough obtainable information to be considered for reproduction, and that all but one of the experiments we selected for reproduction were discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP are not repeatable and/or not reproducible and/or too flawed to justify reproduction paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

Proceedings of the 19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems
Vojtech Hudecek | Patricia Schmidtova | Tanvi Dinkar | Javier Chiyah-Garcia | Weronika Sieinska
Proceedings of the 19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems

Safety and Robustness in Conversational AI
Tanvi Dinkar
Proceedings of the 19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems

In this position paper, I present the research interests of my PostDoc on safety and robustness specific to conversational AI, including the relevant overlap with my PhD.

2022

Fillers in Spoken Language Understanding: Computational and Psycholinguistic Perspectives
Tanvi Dinkar | Chloé Clavel | Ioana Vasilescu
Traitement Automatique des Langues, Volume 63, Numéro 3 : Etats de l'art en TAL [Review articles in NLP]

2021

From local hesitations to global impressions of a speaker’s feeling of knowing
Tanvi Dinkar | Beatrice Biancardi | Chloé Clavel
Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 2021)

2020

The importance of fillers for text representations of speech transcripts
Tanvi Dinkar | Pierre Colombo | Matthieu Labeau | Chloé Clavel
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

While being an essential component of spoken language, fillers (e.g. “um” or “uh”) often remain overlooked in Spoken Language Understanding (SLU) tasks. We explore the possibility of representing them with deep contextualised embeddings, showing improvements on modelling spoken language and two downstream tasks — predicting a speaker’s stance and expressed confidence.