Gavin Abercrombie


2023

pdf bib
Temporal and Second Language Influence on Intra-Annotator Agreement and Stability in Hate Speech Labelling
Gavin Abercrombie | Dirk Hovy | Vinodkumar Prabhakaran
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)

Much work in natural language processing (NLP) relies on human annotation. The majority of this implicitly assumes that annotator’s labels are temporally stable, although the reality is that human judgements are rarely consistent over time. As a subjective annotation task, hate speech labels depend on annotator’s emotional and moral reactions to the language used to convey the message. Studies in Cognitive Science reveal a ‘foreign language effect’, whereby people take differing moral positions and perceive offensive phrases to be weaker in their second languages. Does this affect annotations as well? We conduct an experiment to investigate the impacts of (1) time and (2) different language conditions (English and German) on measurements of intra-annotator agreement in a hate speech labelling task. While we do not observe the expected lower stability in the different language condition, we find that overall agreement is significantly lower than is implicitly assumed in annotation tasks, which has important implications for dataset reproducibility in NLP.

pdf bib
iLab at SemEval-2023 Task 11 Le-Wi-Di: Modelling Disagreement or Modelling Perspectives?
Nikolas Vitsakis | Amit Parekh | Tanvi Dinkar | Gavin Abercrombie | Ioannis Konstas | Verena Rieser
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

There are two competing approaches for modelling annotator disagreement: distributional soft-labelling approaches (which aim to capture the level of disagreement) or modelling perspectives of individual annotators or groups thereof. We adapt a multi-task architecture which has previously shown success in modelling perspectives to evaluate its performance on the SEMEVAL Task 11. We do so by combining both approaches, i.e. predicting individual annotator perspectives as an interim step towards predicting annotator disagreement. Despite its previous success, we found that a multi-task approach performed poorly on datasets which contained distinct annotator opinions, suggesting that this approach may not always be suitable when modelling perspectives. Furthermore, our results explain that while strongly perspectivist approaches might not achieve state-of-the-art performance according to evaluation metrics used by distributional approaches, our approach allows for a more nuanced understanding of individual perspectives present in the data. We argue that perspectivist approaches are preferable because they enable decision makers to amplify minority views, and that it is important to re-evaluate metrics to reflect this goal.

pdf bib
SemEval-2023 Task 11: Learning with Disagreements (LeWiDi)
Elisa Leonardelli | Gavin Abercrombie | Dina Almanea | Valerio Basile | Tommaso Fornaciari | Barbara Plank | Verena Rieser | Alexandra Uma | Massimo Poesio
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

NLP datasets annotated with human judgments are rife with disagreements between the judges. This is especially true for tasks depending on subjective judgments such as sentiment analysis or offensive language detection. Particularly in these latter cases, the NLP community has come to realize that the common approach of reconciling’ these different subjective interpretations risks misrepresenting the evidence. Many NLP researchers have therefore concluded that rather than eliminating disagreements from annotated corpora, we should preserve themindeed, some argue that corpora should aim to preserve all interpretations produced by annotators. But this approach to corpus creation for NLP has not yet been widely accepted. The objective of the Le-Wi-Di series of shared tasks is to promote this approach to developing NLP models by providing a unified framework for training and evaluating with such datasets. We report on the second such shared task, which differs from the first edition in three crucial respects: (i) it focuses entirely on NLP, instead of both NLP and computer vision tasks in its first edition; (ii) it focuses on subjective tasks, instead of covering different types of disagreements as training with aggregated labels for subjective NLP tasks is in effect a misrepresentation of the data; and (iii) for the evaluation, we concentrated on soft approaches to evaluation. This second edition of Le-Wi-Di attracted a wide array of partici- pants resulting in 13 shared task submission papers.

pdf bib
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Gavin Abercrombie | Jose M. Alonso-Moral | Mohammad Arvan | Anouck Braggaar | Mark Cieliebak | Elizabeth Clark | Kees van Deemter | Tanvi Dinkar | Ondřej Dušek | Steffen Eger | Qixiang Fang | Mingqi Gao | Albert Gatt | Dimitra Gkatzia | Javier González-Corbelle | Dirk Hovy | Manuela Hürlimann | Takumi Ito | John D. Kelleher | Filip Klubicka | Emiel Krahmer | Huiyuan Lai | Chris van der Lee | Yiru Li | Saad Mahamood | Margot Mieskes | Emiel van Miltenburg | Pablo Mosteiro | Malvina Nissim | Natalie Parde | Ondřej Plátek | Verena Rieser | Jie Ruan | Joel Tetreault | Antonio Toral | Xiaojun Wan | Leo Wanner | Lewis Watson | Diyi Yang
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

pdf bib
Resources for Automated Identification of Online Gender-Based Violence: A Systematic Review
Gavin Abercrombie | Aiqi Jiang | Poppy Gerrard-abbott | Ioannis Konstas | Verena Rieser
The 7th Workshop on Online Abuse and Harms (WOAH)

Online Gender-Based Violence (GBV), such as misogynistic abuse is an increasingly prevalent problem that technological approaches have struggled to address. Through the lens of the GBV framework, which is rooted in social science and policy, we systematically review 63 available resources for automated identification of such language. We find the datasets are limited in a number of important ways, such as their lack of theoretical grounding and stakeholder input, static nature, and focus on certain media platforms. Based on this review, we recommend development of future resources rooted in sociological expertise andcentering stakeholder voices, namely GBV experts and people with lived experience of GBV.

pdf bib
Mirages. On Anthropomorphism in Dialogue Systems
Gavin Abercrombie | Amanda Cercas Curry | Tanvi Dinkar | Verena Rieser | Zeerak Talat
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Automated dialogue or conversational systems are anthropomorphised by developers and personified by users. While a degree of anthropomorphism is inevitable, conscious and unconscious design choices can guide users to personify them to varying degrees. Encouraging users to relate to automated systems as if they were human can lead to transparency and trust issues, and high risk scenarios caused by over-reliance on their outputs. As a result, natural language processing researchers have investigated the factors that induce personification and develop resources to mitigate such effects. However, these efforts are fragmented, and many aspects of anthropomorphism have yet to be explored. In this paper, we discuss the linguistic factors that contribute to the anthropomorphism of dialogue systems and the harms that can arise thereof, including reinforcing gender stereotypes and conceptions of acceptable language. We recommend that future efforts towards developing dialogue systems take particular care in their design, development, release, and description; and attend to the many linguistic cues that can elicit personification by users.

2022

pdf bib
SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems
Emily Dinan | Gavin Abercrombie | A. Bergman | Shannon Spruit | Dirk Hovy | Y-Lan Boureau | Verena Rieser
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The social impact of natural language processing and its applications has received increasing attention. In this position paper, we focus on the problem of safety for end-to-end conversational AI. We survey the problem landscape therein, introducing a taxonomy of three observed phenomena: the Instigator, Yea-Sayer, and Impostor effects. We then empirically assess the extent to which current tools can measure these effects and current systems display them. We release these tools as part of a “first aid kit” (SafetyKit) to quickly assess apparent safety concerns. Our results show that, while current tools are able to provide an estimate of the relative safety of systems in various settings, they still have several shortcomings. We suggest several future directions and discuss ethical considerations.

pdf bib
Risk-graded Safety for Handling Medical Queries in Conversational AI
Gavin Abercrombie | Verena Rieser
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Conversational AI systems can engage in unsafe behaviour when handling users’ medical queries that may have severe consequences and could even lead to deaths. Systems therefore need to be capable of both recognising the seriousness of medical inputs and producing responses with appropriate levels of risk. We create a corpus of human written English language medical queries and the responses of different types of systems. We label these with both crowdsourced and expert annotations. While individual crowdworkers may be unreliable at grading the seriousness of the prompts, their aggregated labels tend to agree with professional opinion to a greater extent on identifying the medical queries and recognising the risk types posed by the responses. Results of classification experiments suggest that, while these tasks can be automated, caution should be exercised, as errors can potentially be very serious.

pdf bib
Guiding the Release of Safer E2E Conversational AI through Value Sensitive Design
A. Stevie Bergman | Gavin Abercrombie | Shannon Spruit | Dirk Hovy | Emily Dinan | Y-Lan Boureau | Verena Rieser
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Over the last several years, end-to-end neural conversational agents have vastly improved their ability to carry unrestricted, open-domain conversations with humans. However, these models are often trained on large datasets from the Internet and, as a result, may learn undesirable behaviours from this data, such as toxic or otherwise harmful language. Thus, researchers must wrestle with how and when to release these models. In this paper, we survey recent and related work to highlight tensions between values, potential positive impact, and potential harms. We also provide a framework to support practitioners in deciding whether and how to release these models, following the tenets of value-sensitive design.

pdf bib
Policy-focused Stance Detection in Parliamentary Debate Speeches
Gavin Abercrombie | Riza Batista-Navarro
Northern European Journal of Language Technology, Volume 8

Legislative debate transcripts provide citizens with information about the activities of their elected representatives, but are difficult for people to process. We propose the novel task of policy-focused stance detection, in which both the policy proposals under debate and the position of the speakers towards those proposals are identified. We adapt a previously existing dataset to include manual annotations of policy preferences, an established schema from political science. We evaluate a range of approaches to the automatic classification of policy preferences and speech sentiment polarity, including transformer-based text representations and a multi-task learning paradigm. We find that it is possible to identify the policies under discussion using features derived from the speeches, and that incorporating motion-dependent debate modelling, previously used to classify speech sentiment, also improves performance in the classification of policy preferences. We analyse the output of the best performing system, finding that discriminating features for the task are highly domain-specific, and that speeches that address policy preferences proposed by members of the same party can be among the most difficult to predict.

pdf bib
Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022
Gavin Abercrombie | Valerio Basile | Sara Tonelli | Verena Rieser | Alexandra Uma
Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022

2021

pdf bib
Alexa, Google, Siri: What are Your Pronouns? Gender and Anthropomorphism in the Design and Perception of Conversational Assistants
Gavin Abercrombie | Amanda Cercas Curry | Mugdha Pandya | Verena Rieser
Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing

Technology companies have produced varied responses to concerns about the effects of the design of their conversational AI systems. Some have claimed that their voice assistants are in fact not gendered or human-like—despite design features suggesting the contrary. We compare these claims to user perceptions by analysing the pronouns they use when referring to AI assistants. We also examine systems’ responses and the extent to which they generate output which is gendered and anthropomorphic. We find that, while some companies appear to be addressing the ethical concerns raised, in some cases, their claims do not seem to hold true. In particular, our results show that system outputs are ambiguous as to the humanness of the systems, and that users tend to personify and gender them as a result.

pdf bib
ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI
Amanda Cercas Curry | Gavin Abercrombie | Verena Rieser
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present the first English corpus study on abusive language towards three conversational AI systems gathered ‘in the wild’: an open-domain social bot, a rule-based chatbot, and a task-based system. To account for the complexity of the task, we take a more ‘nuanced’ approach where our ConvAI dataset reflects fine-grained notions of abuse, as well as views from multiple expert annotators. We find that the distribution of abuse is vastly different compared to other commonly used datasets, with more sexually tinted aggression towards the virtual persona of these systems. Finally, we report results from bench-marking existing models against this data. Unsurprisingly, we find that there is substantial room for improvement with F1 scores below 90%.

2020

pdf bib
ParlVote: A Corpus for Sentiment Analysis of Political Debates
Gavin Abercrombie | Riza Batista-Navarro
Proceedings of the Twelfth Language Resources and Evaluation Conference

Debate transcripts from the UK Parliament contain information about the positions taken by politicians towards important topics, but are difficult for people to process manually. While sentiment analysis of debate speeches could facilitate understanding of the speakers’ stated opinions, datasets currently available for this task are small when compared to the benchmark corpora in other domains. We present ParlVote, a new, larger corpus of parliamentary debate speeches for use in the evaluation of sentiment analysis systems for the political domain. We also perform a number of initial experiments on this dataset, testing a variety of approaches to the classification of sentiment polarity in debate speeches. These include a linear classifier as well as a neural network trained using a transformer word embedding model (BERT), and fine-tuned on the parliamentary speeches. We find that in many scenarios, a linear classifier trained on a bag-of-words text representation achieves the best results. However, with the largest dataset, the transformer-based model combined with a neural classifier provides the best performance. We suggest that further experimentation with classification models and observations of the debate content and structure are required, and that there remains much room for improvement in parliamentary sentiment analysis.

2019

pdf bib
Policy Preference Detection in Parliamentary Debate Motions
Gavin Abercrombie | Federico Nanni | Riza Batista-Navarro | Simone Paolo Ponzetto
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Debate motions (proposals) tabled in the UK Parliament contain information about the stated policy preferences of the Members of Parliament who propose them, and are key to the analysis of all subsequent speeches given in response to them. We attempt to automatically label debate motions with codes from a pre-existing coding scheme developed by political scientists for the annotation and analysis of political parties’ manifestos. We develop annotation guidelines for the task of applying these codes to debate motions at two levels of granularity and produce a dataset of manually labelled examples. We evaluate the annotation process and the reliability and utility of the labelling scheme, finding that inter-annotator agreement is comparable with that of other studies conducted on manifesto data. Moreover, we test a variety of ways of automatically labelling motions with the codes, ranging from similarity matching to neural classification methods, and evaluate them against the gold standard labels. From these experiments, we note that established supervised baselines are not always able to improve over simple lexical heuristics. At the same time, we detect a clear and evident benefit when employing BERT, a state-of-the-art deep language representation model, even in classification scenarios with over 30 different labels and limited amounts of training data.

pdf bib
Semantic Change in the Language of UK Parliamentary Debates
Gavin Abercrombie | Riza Batista-Navarro
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

We investigate changes in the meanings of words used in the UK Parliament across two different epochs. We use word embeddings to explore changes in the distribution of words of interest and uncover words that appear to have undergone semantic transformation in the intervening period, and explore different ways of obtaining target words for this purpose. We find that semantic changes are generally in line with those found in other corpora, and little evidence that parliamentary language is more static than general English. It also seems that words with senses that have been recorded in the dictionary as having fallen into disuse do not undergo semantic changes in this domain.

2018

pdf bib
Identifying Opinion-Topics and Polarity of Parliamentary Debate Motions
Gavin Abercrombie | Riza Theresa Batista-Navarro
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Analysis of the topics mentioned and opinions expressed in parliamentary debate motions–or proposals–is difficult for human readers, but necessary for understanding and automatic processing of the content of the subsequent speeches. We present a dataset of debate motions with pre-existing ‘policy’ labels, and investigate the utility of these labels for simultaneous topic and opinion polarity analysis. For topic detection, we apply one-versus-the-rest supervised topic classification, finding that good performance is achieved in predicting the policy topics, and that textual features derived from the debate titles associated with the motions are particularly indicative of motion topic. We then examine whether the output could also be used to determine the positions taken by proposers towards the different policies by investigating how well humans agree in interpreting the opinion polarities of the motions. Finding very high levels of agreement, we conclude that the policies used can be reliable labels for use in these tasks, and that successful topic detection can therefore provide opinion analysis of the motions ‘for free’.

pdf bib
‘Aye’ or ‘No’? Speech-level Sentiment Analysis of Hansard UK Parliamentary Debate Transcripts
Gavin Abercrombie | Riza Batista-Navarro
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

pdf bib
A Rule-based Shallow-transfer Machine Translation System for Scots and English
Gavin Abercrombie
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

An open-source rule-based machine translation system is developed for Scots, a low-resourced minor language closely related to English and spoken in Scotland and Ireland. By concentrating on translation for assimilation (gist comprehension) from Scots to English, it is proposed that the development of dictionaries designed to be used with in the Apertium platform will be sufficient to produce translations that improve non-Scots speakers understanding of the language. Mono- and bilingual Scots dictionaries are constructed using lexical items gathered from a variety of resources across several domains. Although the primary goal of this project is translation for gisting, the system is evaluated for both assimilation and dissemination (publication-ready translations). A variety of evaluation methods are used, including a cloze test undertaken by human volunteers. While evaluation results are comparable to, and in some cases superior to, those of other language pairs within the Apertium platform, room for improvement is identified in several areas of the system.

pdf bib
Putting Sarcasm Detection into Context: The Effects of Class Imbalance and Manual Labelling on Supervised Machine Classification of Twitter Conversations
Gavin Abercrombie | Dirk Hovy
Proceedings of the ACL 2016 Student Research Workshop