2022
pdf
bib
abs
Uncertainty and Inclusivity in Gender Bias Annotation: An Annotation Taxonomy and Annotated Datasets of British English Text
Lucy Havens
|
Melissa Terras
|
Benjamin Bach
|
Beatrice Alex
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
Mitigating harms from gender biased language in Natural Language Processing (NLP) systems remains a challenge, and the situated nature of language means bias is inescapable in NLP data. Though efforts to mitigate gender bias in NLP are numerous, they often vaguely define gender and bias, only consider two genders, and do not incorporate uncertainty into models. To address these limitations, in this paper we present a taxonomy of gender biased language and apply it to create annotated datasets. We created the taxonomy and annotated data with the aim of making gender bias in language transparent. If biases are communicated clearly, varieties of biased language can be better identified and measured. Our taxonomy contains eleven types of gender biases inclusive of people whose gender expressions do not fit into the binary conceptions of woman and man, and whose gender differs from that they were assigned at birth, while also allowing annotators to document unknown gender information. The taxonomy and annotated data will, in future work, underpin analysis and more equitable language model development.
pdf
bib
abs
Beyond Explanation: A Case for Exploratory Text Visualizations of Non-Aggregated, Annotated Datasets
Lucy Havens
|
Benjamin Bach
|
Melissa Terras
|
Beatrice Alex
Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022
This paper presents an overview of text visualization techniques relevant for data perspectivism, aiming to facilitate analysis of annotated datasets for the datasets’ creators and stakeholders. Data perspectivism advocates for publishing non-aggregated, annotated text data, recognizing that for highly subjective tasks, such as bias detection and hate speech detection, disagreements among annotators may indicate conflicting yet equally valid interpretations of a text. While the publication of non-aggregated, annotated data makes different interpretations of text corpora available, barriers still exist to investigating patterns and outliers in annotations of the text. Techniques from text visualization can overcome these barriers, facilitating intuitive data analysis for NLP researchers and practitioners, as well as stakeholders in NLP systems, who may not have data science or computing skills. In this paper we discuss challenges with current dataset creation practices and annotation platforms, followed by a discussion of text visualization techniques that enable open-ended, multi-faceted, and iterative analysis of annotated data.
2020
pdf
bib
abs
Situated Data, Situated Systems: A Methodology to Engage with Power Relations in Natural Language Processing Research
Lucy Havens
|
Melissa Terras
|
Benjamin Bach
|
Beatrice Alex
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing
We propose a bias-aware methodology to engage with power relations in natural language processing (NLP) research. NLP research rarely engages with bias in social contexts, limiting its ability to mitigate bias. While researchers have recommended actions, technical methods, and documentation practices, no methodology exists to integrate critical reflections on bias with technical NLP methods. In this paper, after an extensive and interdisciplinary literature review, we contribute a bias-aware methodology for NLP research. We also contribute a definition of biased text, a discussion of the implications of biased NLP systems, and a case study demonstrating how we are executing the bias-aware methodology in research on archival metadata descriptions.
pdf
bib
abs
Geoparsing the historical Gazetteers of Scotland: accurately computing location in mass digitised texts
Rosa Filgueira
|
Claire Grover
|
Melissa Terras
|
Beatrice Alex
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
This paper describes work in progress on devising automatic and parallel methods for geoparsing large digital historical textual data by combining the strengths of three natural language processing (NLP) tools, the Edinburgh Geoparser, spaCy and defoe, and employing different tokenisation and named entity recognition (NER) techniques. We apply these tools to a large collection of nineteenth century Scottish geographical dictionaries, and describe preliminary results obtained when processing this data.