Zeerak Talat


2024

pdf bib
Understanding “Democratization” in NLP and ML Research
Arjun Subramonian | Vagrant Gautam | Dietrich Klakow | Zeerak Talat
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent improvements in natural language processing (NLP) and machine learning (ML) and increased mainstream adoption have led to researchers frequently discussing the “democratization” of artificial intelligence. In this paper, we seek to clarify how democratization is understood in NLP and ML publications, through large-scale mixed-methods analyses of papers using the keyword “democra*” published in NLP and adjacent venues. We find that democratization is most frequently used to convey (ease of) access to or use of technologies, without meaningfully engaging with theories of democratization, while research using other invocations of “democra*” tends to be grounded in theories of deliberation and debate. Based on our findings, we call for researchers to enrich their use of the term democratization with appropriate theory, towards democratic technologies beyond superficial access.

pdf bib
Metrics for What, Metrics for Whom: Assessing Actionability of Bias Evaluation Metrics in NLP
Pieter Delobelle | Giuseppe Attanasio | Debora Nozza | Su Lin Blodgett | Zeerak Talat
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper introduces the concept of actionability in the context of bias measures in natural language processing (NLP). We define actionability as the degree to which a measure’s results enable informed action and propose a set of desiderata for assessing it. Building on existing frameworks such as measurement modeling, we argue that actionability is a crucial aspect of bias measures that has been largely overlooked in the literature.We conduct a comprehensive review of 146 papers proposing bias measures in NLP, examining whether and how they provide the information required for actionable results. Our findings reveal that many key elements of actionability, including a measure’s intended use and reliability assessment, are often unclear or entirely absent.This study highlights a significant gap in the current approach to developing and reporting bias measures in NLP. We argue that this lack of clarity may impede the effective implementation and utilization of these measures. To address this issue, we offer recommendations for more comprehensive and actionable metric development and reporting practices in NLP bias research.

pdf bib
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)
Yi-Ling Chung | Zeerak Talat | Debora Nozza | Flor Miriam Plaza-del-Arco | Paul Röttger | Aida Mostafazadeh Davani | Agostina Calabrese
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

pdf bib
Subjective Isms? On the Danger of Conflating Hate and Offence in Abusive Language Detection
Amanda Cercas Curry | Gavin Abercrombie | Zeerak Talat
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

Natural language processing research has begun to embrace the notion of annotator subjectivity, motivated by variations in labelling. This approach understands each annotator’s view as valid, which can be highly suitable for tasks that embed subjectivity, e.g., sentiment analysis. However, this construction may be inappropriate for tasks such as hate speech detection, as it affords equal validity to all positions on e.g., sexism or racism. We argue that the conflation of hate and offence can invalidate findings on hate speech, and call for future work to be situated in theory, disentangling hate from its orthogonal concept, offence.

pdf bib
The Perspectivist Paradigm Shift: Assumptions and Challenges of Capturing Human Labels
Eve Fleisig | Su Lin Blodgett | Dan Klein | Zeerak Talat
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Longstanding data labeling practices in machine learning involve collecting and aggregating labels from multiple annotators. But what should we do when annotators disagree? Though annotator disagreement has long been seen as a problem to minimize, new perspectivist approaches challenge this assumption by treating disagreement as a valuable source of information. In this position paper, we examine practices and assumptions surrounding the causes of disagreement–some challenged by perspectivist approaches, and some that remain to be addressed–as well as practical and normative challenges for work operating under these assumptions. We conclude with recommendations for the data labeling pipeline and avenues for future research engaging with subjectivity and disagreement.

pdf bib
Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon
Fajri Koto | Tilman Beck | Zeerak Talat | Iryna Gurevych | Timothy Baldwin
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Improving multilingual language models capabilities in low-resource languages is generally difficult due to the scarcity of large-scale data in those languages. In this paper, we relax the reliance on texts in low-resource languages by using multilingual lexicons in pretraining to enhance multilingual capabilities. Specifically, we focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets. We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets, and large language models like GPT–3.5, BLOOMZ, and XGLM. These findings are observable for unseen low-resource languages to code-mixed scenarios involving high-resource languages.

pdf bib
Classist Tools: Social Class Correlates with Performance in NLP
Amanda Cercas Curry | Giuseppe Attanasio | Zeerak Talat | Dirk Hovy
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The field of sociolinguistics has studied factors affecting language use for the last century. Labov (1964) and Bernstein (1960) showed that socioeconomic class strongly influences our accents, syntax and lexicon. However, despite growing concerns surrounding fairness and bias in Natural Language Processing (NLP), there is a dearth of studies delving into the effects it may have on NLP systems. We show empirically that NLP systems’ performance is affected by speakers’ SES, potentially disadvantaging less-privileged socioeconomic groups. We annotate a corpus of 95K utterances from movies with social class, ethnicity and geographical language variety and measure the performance of NLP systems on three tasks: language modelling, automatic speech recognition, and grammar error correction. We find significant performance disparities that can be attributed to socioeconomic status as well as ethnicity and geographical differences. With NLP technologies becoming ever more ubiquitous and quotidian, they must accommodate all language varieties to avoid disadvantaging already marginalised groups. We argue for the inclusion of socioeconomic class in future language technologies.

pdf bib
Impoverished Language Technology: The Lack of (Social) Class in NLP
Amanda Cercas Curry | Zeerak Talat | Dirk Hovy
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Since Labov’s foundational 1964 work on the social stratification of language, linguistics has dedicated concerted efforts towards understanding the relationships between socio-demographic factors and language production and perception. Despite the large body of evidence identifying significant relationships between socio-demographic factors and language production, relatively few of these factors have been investigated in the context of NLP technology. While age and gender are well covered, Labov’s initial target, socio-economic class, is largely absent. We survey the existing Natural Language Processing (NLP) literature and find that only 20 papers even mention socio-economic status. However, the majority of those papers do not engage with class beyond collecting information of annotator-demographics. Given this research lacuna, we provide a definition of class that can be operationalised by NLP researchers, and argue for including socio-economic class in future language technologies.

2023

pdf bib
Thorny Roses: Investigating the Dual Use Dilemma in Natural Language Processing
Lucie-Aimée Kaffee | Arnav Arora | Zeerak Talat | Isabelle Augenstein
Findings of the Association for Computational Linguistics: EMNLP 2023

Dual use, the intentional, harmful reuse of technology and scientific artefacts, is an ill-defined problem within the context of Natural Language Processing (NLP). As large language models (LLMs) have advanced in their capabilities and become more accessible, the risk of their intentional misuse becomes more prevalent. To prevent such intentional malicious use, it is necessary for NLP researchers and practitioners to understand and mitigate the risks of their research. Hence, we present an NLP-specific definition of dual use informed by researchers and practitioners in the field. Further, we propose a checklist focusing on dual-use in NLP, that can be integrated into existing conference ethics-frameworks. The definition and checklist are created based on a survey of NLP researchers and practitioners.

pdf bib
Mirages. On Anthropomorphism in Dialogue Systems
Gavin Abercrombie | Amanda Cercas Curry | Tanvi Dinkar | Verena Rieser | Zeerak Talat
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Automated dialogue or conversational systems are anthropomorphised by developers and personified by users. While a degree of anthropomorphism is inevitable, conscious and unconscious design choices can guide users to personify them to varying degrees. Encouraging users to relate to automated systems as if they were human can lead to transparency and trust issues, and high risk scenarios caused by over-reliance on their outputs. As a result, natural language processing researchers have investigated the factors that induce personification and develop resources to mitigate such effects. However, these efforts are fragmented, and many aspects of anthropomorphism have yet to be explored. In this paper, we discuss the linguistic factors that contribute to the anthropomorphism of dialogue systems and the harms that can arise thereof, including reinforcing gender stereotypes and conceptions of acceptable language. We recommend that future efforts towards developing dialogue systems take particular care in their design, development, release, and description; and attend to the many linguistic cues that can elicit personification by users.

pdf bib
A Federated Approach for Hate Speech Detection
Jay Gala | Deep Gandhi | Jash Mehta | Zeerak Talat
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Hate speech detection has been the subject of high research attention, due to the scale of content created on social media. In spite of the attention and the sensitive nature of the task, privacy preservation in hate speech detection has remained under-studied. The majority of research has focused on centralised machine learning infrastructures which risk leaking data. In this paper, we show that using federated machine learning can help address privacy the concerns that are inherent to hate speech detection while obtaining up to 6.81% improvement in terms of F1-score.

2022

pdf bib
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)
Kanika Narang | Aida Mostafazadeh Davani | Lambert Mathias | Bertie Vidgen | Zeerak Talat
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

pdf bib
Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models
Paul Röttger | Haitham Seelawi | Debora Nozza | Zeerak Talat | Bertie Vidgen
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

Hate speech detection models are typically evaluated on held-out test sets. However, this risks painting an incomplete and potentially misleading picture of model performance because of increasingly well-documented systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, recent research has thus introduced functional tests for hate speech detection models. However, these tests currently only exist for English-language content, which means that they cannot support the development of more effective models in other languages spoken by billions across the world. To help address this issue, we introduce Multilingual HateCheck (MHC), a suite of functional tests for multilingual hate speech detection models. MHC covers 34 functionalities across ten languages, which is more languages than any other hate speech dataset. To illustrate MHC’s utility, we train and test a high-performing multilingual hate speech detection model, and reveal critical model weaknesses for monolingual and cross-lingual applications.

pdf bib
On the Machine Learning of Ethical Judgments from Natural Language
Zeerak Talat | Hagen Blix | Josef Valvoda | Maya Indira Ganesh | Ryan Cotterell | Adina Williams
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Ethics is one of the longest standing intellectual endeavors of humanity. In recent years, the fields of AI and NLP have attempted to address issues of harmful outcomes in machine learning systems that are made to interface with humans. One recent approach in this vein is the construction of NLP morality models that can take in arbitrary text and output a moral judgment about the situation described. In this work, we offer a critique of such NLP methods for automating ethical decision-making. Through an audit of recent work on computational approaches for predicting morality, we examine the broader issues that arise from such efforts. We conclude with a discussion of how machine ethics could usefully proceed in NLP, by focusing on current and near-future uses of technology, in a way that centers around transparency, democratic values, and allows for straightforward accountability.

pdf bib
Directions for NLP Practices Applied to Online Hate Speech Detection
Paula Fortuna | Monica Dominguez | Leo Wanner | Zeerak Talat
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Addressing hate speech in online spaces has been conceptualized as a classification task that uses Natural Language Processing (NLP) techniques. Through this conceptualization, the hate speech detection task has relied on common conventions and practices from NLP. For instance, inter-annotator agreement is conceptualized as a way to measure dataset quality and certain metrics and benchmarks are used to assure model generalization. However, hate speech is a deeply complex and situated concept that eludes such static and disembodied practices. In this position paper, we critically reflect on these methodologies for hate speech detection, we argue that many conventions in NLP are poorly suited for the problem and encourage researchers to develop methods that are more appropriate for the task.

pdf bib
A Federated Approach to Predicting Emojis in Hindi Tweets
Deep Gandhi | Jash Mehta | Nirali Parekh | Karan Waghela | Lynette D’Mello | Zeerak Talat
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The use of emojis affords a visual modality to, often private, textual communication.The task of predicting emojis however provides a challenge for machine learning as emoji use tends to cluster into the frequently used and the rarely used emojis.Much of the machine learning research on emoji use has focused on high resource languages and has conceptualised the task of predicting emojis around traditional server-side machine learning approaches.However, traditional machine learning approaches for private communication can introduce privacy concerns, as these approaches require all data to be transmitted to a central storage.In this paper, we seek to address the dual concerns of emphasising high resource languages for emoji prediction and risking the privacy of people’s data.We introduce a new dataset of 118k tweets (augmented from 25k unique tweets) for emoji prediction in Hindi, and propose a modification to the federated learning algorithm, CausalFedGSD, which aims to strike a balance between model performance and user privacy. We show that our approach obtains comparative scores with more complex centralised models while reducing the amount of data required to optimise the models and minimising risks to user privacy.

pdf bib
You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings
Zeerak Talat | Aurélie Névéol | Stella Biderman | Miruna Clinciu | Manan Dey | Shayne Longpre | Sasha Luccioni | Maraim Masoud | Margaret Mitchell | Dragomir Radev | Shanya Sharma | Arjun Subramonian | Jaesung Tae | Samson Tan | Deepak Tunuguntla | Oskar Van Der Wal
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models

Evaluating bias, fairness, and social impact in monolingual language models is a difficult task. This challenge is further compounded when language modeling occurs in a multilingual context. Considering the implication of evaluation biases for large multilingual language models, we situate the discussion of bias evaluation within a wider context of social scientific research with computational work. We highlight three dimensions of developing multilingual bias evaluation frameworks: (1) increasing transparency through documentation, (2) expanding targets of bias beyond gender, and (3) addressing cultural differences that exist between languages. We further discuss the power dynamics and consequences of training large language models and recommend that researchers remain cognizant of the ramifications of developing such technologies.