Eve Fleisig


2023

Incorporating Worker Perspectives into MTurk Annotation Practices for NLP
Olivia Huang | Eve Fleisig | Dan Klein
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Current practices regarding data collection for natural language processing on Amazon Mechanical Turk (MTurk) often rely on a combination of studies on data quality and heuristics shared among NLP researchers. However, without considering the perspectives of MTurk workers, these approaches are susceptible to issues regarding workers’ rights and poor response quality. We conducted a critical literature review and a survey of MTurk workers aimed at addressing open questions regarding best practices for fair payment, worker privacy, data quality, and considering worker incentives. We found that worker preferences are often at odds with received wisdom among NLP researchers. Surveyed workers preferred reliable, reasonable payments over uncertain, very high payments; reported frequently lying on demographic questions; and expressed frustration at having work rejected with no explanation. We also found that workers view some quality control methods, such as requiring minimum response times or Master’s qualifications, as biased and largely ineffective. Based on the survey results, we provide recommendations on how future NLP studies may better account for MTurk workers’ experiences in order to respect workers’ rights and improve data quality.

When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks
Eve Fleisig | Rediet Abebe | Dan Klein
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Though majority vote among annotators is typically used for ground truth labels in machine learning, annotator disagreement in tasks such as hate speech detection may reflect systematic differences in opinion across groups, not noise. Thus, a crucial problem in hate speech detection is determining if a statement is offensive to the demographic group that it targets, when that group may be a small fraction of the annotator pool. We construct a model that predicts individual annotator ratings on potentially offensive text and combines this information with the predicted target group of the text to predict the ratings of target group members. We show gains across a range of metrics, including raising performance over the baseline by 22% at predicting individual annotators’ ratings and by 33% at predicting variance among annotators, which provides a metric for model uncertainty downstream. We find that annotators’ ratings can be predicted using their demographic information as well as opinions on online content, and that non-invasive questions on annotators’ online experiences minimize the need to collect demographic information when predicting annotators’ opinions.
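The problem this abstract describes can be illustrated with a small sketch. It is not the paper's model: the ratings, the group names (`group_a`, `group_b`), and the choice of target group are made-up assumptions, used only to show how a majority vote can hide the view of a target group that is a minority of the annotator pool, and how variance among annotators can be computed as a disagreement signal.

```python
from collections import Counter
from statistics import mean, pvariance

# Hypothetical ratings for one text: (annotator_group, rating).
# 1 = offensive, 0 = not offensive. All values are invented for illustration.
ratings = [
    ("group_a", 0), ("group_a", 0), ("group_a", 0), ("group_a", 0),
    ("group_b", 1), ("group_b", 1),
]
target_group = "group_b"  # assumed to be the group the text targets


def majority_vote(labels):
    """Return the most common label among all annotators."""
    return Counter(labels).most_common(1)[0][0]


all_labels = [rating for _, rating in ratings]
target_labels = [rating for group, rating in ratings if group == target_group]

majority_label = majority_vote(all_labels)   # label a majority-vote pipeline would keep
target_mean = mean(target_labels)            # average rating among target group members
rating_variance = pvariance(all_labels)      # disagreement across the whole pool

print(majority_label, target_mean, rating_variance)
```

In this toy pool the majority label and the target group's average rating point in opposite directions, which is the situation the abstract argues should not be treated as noise.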

Centering the Margins: Outlier-Based Identification of Harmed Populations in Toxicity Detection
Vyoma Raman | Eve Fleisig | Dan Klein
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The impact of AI models on marginalized communities has traditionally been measured by identifying performance differences between specified demographic subgroups. Though this approach aims to center vulnerable groups, it risks obscuring patterns of harm faced by intersectional subgroups or shared across multiple groups. To address this, we draw on theories of marginalization from disability studies and related disciplines, which state that people farther from the norm face greater adversity, to consider the “margins” in the domain of toxicity detection. We operationalize the “margins” of a dataset by employing outlier detection to identify text about people with demographic attributes distant from the “norm”. We find that model performance is consistently worse for demographic outliers, with mean squared error (MSE) up to 70.4% worse for outliers than for non-outliers across toxicity types. It is also worse for text outliers, with an MSE up to 68.4% higher for outliers than non-outliers. We also find text and demographic outliers to be particularly susceptible to errors in the classification of severe toxicity and identity attacks. Compared to analysis of disparities using traditional demographic breakdowns, we find that our outlier analysis frequently surfaces greater harms faced by a larger, more intersectional group, which suggests that outlier analysis is particularly beneficial for identifying harms against those groups.
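The kind of comparison reported in this abstract (how much higher MSE is for outliers than for non-outliers) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the scores are invented, and the `is_outlier` flags are assumed to come from a separate outlier detection step that is not implemented here.

```python
def mse(pairs):
    """Mean squared error over (predicted, annotated) score pairs."""
    return sum((pred - gold) ** 2 for pred, gold in pairs) / len(pairs)


# Hypothetical examples: (predicted toxicity, annotated toxicity, is_outlier).
# The outlier flags stand in for the output of an outlier detector.
examples = [
    (0.2, 0.1, False),
    (0.7, 0.8, False),
    (0.5, 0.5, False),
    (0.1, 0.5, True),
    (0.9, 0.6, True),
]

outlier_pairs = [(p, g) for p, g, flag in examples if flag]
non_outlier_pairs = [(p, g) for p, g, flag in examples if not flag]

outlier_mse = mse(outlier_pairs)
non_outlier_mse = mse(non_outlier_pairs)

# Relative increase in error for outliers, as a fraction of the non-outlier MSE.
relative_increase = (outlier_mse - non_outlier_mse) / non_outlier_mse

print(f"outlier MSE: {outlier_mse:.4f}")
print(f"non-outlier MSE: {non_outlier_mse:.4f}")
print(f"relative increase: {relative_increase:.1%}")
```

A positive `relative_increase` means the model's error is larger on the examples flagged as outliers; the toy numbers here are not comparable to the percentages reported in the abstract.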

FairPrism: Evaluating Fairness-Related Harms in Text Generation
Eve Fleisig | Aubrie Amstutz | Chad Atalla | Su Lin Blodgett | Hal Daumé III | Alexandra Olteanu | Emily Sheng | Dan Vann | Hanna Wallach
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

It is critical to measure and mitigate fairness-related harms caused by AI text generation systems, including stereotyping and demeaning harms. To that end, we introduce FairPrism, a dataset of 5,000 examples of AI-generated English text with detailed human annotations covering a diverse set of harms relating to gender and sexuality. FairPrism aims to address several limitations of existing datasets for measuring and mitigating fairness-related harms, including improved transparency, clearer specification of dataset coverage, and accounting for annotator disagreement and harms that are context-dependent. FairPrism’s annotations include the extent of stereotyping and demeaning harms, the demographic groups targeted, and appropriateness for different applications. The annotations also include specific harms that occur in interactive contexts and harms that raise normative concerns when the “speaker” is an AI system. Due to its precision and granularity, FairPrism can be used to diagnose (1) the types of fairness-related harms that AI text generation systems cause, and (2) the potential limitations of mitigation methods, both of which we illustrate through case studies. Finally, the process we followed to develop FairPrism offers a recipe for building improved datasets for measuring and mitigating harms caused by AI systems.

2020

Bilingual Lexical Access and Cognate Idiom Comprehension
Eve Fleisig
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

Language transfer can facilitate learning L2 words whose form and meaning are similar to L1 words, or hinder speakers when the languages differ. L2 idioms introduce another layer of challenge, as language transfer could occur on the literal or figurative level of meaning. Thus, the mechanics of language transfer for idiom processing shed light on how literal and figurative meaning is stored in the bilingual lexicon. Three factors appear to influence how language transfer affects idiom comprehension: bilingual fluency, processing of literal-figurative vs. figurative cognate idioms (idioms with the same wording and meaning in both languages, or the same meaning only), and comprehension of literal vs. figurative meaning of a given idiom. To examine the relationship between these factors, this study investigated English-Spanish bilinguals’ reaction time on a lexical decision task examining literal-figurative and figurative cognate idioms. The results suggest that fluency increases processing speed rather than slowing it down due to language transfer, and that language transfer from L1 to L2 occurs on the level of figurative meaning in L1-dominant bilinguals.