NLPositionality: Characterizing Design Biases of Datasets and Models

Sebastin Santy, Jenny Liang, Ronan Le Bras, Katharina Reinecke, Maarten Sap


Abstract
Design biases in NLP systems, such as performance differences for different populations, often stem from their creator’s positionality, i.e., views and lived experiences shaped by identity and background. Despite the prevalence and risks of design biases, they are hard to quantify because researcher, system, and dataset positionality is often unobserved. We introduce NLPositionality, a framework for characterizing design biases and quantifying the positionality of NLP datasets and models. Our framework continuously collects annotations from a diverse pool of volunteer participants on LabintheWild, and statistically quantifies alignment with dataset labels and model predictions. We apply NLPositionality to existing datasets and models for two tasks—social acceptability and hate speech detection. To date, we have collected 16,299 annotations in over a year for 600 instances from 1,096 annotators across 87 countries. We find that datasets and models align predominantly with Western, White, college-educated, and younger populations. Additionally, certain groups, such as non-binary people and non-native English speakers, are further marginalized by datasets and models as they rank least in alignment across all tasks. Finally, we draw from prior literature to discuss how researchers can examine their own positionality and that of their datasets and models, opening the door for more inclusive NLP systems.
Anthology ID:
2023.acl-long.505
Original:
2023.acl-long.505v1
Version 2:
2023.acl-long.505v2
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
9080–9102
URL:
https://aclanthology.org/2023.acl-long.505
DOI:
10.18653/v1/2023.acl-long.505
Cite (ACL):
Sebastin Santy, Jenny Liang, Ronan Le Bras, Katharina Reinecke, and Maarten Sap. 2023. NLPositionality: Characterizing Design Biases of Datasets and Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9080–9102, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
NLPositionality: Characterizing Design Biases of Datasets and Models (Santy et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.505.pdf
Video:
https://aclanthology.org/2023.acl-long.505.mp4