Mind the data gap(s): Investigating power in speech and language datasets

Nina Markl


Abstract
Algorithmic oppression is an urgent and persistent problem in speech and language technologies. Considering power relations embedded in datasets before compiling or using them to train or test speech and language technologies is essential to designing less harmful, more just technologies. This paper presents a reflective exercise to recognise and challenge gaps and the power relations they reveal in speech and language datasets by applying principles of Data Feminism and Design Justice, and building on work on dataset documentation and sociolinguistics.
Anthology ID:
2022.ltedi-1.1
Volume:
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Bharathi Raja Chakravarthi, B Bharathi, John P McCrae, Manel Zarrouk, Kalika Bali, Paul Buitelaar
Venue:
LTEDI
Publisher:
Association for Computational Linguistics
Pages:
1–12
URL:
https://aclanthology.org/2022.ltedi-1.1
DOI:
10.18653/v1/2022.ltedi-1.1
Cite (ACL):
Nina Markl. 2022. Mind the data gap(s): Investigating power in speech and language datasets. In Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, pages 1–12, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Mind the data gap(s): Investigating power in speech and language datasets (Markl, LTEDI 2022)
PDF:
https://aclanthology.org/2022.ltedi-1.1.pdf
Video:
https://aclanthology.org/2022.ltedi-1.1.mp4
Data
Common Voice