Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties

Tessa Masis, Anissa Neal, Lisa Green, Brendan O’Connor


Abstract
The study of language variation examines how language varies between and within different groups of speakers, shedding light on how we use language to construct identities and how social contexts affect language use. A common method is to identify instances of a certain linguistic feature (say, the zero copula construction) in a corpus, and analyze the feature’s distribution across speakers, topics, and other variables, to either gain a qualitative understanding of the feature’s function or systematically measure variation. In this paper, we explore the challenging task of automatic morphosyntactic feature detection in low-resource English varieties. We present a human-in-the-loop approach to generate and filter effective contrast sets via corpus-guided edits. We show that our approach improves feature detection for both Indian English and African American English, demonstrate how it can assist linguistic research, and release our fine-tuned models for use by other researchers.
Anthology ID:
2022.fieldmatters-1.2
Volume:
Proceedings of the first workshop on NLP applications to field linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Oleg Serikov, Ekaterina Voloshina, Anna Postnikova, Elena Klyachko, Ekaterina Neminova, Ekaterina Vylomova, Tatiana Shavrina, Eric Le Ferrand, Valentin Malykh, Francis Tyers, Timofey Arkhangelskiy, Vladislav Mikhailov, Alena Fenogenova
Venue:
FieldMatters
Publisher:
International Conference on Computational Linguistics
Pages:
11–25
URL:
https://aclanthology.org/2022.fieldmatters-1.2
Cite (ACL):
Tessa Masis, Anissa Neal, Lisa Green, and Brendan O’Connor. 2022. Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties. In Proceedings of the first workshop on NLP applications to field linguistics, pages 11–25, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
Cite (Informal):
Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties (Masis et al., FieldMatters 2022)
PDF:
https://aclanthology.org/2022.fieldmatters-1.2.pdf
Code:
slanglab/cgedit