Odette Scharenborg


2024

pdf bib
Finding Spoken Identifications: Using GPT-4 Annotation for an Efficient and Fast Dataset Creation Pipeline
Maliha Jahan | Helin Wang | Thomas Thebaud | Yinglun Sun | Giang Ha Le | Zsuzsanna Fagyal | Odette Scharenborg | Mark Hasegawa-Johnson | Laureano Moro Velazquez | Najim Dehak
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge, we present a semi-automated dataset creation pipeline that leverages large language models. We use this pipeline to generate a dataset of speakers identifying themself or another speaker as belonging to a particular race, ethnicity, or national origin group. We use OpenaAI’s GPT-4 to perform two complex annotation tasks- separating files relevant to our intended dataset from the irrelevant ones (filtering) and finding and extracting information on identifications within a transcript (tagging). By evaluating GPT-4’s performance using human annotations as ground truths, we show that it can reduce resources required by dataset annotation while barely losing any important information. For the filtering task, GPT-4 had a very low miss rate of 6.93%. GPT-4’s tagging performance showed a trade-off between precision and recall, where the latter got as high as 97%, but precision never exceeded 45%. Our approach reduces the time required for the filtering and tagging tasks by 95% and 80%, respectively. We also present an in-depth error analysis of GPT-4’s performance.

2022

pdf bib
Using Mixed Incentives to Document Xi’an Guanzhong
Juhong Zhan | Yue Jiang | Christopher Cieri | Mark Liberman | Jiahong Yuan | Yiya Chen | Odette Scharenborg
Proceedings of the 2nd Workshop on Novel Incentives in Data Collection from People: models, implementations, challenges and results within LREC 2022

This paper describes our use of mixed incentives and the citizen science portal LanguageARC to prepare, collect and quality control a large corpus of object namings for the purpose of providing speech data to document the under-represented Guanzhong dialect of Chinese spoken in the Shaanxi province in the environs of Xi’an.