Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources

Xinyan Yu, Trina Chatterjee, Akari Asai, Junjie Hu, Eunsol Choi


Abstract
While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys estimating the availability of resources based on the number of datasets can be misleading as dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created (their input text and label sources, and the tools used to build them) and what they study (the tasks they address and the motivations for their creation). After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey language-proficient NLP researchers and crowd workers per language, finding that their estimated availability correlates with dataset availability. Through crowdsourcing experiments, we identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform. We conclude by making macro and micro-level suggestions to the NLP community and individual researchers for future multilingual data development.
Anthology ID:
2022.findings-emnlp.273
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3725–3743
URL:
https://aclanthology.org/2022.findings-emnlp.273
DOI:
10.18653/v1/2022.findings-emnlp.273
Cite (ACL):
Xinyan Yu, Trina Chatterjee, Akari Asai, Junjie Hu, and Eunsol Choi. 2022. Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3725–3743, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources (Yu et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-emnlp.273.pdf
Video:
https://aclanthology.org/2022.findings-emnlp.273.mp4