Crowdsourcing Beyond Annotation: Case Studies in Benchmark Data Collection

Alane Suhr, Clara Vania, Nikita Nangia, Maarten Sap, Mark Yatskar, Samuel R. Bowman, Yoav Artzi


Abstract
Crowdsourcing from non-experts is one of the most common approaches to collecting data and annotations in NLP. Yet despite being such a fundamental tool, its use is largely guided by common practice and the personal experience of researchers; developing a theory of crowdsourcing for practical language problems remains an open challenge. However, various principles and practices have proven effective for generating high-quality, diverse data. This tutorial exposes NLP researchers to such crowdsourced data collection methods and principles through a detailed discussion of a diverse set of case studies. The selected case studies focus on challenging settings in which crowdworkers are asked to write original text or otherwise perform relatively unconstrained work. Through these case studies, we discuss in detail processes that were carefully designed to yield data with specific properties, for example data requiring logical inference, grounded reasoning, or conversational understanding. Each case study focuses on protocol details of crowdsourced data collection that often receive limited attention in research presentations, such as conference talks, but are critical for research success.
Anthology ID:
2021.emnlp-tutorials.1
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic & Online
Editors:
Jing Jiang, Ivan Vulić
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1–6
URL:
https://aclanthology.org/2021.emnlp-tutorials.1
DOI:
10.18653/v1/2021.emnlp-tutorials.1
Cite (ACL):
Alane Suhr, Clara Vania, Nikita Nangia, Maarten Sap, Mark Yatskar, Samuel R. Bowman, and Yoav Artzi. 2021. Crowdsourcing Beyond Annotation: Case Studies in Benchmark Data Collection. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, pages 1–6, Punta Cana, Dominican Republic & Online. Association for Computational Linguistics.
Cite (Informal):
Crowdsourcing Beyond Annotation: Case Studies in Benchmark Data Collection (Suhr et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-tutorials.1.pdf
Data:
MultiNLI, NLVR, QuAC