A Dataset of Crowdsourced Word Sequences: Collections and Answer Aggregation for Ground Truth Creation

Jiyi Li, Fumiyo Fukumoto


Abstract
The target outputs of many NLP tasks are word sequences. For collecting the data to train and evaluate models, the crowd is cheaper and easier to access than an oracle. To ensure the quality of crowdsourced data, one can assign multiple workers to each question and then aggregate their answers, which vary in quality, into a single golden answer. How to aggregate multiple crowdsourced word sequences of varying quality is an interesting and challenging problem, and addressing it requires a dataset. We therefore create a dataset (CrowdWSA2019) that contains translated sentences generated by multiple workers. We provide three approaches as baselines for the task of extractive word sequence aggregation. Specifically, one of them is an original approach we propose that models the reliability of workers. We also discuss some issues in ground truth creation for word sequences that can be addressed using this dataset.
Anthology ID:
D19-5904
Volume:
Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP
Month:
November
Year:
2019
Address:
Hong Kong
Editors:
Silviu Paun, Dirk Hovy
Venue:
WS
Publisher:
Association for Computational Linguistics
Pages:
24–28
URL:
https://aclanthology.org/D19-5904
DOI:
10.18653/v1/D19-5904
Cite (ACL):
Jiyi Li and Fumiyo Fukumoto. 2019. A Dataset of Crowdsourced Word Sequences: Collections and Answer Aggregation for Ground Truth Creation. In Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP, pages 24–28, Hong Kong. Association for Computational Linguistics.
Cite (Informal):
A Dataset of Crowdsourced Word Sequences: Collections and Answer Aggregation for Ground Truth Creation (Li & Fukumoto, 2019)
PDF:
https://aclanthology.org/D19-5904.pdf