Cost-efficient Crowdsourcing for Span-based Sequence Labeling: Worker Selection and Data Augmentation

Wang Yujie (誉杰 王); Huang Chao; Yang Liner (麟儿 杨); Fang Zhixuan; Huang Yaping; Liu Yang (刘扬); Yu Jingsi (余婧思); Yang Erhong (尔弘 杨)

Abstract

“This paper introduces a novel crowdsourcing worker selection algorithm, enhancing annotation quality and reducing costs. Unlike previous studies targeting simpler tasks, this study contends with the complexities of label interdependencies in sequence labeling. The proposed algorithm utilizes a Combinatorial Multi-Armed Bandit (CMAB) approach for worker selection, and a cost-effective human feedback mechanism. The challenge of dealing with imbalanced and small-scale datasets, which hinders offline simulation of worker selection, is tackled using an innovative data augmentation method termed shifting, expanding, and shrinking (SES). Rigorous testing on CoNLL 2003 NER and Chinese OEI datasets showcased the algorithm’s efficiency, with an increase in F1 score up to 100.04% of the expert-only baseline, alongside cost savings up to 65.97%. The paper also encompasses a dataset-independent test emulating annotation evaluation through a Bernoulli distribution, which still led to an impressive 97.56% F1 score of the expert baseline and 59.88% cost savings. Furthermore, our approach can be seamlessly integrated into Reinforcement Learning from Human Feedback (RLHF) systems, offering a cost-effective solution for obtaining human feedback. All resources, including source code and datasets, are available to the broader research community at https://github.com/blcuicall/nlp-crowdsourcing.”

Anthology ID: 2024.ccl-1.96
Volume: Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
Month: July
Year: 2024
Address: Taiyuan, China
Editors: Maosong Sun, Jiye Liang, Xianpei Han, Zhiyuan Liu, Yulan He
Venue: CCL
Publisher: Chinese Information Processing Society of China
Pages: 1239–1256
Language: English
URL: https://aclanthology.org/2024.ccl-1.96/
Cite (ACL): Wang Yujie, Huang Chao, Yang Liner, Fang Zhixuan, Huang Yaping, Liu Yang, Yu Jingsi, and Yang Erhong. 2024. Cost-efficient Crowdsourcing for Span-based Sequence Labeling: Worker Selection and Data Augmentation. In Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference), pages 1239–1256, Taiyuan, China. Chinese Information Processing Society of China.
Cite (Informal): Cost-efficient Crowdsourcing for Span-based Sequence Labeling: Worker Selection and Data Augmentation (Yujie et al., CCL 2024)
PDF: https://aclanthology.org/2024.ccl-1.96.pdf
