Wang Yujie
Also published as: 誉杰 王
2024
Cost-efficient Crowdsourcing for Span-based Sequence Labeling: Worker Selection and Data Augmentation
Wang Yujie | Huang Chao | Yang Liner | Fang Zhixuan | Huang Yaping | Liu Yang | Yu Jingsi | Yang Erhong
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
“This paper introduces a novel crowdsourcing worker selection algorithm, enhancing annotation quality and reducing costs. Unlike previous studies targeting simpler tasks, this study contends with the complexities of label interdependencies in sequence labeling. The proposed algorithm utilizes a Combinatorial Multi-Armed Bandit (CMAB) approach for worker selection, and a cost-effective human feedback mechanism. The challenge of dealing with imbalanced and small-scale datasets, which hinders offline simulation of worker selection, is tackled using an innovative data augmentation method termed shifting, expanding, and shrinking (SES). Rigorous testing on CoNLL 2003 NER and Chinese OEI datasets showcased the algorithm’s efficiency, with an increase in F1 score up to 100.04% of the expert-only baseline, alongside cost savings up to 65.97%. The paper also encompasses a dataset-independent test emulating annotation evaluation through a Bernoulli distribution, which still led to an impressive 97.56% F1 score of the expert baseline and 59.88% cost savings. Furthermore, our approach can be seamlessly integrated into Reinforcement Learning from Human Feedback (RLHF) systems, offering a cost-effective solution for obtaining human feedback. All resources, including source code and datasets, are available to the broader research community at https://github.com/blcuicall/nlp-crowdsourcing.”
2023
人工智能生成语言与人类语言对比研究——以ChatGPT为例(A Comparative Study of Language between Artificial Intelligence and Human: A Case Study of ChatGPT)
Zhu Junhui (君辉 朱) | Wang Mengyan (梦焰 王) | Yang Erhong (尔弘 杨) | Nie Jingran (锦燃 聂) | Wang Yujie (誉杰 王) | Yue Yan (岩 岳) | Yang Liner (麟儿 杨)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics
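“ChatGPT, a chatbot built on natural language generation technology, can produce answers quickly, but the ways in which the language of machine-written answers differs from real human language have not yet been studied thoroughly. This study extracts and computes the distribution of 159 linguistic features in human and ChatGPT answers to Chinese open-domain questions, trains AI detectors with three machine learning algorithms (random forest, logistic regression, and support vector machine (SVM)), and evaluates model performance. The experimental results show that both random forest and SVM achieve high classification accuracy. Through comparative analysis, the study reveals the strengths and weaknesses of the two kinds of text along five dimensions: descriptive features, character and word commonness, character and word diversity, syntactic complexity, and discourse cohesion. The results show that the differences between the two kinds of text are concentrated mainly in three dimensions: descriptive features, character and word commonness, and character and word diversity.”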
Co-authors
- Yang Erhong (尔弘 杨) 2
- Yang Liner (麟儿 杨) 2
- Huang Chao 1
- Nie Jingran (锦燃 聂) 1
- Yu Jingsi (余婧思) 1
Venues
- ccl 2