Efficient Data Labeling by Hierarchical Crowdsourcing with Large Language Models

Haodi Zhang (张昊迪); Junyu Yang; Jinyin Nie; Peirou Liang; Kaishun Wu; Defu Lian; Rui Mao; Yuanfeng Song

Abstract

Large language models (LLMs) have received considerable attention for their strong performance in in-context dialogue and for their potential to transform service industries through a new business model, Model-as-a-Service (MaaS). Automated data labeling is a natural and promising application of such services. However, labeling data with LLMs faces two main challenges: 1) the labels produced by LLMs may be uncertain, and 2) using LLMs for data labeling can be prohibitively expensive, because datasets are usually very large. In this paper, we propose a hierarchical framework named LMCrowd that leverages multiple LLMs for efficient data labeling under budget constraints. LMCrowd first aggregates labels from multiple freely available LLMs and then employs a large, paid MaaS LLM to relabel selected instances. Furthermore, we formalize the core process as an optimization problem whose goal is to select the optimal set of instances for relabeling by the MaaS LLM, given the current belief state. Extensive experimental evaluations across various real-world datasets demonstrate that our framework outperforms human labelers and GPT-4 in terms of both accuracy and efficiency.
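
The two-stage process described in the abstract can be illustrated with a minimal sketch. This is not the authors' LMCrowd algorithm: plain majority voting stands in for the aggregation step, and a greedy ranking by vote entropy stands in for the optimization over the belief state. The names `aggregate`, `label_with_budget`, and `paid_labeler`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log


def aggregate(votes):
    """Return (majority label, entropy of the empirical vote distribution)."""
    counts = Counter(votes)
    total = len(votes)
    label = counts.most_common(1)[0][0]
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return label, entropy


def label_with_budget(free_votes, paid_labeler, budget):
    """Aggregate free labels, then relabel the most uncertain instances.

    free_votes:   dict mapping instance id -> list of labels from free LLMs
    paid_labeler: callable mapping instance id -> label (stand-in for a paid LLM)
    budget:       maximum number of calls to paid_labeler
    """
    aggregated = {i: aggregate(v) for i, v in free_votes.items()}
    # Greedy heuristic (an assumption): spend the budget on the instances
    # whose free labels disagree the most, skipping unanimous ones.
    ranked = sorted(aggregated, key=lambda i: aggregated[i][1], reverse=True)
    to_relabel = [i for i in ranked[:budget] if aggregated[i][1] > 0]
    labels = {i: lab for i, (lab, _) in aggregated.items()}
    for i in to_relabel:
        labels[i] = paid_labeler(i)
    return labels, to_relabel


# Toy example: three instances, three free labelers, budget of one paid call.
free_votes = {
    "a": ["pos", "pos", "pos"],  # unanimous
    "b": ["pos", "neg", "neg"],  # mild disagreement
    "c": ["pos", "neg", "neu"],  # maximal disagreement
}
gold = {"a": "pos", "b": "neg", "c": "neu"}
labels, relabeled = label_with_budget(free_votes, lambda i: gold[i], budget=1)
```

With a budget of one, the single paid call goes to instance `c`, where the free labelers disagree the most, while `a` and `b` keep their majority labels. A belief-state formulation such as the one the paper describes would replace the entropy ranking with its own selection objective.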

Anthology ID: 2025.coling-main.748
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 11290–11303
URL: https://aclanthology.org/2025.coling-main.748/
Cite (ACL): Haodi Zhang, Junyu Yang, Jinyin Nie, Peirou Liang, Kaishun Wu, Defu Lian, Rui Mao, and Yuanfeng Song. 2025. Efficient Data Labeling by Hierarchical Crowdsourcing with Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 11290–11303, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Efficient Data Labeling by Hierarchical Crowdsourcing with Large Language Models (Zhang et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.748.pdf