Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP

Wei Du, Laksh Advani, Yashmeet Gambhir, Daniel Perry, Prashant Shiralkar, Zhengzheng Xing, Aaron Colak


Abstract
Large language models (LLMs) have demonstrated a significant capability to generalize across a large number of NLP tasks. For industry applications, it is imperative to periodically assess the performance of an LLM on unlabeled production data to validate its behavior in a real-world setting. Human labeling to assess model error incurs considerable expense and time delay. Here we demonstrate that ensemble disagreement scores work well as a proxy for human labeling of language models in zero-shot, few-shot, and fine-tuned settings, based on our evaluation on the keyphrase extraction (KPE) task. We measure the fidelity of the results by comparing against the true error measured from human-labeled ground truth. We contrast this with the alternative of using another LLM as a source of machine labels, or ‘silver labels’. Results across various languages and domains show that disagreement scores provide a better estimate of model performance, with mean average error (MAE) as low as 0.4% and on average 13.8% better than using silver labels.
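The abstract does not spell out how the disagreement score is computed, so the following is a minimal illustrative sketch, assuming average pairwise Jaccard disagreement between ensemble members' keyphrase sets; the function names (disagreement_score, estimated_error, mae) and the toy corpus are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: the paper's exact disagreement formulation is not
# given in the abstract; pairwise (1 - Jaccard) is an assumed stand-in.
from itertools import combinations
from statistics import mean


def disagreement_score(ensemble_predictions):
    """Average pairwise disagreement (1 - Jaccard) between the keyphrase
    sets produced by ensemble members for a single document."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 1.0
    pairs = combinations(ensemble_predictions, 2)
    return mean(1.0 - jaccard(a, b) for a, b in pairs)


def estimated_error(corpus_predictions):
    """Corpus-level error estimate: mean disagreement across documents,
    computed without any human labels."""
    return mean(disagreement_score(preds) for preds in corpus_predictions)


def mae(estimate, true_error):
    """Absolute gap between the proxy estimate and the error measured
    from human-labeled ground truth (the fidelity measure)."""
    return abs(estimate - true_error)


# Toy example: three ensemble members extracting keyphrases from two documents.
corpus = [
    [["battery life", "screen"], ["battery life"], ["battery life", "screen"]],
    [["shipping"], ["shipping", "price"], ["shipping"]],
]
proxy = estimated_error(corpus)
print(proxy)              # unsupervised proxy for model error
print(mae(proxy, 0.25))   # fidelity against a hypothetical human-labeled error of 0.25
```

In this sketch the proxy is validated exactly as the abstract describes: the disagreement-based estimate is compared to the true error from human-labeled ground truth, and a low MAE indicates the proxy tracks real model performance.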
Anthology ID:
2023.gem-1.5
Volume:
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Month:
December
Year:
2023
Address:
Singapore
Editors:
Sebastian Gehrmann, Alex Wang, João Sedoc, Elizabeth Clark, Kaustubh Dhole, Khyathi Raghavi Chandu, Enrico Santus, Hooman Sedghamiz
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
53–61
URL:
https://aclanthology.org/2023.gem-1.5
Cite (ACL):
Wei Du, Laksh Advani, Yashmeet Gambhir, Daniel Perry, Prashant Shiralkar, Zhengzheng Xing, and Aaron Colak. 2023. Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 53–61, Singapore. Association for Computational Linguistics.
Cite (Informal):
Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP (Du et al., GEM-WS 2023)
PDF:
https://aclanthology.org/2023.gem-1.5.pdf