Shuming Liu


2025

A Simple yet Efficient Prompt Compression Method for Text Classification Data Annotation Using LLM
Yiran Xie | Debin Xiao | Ping Wang | Shuming Liu
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Effectively balancing accuracy and cost is a critical challenge when using large language models (LLMs) for corpus annotation. This paper introduces a novel prompt compression method based on keyword extraction (PCKE) that substantially reduces the number of prompt tokens in text classification annotation tasks, with minimal to no loss in accuracy. Our approach begins with an LLM that generates both category labels and relevant keywords from a small unannotated dataset. These outputs are used to train a BERT-based multi-task model capable of simultaneous classification and keyword extraction. For larger unannotated corpora, this model extracts keywords, which are then used in place of the full texts for LLM annotation. The significant reduction in prompt tokens results in substantial cost savings, while the use of a few well-chosen keywords ensures that classification accuracy is maintained. Extensive experiments validate that our method not only achieves a superior compression rate but also maintains high accuracy, outperforming existing general-purpose compression techniques. Our approach offers a practical and cost-efficient solution for large-scale text classification annotation with LLMs, particularly in industrial settings.
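
The following is a minimal sketch (not the authors' released code) of the kind of BERT-based multi-task model the abstract describes: a shared encoder with one head for sentence-level classification and one for token-level keyword extraction, whose predicted keywords would replace the full text in the LLM annotation prompt. The model name, number of classes, and head shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class MultiTaskBert(nn.Module):
    """Shared BERT encoder with a classification head and a keyword-tagging head."""

    def __init__(self, model_name="bert-base-uncased", num_classes=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.cls_head = nn.Linear(hidden, num_classes)  # sentence-level category label
        self.kw_head = nn.Linear(hidden, 2)             # per-token keyword / non-keyword

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_logits = self.cls_head(out.last_hidden_state[:, 0])  # [CLS] representation
        kw_logits = self.kw_head(out.last_hidden_state)          # every token position
        return cls_logits, kw_logits


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiTaskBert()
batch = tokenizer(["An example sentence from the unannotated corpus."],
                  return_tensors="pt", padding=True, truncation=True)
cls_logits, kw_logits = model(batch["input_ids"], batch["attention_mask"])

# Tokens tagged as keywords stand in for the full text in the LLM prompt,
# which is what shrinks the number of prompt tokens sent for annotation.
keyword_mask = kw_logits.argmax(dim=-1).bool()
```

In this sketch the two heads share one encoder, matching the abstract's description of simultaneous classification and keyword extraction; how the heads are trained and how keywords are formatted into the annotation prompt are details left to the paper.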