Debin Xiao
2025
A Simple yet Efficient Prompt Compression Method for Text Classification Data Annotation Using LLM
Yiran Xie
|
Debin Xiao
|
Ping Wang
|
Shuming Liu
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Effectively balancing accuracy and cost is a critical challenge when using large language models (LLMs) for corpus annotation. This paper introduces a novel compression method based on keyword extraction (PCKE) that effectively reduces the number of prompt tokens in text classification annotation tasks, with minimal to no loss in accuracy. Our approach begins with an LLM that generates both category labels and relevant keywords from a small unannotated dataset. These outputs are used to train a BERT-based multi-task model capable of simultaneous classification and keyword extraction. For larger unannotated corpora, this model extracts keywords which are then used in place of full texts for LLM annotation. The significant reduction in prompt tokens result in substantial cost savings. Furthermore, the use of a few well-chosen keywords ensures that classification accuracy is maintained. Extensive experiments validate that our method not only achieves a superior compression rate but also maintains high accuracy, outperforming existing general-purpose compression techniques. Our approach offers a practical and cost-efficient solution for large-scale text classification annotation using LLMs, particularly applicable in industrial settings.
2023
Emotion Cause Extraction on Social Media without Human Annotation
Debin Xiao
|
Rui Xia
|
Jianfei Yu
Findings of the Association for Computational Linguistics: ACL 2023
In social media, there is a vast amount of information pertaining to people’s emotions and the corresponding causes. The emotion cause extraction (ECE) from social media data is an important research area that has not been thoroughly explored due to the lack of fine-grained annotations. Early studies referred to either unsupervised rule-based methods or supervised machine learning methods using a number of manually annotated data in specific domains. However, the former suffers from limitations in extraction performance, while the latter is constrained by the availability of fine-grained annotations and struggles to generalize to diverse domains. To address these issues, this paper proposes a new ECE framework on Chinese social media that achieves high extraction performance and generalizability without relying on human annotation. Specifically, we design a more dedicated rule-based system based on constituency parsing tree to discover causal patterns in social media. This system enables us to acquire large amounts of fine-grained annotated data. Next, we train a neural model on the rule-annotated dataset with a specific training strategy to further improve the model’s generalizability. Extensive experiments demonstrate the superiority of our approach over other methods in unsupervised and weakly-supervised settings.