Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification

Letian Peng, Yi Gu, Chengyu Dong, Zihan Wang, Jingbo Shang


Abstract
For extremely weakly supervised text classification, pioneering research generates pseudo labels by mining texts similar to the class names from the raw corpus, which may yield very few or even no samples for the minority classes. Recent works have started to generate relevant texts by prompting LLMs with the class names or definitions; however, there is a high risk that LLMs cannot generate in-distribution data (i.e., data similar to the corpus where the text classifier will be applied), leading to classifiers that do not generalize. In this paper, we combine the advantages of these two approaches and propose to bridge the gap via a novel framework, text grafting, which aims to obtain clean and near-distribution weak supervision for minority classes. Specifically, we first use LLM-based logits to mine masked templates from the raw corpus that have a high potential for data synthesis into the target minority class. Then, the templates are filled by state-of-the-art LLMs to synthesize near-distribution texts falling into minority classes. Text grafting shows significant improvement over direct mining or synthesis on minority classes. We also use analysis and case studies to understand the properties of text grafting.
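The two stages described in the abstract (mine masked templates, then fill them) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the LLM logits are turned into scores, so `score_tokens` and `fill` below are hypothetical callables standing in for the LLM-based scorer and the template-filling LLM, and `keep_ratio`, `top_k`, and the mask symbol are illustrative assumptions.

```python
from typing import Callable, List, Tuple

MASK = "_"  # assumed mask symbol, chosen only for this sketch


def make_template(tokens: List[str], scores: List[float],
                  keep_ratio: float) -> Tuple[str, float]:
    """Keep the highest-scoring tokens, mask the rest, merge adjacent masks.

    Returns the template string and the mean score of the kept tokens,
    used here as a stand-in for the template's synthesis potential.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    out: List[str] = []
    for i, tok in enumerate(tokens):
        if i in keep:
            out.append(tok)
        elif not out or out[-1] != MASK:
            out.append(MASK)  # collapse a run of masked tokens into one slot
    potential = sum(scores[i] for i in keep) / n_keep
    return " ".join(out), potential


def mine_templates(corpus: List[str],
                   score_tokens: Callable[[List[str]], List[float]],
                   keep_ratio: float = 0.5,
                   top_k: int = 1) -> List[Tuple[str, float]]:
    """Turn every raw text into a template and return the top_k by potential."""
    scored = []
    for text in corpus:
        tokens = text.split()
        if tokens:
            scored.append(make_template(tokens, score_tokens(tokens), keep_ratio))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def graft(template: str, fill: Callable[[str], str]) -> str:
    """Fill a mined template; `fill` stands in for a call to an LLM."""
    return fill(template)


# Toy stand-ins (assumptions): a real system would derive scores from LLM
# logits and would fill the masks by prompting an LLM with the target class.
def toy_score(tokens: List[str]) -> List[float]:
    cues = {"was", "great"}
    return [1.0 if t.lower() in cues else 0.0 for t in tokens]


def toy_fill(template: str) -> str:
    return template.replace(MASK, "everything")


corpus = ["the plot was slow", "the food was great"]
templates = mine_templates(corpus, toy_score, keep_ratio=0.5, top_k=1)
synthetic = [graft(t, toy_fill) for t, _ in templates]
```

Because the templates come from the raw corpus, the filled texts keep part of its wording and structure, which is the intuition behind "near-distribution" in the abstract; how closely this toy ranking matches the paper's criterion is not something the abstract settles.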
Anthology ID:
2024.emnlp-main.219
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3741–3752
URL:
https://aclanthology.org/2024.emnlp-main.219
Cite (ACL):
Letian Peng, Yi Gu, Chengyu Dong, Zihan Wang, and Jingbo Shang. 2024. Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3741–3752, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification (Peng et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.219.pdf