PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation

Gaurav Sahu, Olga Vechtomova, Dzmitry Bahdanau, Issam Laradji


Abstract
Data augmentation is a widely used technique to address the problem of text classification when there is a limited amount of training data. Recent work often tackles this problem using large language models (LLMs) like GPT-3 that can generate new examples given already available ones. In this work, we propose a method to generate more helpful augmented data by utilizing the LLM’s abilities to follow instructions and perform few-shot classification. Our specific PromptMix method consists of two steps: 1) generate challenging text augmentations near class boundaries; however, generating borderline examples increases the risk of false positives in the dataset, so we 2) relabel the text augmentations using a prompting-based LLM classifier to enhance the correctness of labels in the generated data. We evaluate the proposed method in challenging 2-shot and zero-shot settings on four text classification datasets: Banking77, TREC6, Subjectivity (SUBJ), and Twitter Complaints. Our experiments show that generating and, crucially, relabeling borderline examples facilitates the transfer of knowledge of a massive LLM like GPT-3.5-turbo into smaller and cheaper classifiers like DistilBERT-base and BERT-base. Furthermore, 2-shot PromptMix outperforms multiple 5-shot data augmentation methods on the four datasets. Our code is available at https://github.com/ServiceNow/PromptMix-EMNLP-2023.
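The two-step method described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the `mix` parameter, and the generic `llm` callable (e.g., a wrapper around a chat-completion API) are all assumptions made for the sketch.

```python
# Hedged sketch of PromptMix's two steps: (1) generate augmentations near a
# class boundary, (2) relabel them with a prompt-based LLM classifier.
# The `llm` argument is a hypothetical callable mapping a prompt string to a
# completion string; the exact prompts differ from the paper's.

def boundary_generation_prompt(class_a, desc_a, class_b, desc_b, mix=0.75):
    """Step 1: ask the LLM for an example near the boundary of two classes.

    `mix` controls how strongly the example leans toward class_a; values
    near 0.5 yield harder, more ambiguous augmentations.
    """
    return (
        f"Consider the classes below:\n"
        f"- {class_a}: {desc_a}\n"
        f"- {class_b}: {desc_b}\n"
        f"Write a short example that is {int(mix * 100)}% about "
        f"'{class_a}' and {int((1 - mix) * 100)}% about '{class_b}'."
    )

def relabel(text, classes, llm):
    """Step 2: relabel a generated text with a prompt-based LLM classifier,
    since near-boundary generations often drift to the wrong class."""
    prompt = (
        f"Classify the text into one of {classes}.\n"
        f"Text: {text}\nAnswer with the class name only."
    )
    answer = llm(prompt).strip()
    return answer if answer in classes else None  # drop unparseable outputs

def promptmix(seed_classes, llm, n_per_pair=1):
    """Generate near-boundary examples for every ordered class pair,
    then relabel each one before adding it to the augmented set."""
    augmented = []
    names = list(seed_classes)
    for a in names:
        for b in names:
            if a == b:
                continue
            for _ in range(n_per_pair):
                text = llm(boundary_generation_prompt(
                    a, seed_classes[a], b, seed_classes[b]))
                label = relabel(text, names, llm)
                if label is not None:
                    augmented.append((text, label))
    return augmented
```

The resulting `(text, label)` pairs would then be used to fine-tune a small classifier such as DistilBERT-base, distilling the LLM's knowledge into a cheaper model.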
Anthology ID: 2023.emnlp-main.323
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 5316–5327
URL: https://aclanthology.org/2023.emnlp-main.323
DOI: 10.18653/v1/2023.emnlp-main.323
Cite (ACL): Gaurav Sahu, Olga Vechtomova, Dzmitry Bahdanau, and Issam Laradji. 2023. PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5316–5327, Singapore. Association for Computational Linguistics.
Cite (Informal): PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation (Sahu et al., EMNLP 2023)
PDF: https://aclanthology.org/2023.emnlp-main.323.pdf
Video: https://aclanthology.org/2023.emnlp-main.323.mp4