Data Augmentation for Multiclass Utterance Classification – A Systematic Study

Binxia Xu, Siyuan Qiu, Jie Zhang, Yafang Wang, Xiaoyu Shen, Gerard de Melo


Abstract
Utterance classification is a key component in many conversational systems. However, classifying real-world user utterances is challenging, as people may express their ideas and thoughts in manifold ways, and the amount of training data for some categories may be fairly limited, resulting in imbalanced data distributions. To alleviate these issues, we conduct a comprehensive survey of data augmentation approaches for text classification that cope with imbalanced data, including simple random resampling, word-level transformations, and neural text generation. Our experiments focus on multi-class datasets with a large number of data samples, a setting that has not been systematically studied in previous work. The results show that the effectiveness of different data augmentation schemes depends on the nature of the dataset under consideration.
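As a rough illustration of the simplest family named in the abstract, random resampling, the sketch below duplicates randomly chosen minority-class utterances until every class matches the size of the largest one. It is not the authors' implementation: the function name `random_oversample`, the toy utterances, and the labels are all assumptions made up for this example.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Hypothetical helper: balance classes by duplicating random examples.

    Every class is padded with randomly drawn copies of its own examples
    until it has as many examples as the largest class.
    """
    rng = random.Random(seed)

    # Group the utterances by their class label.
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    # The largest class sets the target size for all classes.
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw (with replacement) as many extra copies as the class is missing.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy, invented data: three "travel" utterances and one "weather" utterance.
texts = ["book a flight", "cancel my flight", "change my seat", "what is the weather"]
labels = ["travel", "travel", "travel", "weather"]

aug_texts, aug_labels = random_oversample(texts, labels)
print(Counter(aug_labels))
```

Resampling of this kind adds no new wording, only repeated examples, which is one reason the abstract also considers word-level transformations and neural text generation.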
Anthology ID:
2020.coling-main.479
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Donia Scott, Nuria Bel, Chengqing Zong
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
5494–5506
URL:
https://aclanthology.org/2020.coling-main.479
DOI:
10.18653/v1/2020.coling-main.479
Cite (ACL):
Binxia Xu, Siyuan Qiu, Jie Zhang, Yafang Wang, Xiaoyu Shen, and Gerard de Melo. 2020. Data Augmentation for Multiclass Utterance Classification – A Systematic Study. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5494–5506, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
Data Augmentation for Multiclass Utterance Classification – A Systematic Study (Xu et al., COLING 2020)
PDF:
https://aclanthology.org/2020.coling-main.479.pdf
Data
CoQA