Boosting Text Classification Performance on Sexist Tweets by Text Augmentation and Text Generation Using a Combination of Knowledge Graphs

Sima Sharifirad, Borna Jafarpour, Stan Matwin


Abstract
Text classification models have been heavily utilized for a slew of interesting natural language processing problems. Like any other machine learning model, these classifiers depend heavily on the size and quality of the training dataset. Insufficient and imbalanced datasets lead to poor performance. One solution to poor datasets is to use world knowledge, in the form of knowledge graphs, to improve the training data. In this paper, we use ConceptNet and Wikidata to improve sexist tweet classification by two methods: (1) text augmentation and (2) text generation. In our text generation approach, we generate new tweets by replacing words using data acquired from ConceptNet relations in order to increase the size of our training set. This method is very helpful with frustratingly small datasets, preserves the label, and increases diversity. In our text augmentation approach, the number of tweets in each class remains the same, but each tweet is extended by concatenating words extracted from its ConceptNet relations and their descriptions extracted from Wikidata. Our experiments show that our approach improves sexist tweet classification significantly across all of our machine learning models. Our approach can be readily applied to other small datasets, such as those for hate speech or abusive language, and to any text classification problem using any machine learning model.
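The two methods described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionaries `TOY_RELATIONS` and `TOY_DESCRIPTIONS` are hypothetical stand-ins for terms that would be retrieved from ConceptNet relations and Wikidata descriptions, the function names are invented for illustration, and the example sentence is deliberately neutral. Replacing one word at a time is an assumption; the abstract does not state how many words are replaced per generated tweet.

```python
# Hypothetical stand-ins for knowledge-graph lookups (not real ConceptNet/Wikidata data).
TOY_RELATIONS = {"dog": ["canine", "puppy"], "ball": ["sphere"]}
TOY_DESCRIPTIONS = {"dog": "domesticated animal"}


def generate_tweets(tweet, relations, max_new=5):
    """Text generation: build new tweets by swapping one word for a related term.

    Each generated tweet is meant to inherit the label of the original.
    """
    tokens = tweet.split()
    generated = []
    for i, token in enumerate(tokens):
        for related in relations.get(token.lower(), []):
            generated.append(" ".join(tokens[:i] + [related] + tokens[i + 1:]))
            if len(generated) >= max_new:
                return generated
    return generated


def augment_tweet(tweet, relations, descriptions):
    """Text augmentation: keep the tweet and concatenate related terms and descriptions."""
    extra = []
    for token in tweet.split():
        key = token.lower()
        extra.extend(relations.get(key, []))
        if key in descriptions:
            extra.append(descriptions[key])
    return " ".join([tweet] + extra)


if __name__ == "__main__":
    example = "the dog chased a ball"
    for new_tweet in generate_tweets(example, TOY_RELATIONS):
        print(new_tweet)
    print(augment_tweet(example, TOY_RELATIONS, TOY_DESCRIPTIONS))
```

In this sketch, generation increases the number of training examples, while augmentation leaves the count unchanged and lengthens each example, matching the distinction the abstract draws between the two methods.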
Anthology ID:
W18-5114
Volume:
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
Month:
October
Year:
2018
Address:
Brussels, Belgium
Venues:
ALW | EMNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
107–114
URL:
https://aclanthology.org/W18-5114
DOI:
10.18653/v1/W18-5114
Cite (ACL):
Sima Sharifirad, Borna Jafarpour, and Stan Matwin. 2018. Boosting Text Classification Performance on Sexist Tweets by Text Augmentation and Text Generation Using a Combination of Knowledge Graphs. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 107–114, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Boosting Text Classification Performance on Sexist Tweets by Text Augmentation and Text Generation Using a Combination of Knowledge Graphs (Sharifirad et al., 2018)
PDF:
https://aclanthology.org/W18-5114.pdf
Data
ConceptNet