Sparse Victory – A Large Scale Systematic Comparison of count-based and prediction-based vectorizers for text classification

Rupak Chakraborty, Ashima Elhence, Kapil Arora


Abstract
In this paper we study the performance of several text vectorization algorithms on a diverse collection of 73 publicly available datasets. Traditional sparse vectorizers like Tf-Idf and Feature Hashing have been systematically compared with the latest state-of-the-art neural word embeddings like Word2Vec, GloVe, and FastText and character embeddings like ELMo and Flair. We have carried out an extensive analysis of the performance of these vectorizers across different dimensions such as classification metrics (i.e., precision, recall, accuracy), dataset size, and imbalanced data (in terms of the distribution of the number of class labels). Our experiments reveal that the sparse vectorizers beat the neural word and character embedding models on 61 of the 73 datasets by an average margin of 3-5% (in terms of macro F1 score), and this performance is consistent across the different dimensions of comparison.
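As a rough illustration of the two sparse vectorizers and the macro F1 metric named in the abstract, the following is a minimal standard-library sketch. It is not the authors' implementation: the function names (`tokenize`, `tfidf_vectors`, `hashed_vector`, `macro_f1`), the whitespace tokenizer, the smoothed idf variant, the MD5-based bucket hash, and the bucket count are all assumptions chosen for the example.

```python
import hashlib
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf, ln((1 + n) / (1 + df)) + 1, one common variant.
        vectors.append({
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def hashed_vector(text, n_buckets=16):
    """Return a sparse {bucket_index: count} dict via the hashing trick."""
    vec = Counter()
    for tok in tokenize(text):
        # MD5 is used only as a stable hash; Python's hash() is salted per run.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:8], "big") % n_buckets] += 1
    return dict(vec)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(tfidf_vectors(["good movie", "bad movie"]))
    print(hashed_vector("good movie good plot"))
    # A majority-class predictor on a 9:1 imbalanced label set.
    print(macro_f1(["a"] * 9 + ["b"], ["a"] * 10))
```

The last call hints at why macro F1 is a natural choice when class imbalance is one of the dimensions of comparison: a predictor that always outputs the majority class is 90% accurate on a 9:1 label split, yet its macro F1 is only 9/19 because the minority class contributes an F1 of zero.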
Anthology ID:
R19-1022
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
188–197
URL:
https://aclanthology.org/R19-1022
DOI:
10.26615/978-954-452-056-4_022
Cite (ACL):
Rupak Chakraborty, Ashima Elhence, and Kapil Arora. 2019. Sparse Victory – A Large Scale Systematic Comparison of count-based and prediction-based vectorizers for text classification. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 188–197, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Sparse Victory – A Large Scale Systematic Comparison of count-based and prediction-based vectorizers for text classification (Chakraborty et al., RANLP 2019)
PDF:
https://aclanthology.org/R19-1022.pdf
Code:
opennlp/Large-Scale-Text-Classification