TF-IDF Character N-grams versus Word Embedding-based Models for Fine-grained Event Classification: A Preliminary Study

Jakub Piskorski; Guillaume Jacquet

Abstract

Automating the detection of event mentions in online texts and their classification against domain-specific event type taxonomies has been acknowledged by many organisations worldwide to be of paramount importance for facilitating intelligence gathering. This paper reports on preliminary experiments comparing various linguistically lightweight approaches to fine-grained event classification based on short text snippets reporting on events. In particular, we compare the performance of an SVM trained on TF-IDF-weighted character n-grams against SVMs trained on various off-the-shelf pre-trained word embeddings (GloVe, BERT, FastText) as features. We exploit a relatively large event corpus consisting of circa 610K short text event descriptions classified into 25 event categories that cover political violence and protest events. The best results, i.e., 83.5% macro and 92.4% micro F1 score, were obtained using the TF-IDF-weighted character n-gram model.
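To make the feature side of the character n-gram approach concrete, the following is a minimal pure-Python sketch of TF-IDF-weighted character n-gram vectors. It is an illustration under stated assumptions, not the authors' implementation: the n-gram range (2 to 4), the whitespace padding, the smoothed IDF formula, the toy snippets, and all function names are chosen here for illustration. The paper trains an SVM on features of this kind; the sketch omits the classifier and uses cosine similarity only to show that such vectors capture surface overlap between spelling variants.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams of length n_min..n_max from lowercased text."""
    padded = f" {text.lower()} "  # pad so word boundaries become part of n-grams
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def fit_idf(snippets):
    """Compute a smoothed inverse document frequency for every n-gram in the corpus."""
    doc_freq = Counter()
    for snippet in snippets:
        doc_freq.update(set(char_ngrams(snippet)))  # count each n-gram once per snippet
    n_docs = len(snippets)
    # Assumed smoothing: ln((1 + N) / (1 + df)) + 1, which is always positive.
    return {g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Map text to an L2-normalised sparse TF-IDF vector (dict: n-gram -> weight)."""
    counts = Counter(g for g in char_ngrams(text) if g in idf)  # drop unseen n-grams
    weights = {g: tf * idf[g] for g, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(w * v.get(g, 0.0) for g, w in u.items())


# Toy snippets invented for illustration; not taken from the paper's corpus.
corpus = [
    "Protesters clashed with police in the capital",
    "Gunmen attacked a military convoy near the border",
    "Students held a peaceful demonstration downtown",
]
idf = fit_idf(corpus)
query = tfidf_vector("Police clash with protestors", idf)
similarities = [cosine(query, tfidf_vector(s, idf)) for s in corpus]
best_match = max(range(len(corpus)), key=similarities.__getitem__)
```

Because the query shares many sub-word fragments with the first snippet ("prot", "test", "clas", "poli") despite the different word forms, `best_match` points at that snippet. A word-level bag-of-words representation would treat "clash"/"clashed" and "protestors"/"Protesters" as unrelated tokens.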

Anthology ID: 2020.aespen-1.6
Volume: Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020
Month: May
Year: 2020
Address: Marseille, France
Editors: Ali Hürriyetoğlu, Erdem Yörük, Vanni Zavarella, Hristo Tanev
Venue: AESPEN
Publisher: European Language Resources Association (ELRA)
Pages: 26–34
Language: English
URL: https://aclanthology.org/2020.aespen-1.6/
Cite (ACL): Jakub Piskorski and Guillaume Jacquet. 2020. TF-IDF Character N-grams versus Word Embedding-based Models for Fine-grained Event Classification: A Preliminary Study. In Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020, pages 26–34, Marseille, France. European Language Resources Association (ELRA).
Cite (Informal): TF-IDF Character N-grams versus Word Embedding-based Models for Fine-grained Event Classification: A Preliminary Study (Piskorski & Jacquet, AESPEN 2020)
PDF: https://aclanthology.org/2020.aespen-1.6.pdf
