An Empirical Study on Pre-trained Embeddings and Language Models for Bot Detection

Andres Garcia-Silva, Cristian Berrio, José Manuel Gómez-Pérez


Abstract
Fine-tuning pre-trained language models has significantly advanced the state of the art in a wide range of downstream NLP tasks. Usually, such language models are learned from large and well-formed text corpora, e.g. from encyclopedic resources, books, or news. However, a significant amount of the text to be analyzed nowadays is Web data, often from social media. In this paper we consider the research question: how well do standard pre-trained language models generalize and capture the peculiarities of the rather short, informal, and frequently automatically generated text found in social media? To answer this question, we focus on bot detection in Twitter as our evaluation task and test the performance of fine-tuning approaches based on language models against popular neural architectures such as LSTMs and CNNs combined with pre-trained and contextualized embeddings. Our results show strong performance variations among the different language model approaches, which suggests further research.
Anthology ID:
W19-4317
Volume:
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)
Month:
August
Year:
2019
Address:
Florence, Italy
Venues:
ACL | RepL4NLP | WS
SIG:
SIGREP
Publisher:
Association for Computational Linguistics
Pages:
148–155
URL:
https://aclanthology.org/W19-4317
DOI:
10.18653/v1/W19-4317
Cite (ACL):
Andres Garcia-Silva, Cristian Berrio, and José Manuel Gómez-Pérez. 2019. An Empirical Study on Pre-trained Embeddings and Language Models for Bot Detection. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 148–155, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
An Empirical Study on Pre-trained Embeddings and Language Models for Bot Detection (Garcia-Silva et al., 2019)
PDF:
https://aclanthology.org/W19-4317.pdf