Wojciech Szmyd
2023
TrelBERT: A pre-trained encoder for Polish Twitter
Wojciech Szmyd
|
Alicja Kotyla
|
Michał Zobniów
|
Piotr Falkiewicz
|
Jakub Bartczuk
|
Artur Zygadło
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
Pre-trained Transformer-based models have become immensely popular amongst NLP practitioners. We present TrelBERT – the first Polish language model suited for application in the social media domain. TrelBERT is based on an existing general-domain model and adapted to the language of social media by pre-training it further on a large collection of Twitter data. We demonstrate its usefulness by evaluating it in the downstream task of cyberbullying detection, in which it achieves state-of-the-art results, outperforming larger monolingual models trained on general-domain corpora, as well as multilingual in-domain models, by a large margin. We make the model publicly available. We also release a new dataset for the problem of harmful speech detection.
Search