TrelBERT: A pre-trained encoder for Polish Twitter

Wojciech Szmyd, Alicja Kotyla, Michał Zobniów, Piotr Falkiewicz, Jakub Bartczuk, Artur Zygadło


Abstract
Pre-trained Transformer-based models have become immensely popular amongst NLP practitioners. We present TrelBERT – the first Polish language model suited for application in the social media domain. TrelBERT is based on an existing general-domain model and adapted to the language of social media by pre-training it further on a large collection of Twitter data. We demonstrate its usefulness by evaluating it in the downstream task of cyberbullying detection, in which it achieves state-of-the-art results, outperforming larger monolingual models trained on general-domain corpora, as well as multilingual in-domain models, by a large margin. We make the model publicly available. We also release a new dataset for the problem of harmful speech detection.
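
Since the model is released publicly, a typical way to try it out is through the Hugging Face transformers library. The sketch below is a minimal, assumed usage example: the repository identifier "deepsense-ai/trelBERT" and the example sentence are assumptions, not taken from the paper, so check the released model card for the exact name. Fine-tuning for the cyberbullying-detection task described above would instead attach a sequence-classification head (e.g. AutoModelForSequenceClassification).

# Minimal sketch: loading TrelBERT for masked-token prediction with Hugging Face transformers.
# The repository name "deepsense-ai/trelBERT" is an assumption; consult the model card.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_name = "deepsense-ai/trelBERT"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Fill-mask example on a Polish, tweet-like sentence; use the tokenizer's own mask token.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = f"Dzisiaj jest piękna {tokenizer.mask_token} w Warszawie."
for pred in fill(text):
    print(pred["token_str"], round(pred["score"], 3))
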
Anthology ID:
2023.bsnlp-1.3
Volume:
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Jakub Piskorski, Michał Marcińczuk, Preslav Nakov, Maciej Ogrodniczuk, Senja Pollak, Pavel Přibáň, Piotr Rybak, Josef Steinberger, Roman Yangarber
Venue:
BSNLP
Publisher:
Association for Computational Linguistics
Pages:
17–24
URL:
https://aclanthology.org/2023.bsnlp-1.3
DOI:
10.18653/v1/2023.bsnlp-1.3
Cite (ACL):
Wojciech Szmyd, Alicja Kotyla, Michał Zobniów, Piotr Falkiewicz, Jakub Bartczuk, and Artur Zygadło. 2023. TrelBERT: A pre-trained encoder for Polish Twitter. In Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023), pages 17–24, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
TrelBERT: A pre-trained encoder for Polish Twitter (Szmyd et al., BSNLP 2023)
PDF:
https://aclanthology.org/2023.bsnlp-1.3.pdf
Video:
https://aclanthology.org/2023.bsnlp-1.3.mp4