SkoltechNLP at SemEval-2021 Task 5: Leveraging Sentence-level Pre-training for Toxic Span Detection

David Dale, Igor Markov, Varvara Logacheva, Olga Kozlova, Nikita Semenov, Alexander Panchenko


Abstract
This work describes the participation of the Skoltech NLP group team (Sk) in the Toxic Spans Detection task at SemEval-2021. The goal of the task is to identify the most toxic fragments of a given sentence, which is a binary sequence tagging problem. We show that fine-tuning a RoBERTa model for this problem is a strong baseline. This baseline can be further improved by pre-training the RoBERTa model on a large dataset labeled for toxicity at the sentence level. While our solution scored among the top 20% of participating models, it is only 2 points below the best result, which suggests the viability of our approach.
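Although the model is trained as a binary token tagger, the task's official output format is a list of toxic character offsets for each post. Below is a minimal sketch of that post-processing step; the function and variable names are illustrative and not taken from the authors' code.

```python
def tags_to_char_offsets(token_spans, tags):
    """Convert binary token tags to the task's output format:
    a sorted list of toxic character offsets.

    token_spans: list of (start, end) character spans, one per token
    tags: list of 0/1 labels, one per token (1 = toxic)
    """
    offsets = set()
    for (start, end), tag in zip(token_spans, tags):
        if tag == 1:
            # Mark every character position covered by a toxic token
            offsets.update(range(start, end))
    return sorted(offsets)


# Toy example: the last token is predicted toxic
text = "you are an idiot"
token_spans = [(0, 3), (4, 7), (8, 10), (11, 16)]
tags = [0, 0, 0, 1]
print(tags_to_char_offsets(token_spans, tags))  # [11, 12, 13, 14, 15]
```

In practice the `(start, end)` token spans would come from the tokenizer's offset mapping (e.g. the `return_offsets_mapping` option of Hugging Face fast tokenizers), so the character offsets refer back to the original, untokenized text.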
Anthology ID:
2021.semeval-1.126
Volume:
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
Month:
August
Year:
2021
Address:
Online
Venues:
ACL | IJCNLP | SemEval
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
927–934
URL:
https://aclanthology.org/2021.semeval-1.126
DOI:
10.18653/v1/2021.semeval-1.126
PDF:
https://aclanthology.org/2021.semeval-1.126.pdf
Optional supplementary material:
 2021.semeval-1.126.OptionalSupplementaryMaterial.zip