A Dutch Dataset for Cross-lingual Multilabel Toxicity Detection

Ben Burtenshaw, Mike Kestemont


Abstract
Multi-label toxicity detection is an active research area, with many research groups, companies, and individuals engaging with it through shared tasks and dedicated venues. This paper describes a cross-lingual approach to annotating a newly developed Dutch-language dataset for multi-label text classification, using a model trained on English data. We present an ensemble of one Transformer model and an LSTM using multilingual embeddings. Combining multilingual embeddings with the Transformer model improves performance in a cross-lingual setting.
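As a rough illustration of what a two-model ensemble for multi-label classification can look like, the sketch below averages per-label probabilities from two models and thresholds each label independently. The abstract does not state how the Transformer and LSTM outputs are combined, so the averaging rule, the 0.5 threshold, the label names, and the function name average_ensemble are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List

# Hypothetical label set; the abstract does not list the dataset's labels.
LABELS = ["toxic", "insult", "threat"]


def average_ensemble(
    transformer_probs: List[float],
    lstm_probs: List[float],
    threshold: float = 0.5,
) -> Dict[str, bool]:
    """Average per-label probabilities from two models, then threshold.

    Assumption: simple unweighted averaging. In multi-label classification
    each label is decided independently, so several labels (or none) can
    be assigned to the same text.
    """
    if not (len(transformer_probs) == len(lstm_probs) == len(LABELS)):
        raise ValueError("each model must give one probability per label")
    averaged = [(t + l) / 2 for t, l in zip(transformer_probs, lstm_probs)]
    return {label: p >= threshold for label, p in zip(LABELS, averaged)}


# Made-up probabilities standing in for the outputs of the two models.
prediction = average_ensemble([0.9, 0.4, 0.1], [0.7, 0.7, 0.2])
print(prediction)
```

In this toy input the two models disagree on the second label, and the averaged score decides it. A weighted average or a learned combiner would be equally plausible design choices.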
Anthology ID:
2021.bucc-1.10
Volume:
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
Month:
September
Year:
2021
Address:
Online (Virtual Mode)
Editors:
Reinhard Rapp, Serge Sharoff, Pierre Zweigenbaum
Venue:
BUCC
Publisher:
INCOMA Ltd.
Pages:
75–79
URL:
https://aclanthology.org/2021.bucc-1.10
Cite (ACL):
Ben Burtenshaw and Mike Kestemont. 2021. A Dutch Dataset for Cross-lingual Multilabel Toxicity Detection. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), pages 75–79, Online (Virtual Mode). INCOMA Ltd.
Cite (Informal):
A Dutch Dataset for Cross-lingual Multilabel Toxicity Detection (Burtenshaw & Kestemont, BUCC 2021)
PDF:
https://aclanthology.org/2021.bucc-1.10.pdf