LZ1904 at SemEval-2021 Task 5: Bi-LSTM-CRF for Toxic Span Detection using Pretrained Word Embedding

Liang Zou, Wen Li


Abstract
Recurrent Neural Networks (RNN) have been widely used in various Natural Language Processing (NLP) tasks such as text classification, sequence tagging, and machine translation. Long Short Term Memory (LSTM), a special unit of RNN, has the benefit of memorizing past and even future information in a sentence (especially for bidirectional LSTM). In the shared task of detecting spans which make texts toxic, we first apply pretrained word embedding (GloVe) to generate the word vectors after tokenization. And then we construct Bidirectional Long Short Term Memory-Conditional Random Field (Bi-LSTM-CRF) model by Baidu research to predict whether each word in the sentence is toxic or not. We tune hyperparameters of dropout rate, number of LSTM units, embedding size with 10 epochs and choose the best epoch with validation recall. Our model achieves an F1 score of 66.99 percent in test dataset.
Anthology ID:
2021.semeval-1.138
Volume:
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
Month:
August
Year:
2021
Address:
Online
Venues:
ACL | IJCNLP | SemEval
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Note:
Pages:
1009–1014
Language:
URL:
https://aclanthology.org/2021.semeval-1.138
DOI:
10.18653/v1/2021.semeval-1.138
Bibkey:
Copy Citation:
PDF:
https://aclanthology.org/2021.semeval-1.138.pdf