S-NLP at SemEval-2021 Task 5: An Analysis of Dual Networks for Sequence Tagging

SemEval-2021 Task 5: Toxic Spans Detection is the task of identifying toxic spans in text, which provides a valuable, automatic tool for moderating online content. This paper presents the second-place method for the task, an ensemble of two approaches. One approach combines different embedding methods to extract diverse semantic and syntactic representations of words in context; the other uses extra data with a slightly customized Self-training, a semi-supervised learning technique, for sequence tagging problems. Both architectures take advantage of a strong language model that was fine-tuned on a toxic classification task. Although experimental evidence indicates that the first approach is more effective than the second, combining them leads to our best result of 70.77 F1-score on the test dataset.


Introduction
Social network sites are an integral part of our society. These platforms are often designed to maximize user interaction without sufficient means to moderate such interactions. The number of users being cyber-bullied by toxic comments has reached alarming proportions (Chan et al., 2021). To efficiently maintain the health of online communities, an automatic online-content filtering tool needs to be developed. Numerous previous attempts to resolve this issue have focused on toxic comment classification (Georgakopoulos et al., 2018; Chu et al., 2017; Pham et al., 2020; Risch and Krestel, 2020). Although these classification models are capable of detecting toxic comments, their outputs are not interpretable (Mathew et al., 2020).
On the other hand, Toxic Spans Detection (Pavlopoulos et al., 2021) is the task of locating toxic segments in texts. With such a system, moderators can easily highlight offensive words in comments, which is an essential and explainable aid for automated comment rating. In this paper, we propose two approaches to resolve the task. Our contributions are as follows:
• We investigate the effectiveness of our slightly customized Self-training (Wei et al., 2021) technique for a sequence tagging problem, Toxic Spans Detection.
• We explore the benefits of combining different word representations including Byte Pair Encoding (Sennrich et al., 2015), contextual character-level (Akbik et al., 2018), FastText (Bojanowski et al., 2016) and RoBERTa (Liu et al., 2019) word embeddings in order to utilize different syntactic and semantic information learned by these embedding methods.
• Taking advantage of a pre-trained language model that is well adapted to the toxic domain through a classification task (Unbiased-toxic-RoBERTa (Hanu and Unitary team, 2020)), we successfully integrate our two above-mentioned methods to achieve a high F1-score of 70.77 and rank 2nd at SemEval-2021 Task 5: Toxic Spans Detection.
• We draw several insights into the system's performance through a detailed error analysis.

Related Work

Word representation learning
Word2Vec (Mikolov et al., 2013) is among the earliest models for extracting continuous word representations. Although numerous modern pre-trained text embeddings have outperformed Word2Vec in downstream tasks, it is still widely used due to its simplicity and effectiveness (Akbik et al., 2018). However, Word2Vec fails to handle rare or out-of-vocabulary words. To address this problem, FastText (Bojanowski et al., 2016) learns a word representation as the sum of its character n-gram embeddings. Sennrich et al. (2015), on the other hand, utilize Byte Pair Encoding, an alternative approach for learning sub-word representations. Recent pre-trained language models learn context-sensitive word representations through different pretext tasks, such as autoregressive language modeling (Radford et al., 2019; Akbik et al., 2018) and masked language modeling (Devlin et al., 2018), on a large amount of unlabeled data. Those methods have led to significant improvements in a wide range of downstream tasks, including Text Classification (Howard and Ruder, 2018), Question Answering (Devlin et al., 2018) and Named Entity Recognition (Akbik et al., 2019b).
Unbiased Toxic RoBERTa (Hanu and Unitary team, 2020) is a language model that starts from the general pre-trained RoBERTa (Liu et al., 2019) and continues training it on a toxic comment classification task using the Civil Comments Dataset. This toxic-domain-adapted language model can be successfully applied to the Toxic Spans Detection task, whose data is a subset of Civil Comments.

Self-training
Self-training, a semi-supervised method, incorporates the predictions of teacher models on extra available in-domain unlabeled data into the training of a student model (Wei et al., 2021). Self-training has recently been applied successfully in both Computer Vision and Natural Language Processing tasks, including Image Classification (He et al., 2018), Object Detection (Xie et al., 2020), and Machine Translation (He et al., 2020). Despite its merits, issues such as the lack of in-domain unlabeled data (Du et al., 2020) and unreliable pseudo-labels (Pham et al., 2021) are the main obstacles to the success of Self-training.
For sequence-tagging problems, there are various methods of coping with noisy pseudo-labels. In classification tasks, noisy self-labeled data can be easily eliminated by removing examples with low confidence scores; for sequence labeling, however, there is no comprehensive means of determining such a score for a data point. In recent research, deep reinforcement learning (Chen et al., 2018) and meta-learning (Wang et al., 2020) have been proposed to reduce "error propagation from noisy pseudo-labels" for sequence labeling tasks.

Methodology
In this section, we describe our proposed framework in detail. First, we develop a simple but strong baseline to assess the effectiveness of different backbone models. Subsequently, we build, extend, and customize our two methods on top of the best backbone and baseline model.

Baseline
We treat this task as a word-level binary classification problem even though the dataset is annotated at the character level. Therefore, we first align character annotations to word annotations. We use a straightforward architecture, with a pre-trained language model as the backbone and a simple classifier on top of it. Specifically, let $w = \{w_1, w_2, \dots, w_m\}$ and $y = \{y_1, y_2, \dots, y_m\}$, where $w_i$ and $y_i$ are the word and its label at position $i$, respectively, and let $x = \{x_1, x_2, \dots, x_n\}$, where $x_j$ is the $j$-th subword token. Note that $m$ and $n$ can differ because language models learn subword representations instead of word embeddings. $h = \{h_1, h_2, \dots, h_n\}$ is the set of contextual embeddings for all tokens in $x$ (taken from the last layer's output of the backbone), and $p = \{p_1, p_2, \dots, p_m\}$, where $p_i$ is the set of all subword positions of the word at position $i$. To obtain a word-level embedding, we take the sum of its corresponding subword embeddings.
The probability distribution of the word at position $i$ is then formulated as follows.
Here, $W_c$, $b_c$ and $W_h$, $b_h$ are the learnable weights and biases, respectively. We optimize the model by minimizing the cross-entropy loss between the ground truth and the model predictions.
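The two preprocessing steps described in this section, aligning character annotations to words and summing subword embeddings per word, can be sketched as follows. This is a minimal illustration under stated assumptions: words are split on whitespace, a word is labeled toxic if any of its characters is annotated as toxic, and the function names are hypothetical. The paper does not specify these details.

```python
import re


def char_to_word_labels(text, toxic_offsets):
    """Assumed alignment rule: a word is toxic (1) if any of its characters is toxic."""
    toxic = set(toxic_offsets)
    words, labels = [], []
    for match in re.finditer(r"\S+", text):  # whitespace tokenization (assumption)
        words.append(match.group())
        is_toxic = any(k in toxic for k in range(match.start(), match.end()))
        labels.append(int(is_toxic))
    return words, labels


def pool_subwords(h, p):
    """Word-level embedding: sum the subword vectors h[j] over the positions in p[i]."""
    dim = len(h[0])
    return [[sum(h[j][d] for j in positions) for d in range(dim)] for positions in p]


# Toy example: characters 10..13 ("fool") are annotated as toxic.
words, labels = char_to_word_labels("You are a fool", range(10, 14))

# Toy example: word 0 has one subword, word 1 has two subwords.
h = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
p = [[0], [1, 2]]
word_embeddings = pool_subwords(h, p)
```

The pooled vectors have the same dimension as the subword vectors, so the classifier on top can be applied once per word regardless of how the tokenizer splits it.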

Method 1: Feature-based Learning
We customize and extend the baseline model by constructing a standard Named Entity Recognition model using the Flair package (Akbik et al., 2019a), available at https://github.com/flairNLP/flair, in which each span is an entity encoded in IOB format. The model consists of two parts: an input representation built from diverse embeddings and a feature-based model.

Figure 1: An illustration of our method. We start with a baseline, a simple sequence tagger utilizing Toxic RoBERTa as the backbone. In the Self-training branch, the teacher (the best-scoring baseline) generates soft pseudo-labels for the student to learn. On the other hand, the Feature-based Learning model concatenates the input vector with different embedding methods, i.e. Flair, FastText and BPE, then trains on the Named Entity Recognition task. Predicted character offsets (for each sentence) of the two models are combined using Intersection or Union (Ensemble Section) to obtain the final prediction.

Input Representation
To represent both syntactic and semantic information of a word, we combine embeddings extracted by different word embedding methods. These representations reinforce each other's strengths while mutually offsetting their weaknesses.
These word embeddings and their usage in our work are as follows:
• Flair: The contextual Flair model works at the character level. We fine-tune two models, 'news-forward' and 'news-backward', on the Next Character Prediction task (Akbik et al., 2018) with 600K toxic texts from the Civil Comments Dataset to adapt them to the toxic comment domain.
• Toxic RoBERTa: To use contextual embeddings from Toxic RoBERTa, besides the features derived from the last layer as in our baseline, we concatenate two more layers: the first one (layer 1) and the middle one (layer 6). This choice allows the feature learning model to capture three levels of context-specificity (Ethayarajh, 2019). The final word representation is obtained by taking the sum of its subword embeddings.
• FastText with Byte Pair Embedding: Combining contextual embeddings with static embeddings has been shown in practice to improve the performance of many NLP downstream tasks (Peters et al., 2018). We discard the subword part of a FastText model (pre-trained on the Common Crawl dataset), keep only its word vectors for word representation, and use an external English Byte Pair Embedding to handle out-of-vocabulary words. This combination performs as well as the original FastText while effectively reducing memory usage.
All of the above embeddings are concatenated to form one long vector for each word, which is then fed to a feature learning model.
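The concatenation step can be illustrated with a small, self-contained sketch. Everything here is a toy stand-in: the lookup tables, the dimensions, and the character n-gram averaging used in place of a real Byte Pair Embedding are assumptions for illustration, not the embeddings used in our system. The point is only that a subword-based component still contributes a non-zero vector when the static word lookup misses.

```python
def make_lookup(table, dim):
    """Static word-vector lookup; unknown words map to a zero vector (assumption)."""
    return lambda word: list(table.get(word, [0.0] * dim))


def make_subword_average(table, dim, n=3):
    """Toy stand-in for a subword embedding: average vectors of known character n-grams."""
    def embed(word):
        grams = [word[i:i + n] for i in range(max(1, len(word) - n + 1))]
        known = [table[g] for g in grams if g in table]
        if not known:
            return [0.0] * dim
        return [sum(v[d] for v in known) / len(known) for d in range(dim)]
    return embed


def concat_embeddings(word, embedders):
    """Concatenate the vectors produced by each embedder into one long word vector."""
    vector = []
    for embed in embedders:
        vector.extend(embed(word))
    return vector


word_vectors = make_lookup({"idiot": [1.0, 0.0]}, dim=2)
subword_vectors = make_subword_average({"idi": [0.5, 0.5], "iot": [1.5, 0.5]}, dim=2)
embedders = [word_vectors, subword_vectors]

in_vocab = concat_embeddings("idiot", embedders)        # both parts are informative
out_of_vocab = concat_embeddings("idiotic", embedders)  # word part is zero, subword part is not
```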

Feature Learning
The feature learning part is a sequence-to-sequence model that takes a sequence of word vectors, learns higher-level features, and infers tags. We use a linear layer to reproject the word embeddings onto a vector space whose dimension equals the length of the concatenated word embeddings. Two subsequent BiLSTM (Hochreiter and Schmidhuber, 1997; Dyer et al., 2015) blocks are added to learn high-level semantic-syntactic dependencies of the sequence. Finally, a Conditional Random Fields (Sutton and McCallum, 2010) layer, placed on top of the BiLSTMs, predicts a tag for each word.

Method 2: Self-training With In-domain Unlabeled Data

In-domain data retrieval
In-domain unlabeled data is one of the determining factors for Self-training. The labeled dataset of the Toxic Spans Detection task is a subset of the toxic- and severe-toxic-labeled data in the Civil Comments Dataset (Pavlopoulos et al., 2021). To retrieve additional data, we first select posts classified as toxic by at least half of their toxicity annotators. After removing texts that appear in either the train or trial labeled datasets from the retrieved data, we randomly select a subset of 30,000 unseen texts for the task. The size of the extra dataset was chosen heuristically and is limited by our computing resources.
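The retrieval procedure can be sketched as below. This is an illustration under assumptions: each comment is represented as a dictionary with hypothetical `text` and `toxicity` fields, where `toxicity` is the fraction of annotators who marked the post as toxic, and the function name and random seed are placeholders.

```python
import random


def retrieve_unlabeled(comments, labeled_texts, sample_size, threshold=0.5, seed=0):
    """Keep posts rated toxic by at least `threshold` of annotators, drop texts already
    in the labeled data, then draw a random subset of at most `sample_size` texts."""
    seen = set(labeled_texts)
    candidates = [
        c["text"]
        for c in comments
        if c["toxicity"] >= threshold and c["text"] not in seen
    ]
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    return rng.sample(candidates, min(sample_size, len(candidates)))


# Toy data: "d" is toxic but already labeled, "b" is below the threshold.
comments = [
    {"text": "a", "toxicity": 0.6},
    {"text": "b", "toxicity": 0.4},
    {"text": "c", "toxicity": 0.5},
    {"text": "d", "toxicity": 0.9},
]
unlabeled = retrieve_unlabeled(comments, labeled_texts=["d"], sample_size=30000)
```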

Data filtering and soft label
We slightly customized the pseudo-label distillation process applied in classification tasks (He et al., 2018) for the sequence-tagging problem. Instead of evaluating and selecting each text in the unlabeled data, we use the teacher model's post-softmax class probabilities to evaluate and select each word in context. Specifically, if a word's confidence score is greater than a threshold, we keep backpropagation through that word; otherwise, we ignore it. Note that the probabilities mentioned above serve both as confidence scores and as soft pseudo-labels for training the student.
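A minimal sketch of this word-level filtering is given below. It assumes that a word's confidence score is the maximum of the teacher's class probabilities and that the loss is averaged over the kept words; both choices, the threshold value, and the function name are assumptions made for illustration rather than details stated in the text.

```python
import math


def masked_soft_cross_entropy(student_probs, teacher_probs, threshold):
    """Cross-entropy against the teacher's soft labels, counting only words whose
    teacher confidence (assumed: max class probability) exceeds the threshold."""
    total, kept = 0.0, 0
    for s, t in zip(student_probs, teacher_probs):
        if max(t) <= threshold:
            continue  # low-confidence word: contributes no gradient
        total += -sum(t_k * math.log(max(s_k, 1e-12)) for s_k, t_k in zip(s, t))
        kept += 1
    loss = total / kept if kept else 0.0
    return loss, kept


# Toy sentence of three words with [non-toxic, toxic] probabilities.
teacher = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90]]
student = [[0.90, 0.10], [0.50, 0.50], [0.20, 0.80]]
loss, kept = masked_soft_cross_entropy(student, teacher, threshold=0.8)
```

In the toy sentence, the second word is skipped because the teacher is uncertain about it, so only the first and third words contribute to the loss.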

Combine generated-labeled data with original-labeled data
The student model is trained on a combination of the original labeled dataset and the synthetically labeled dataset. It has the same architecture as the teacher model, except for higher dropout rates in the dropout layers and a larger hidden size in the model's head classifier.
We chose the best checkpoint of the baseline model as the teacher model.

Post-processing
For each contiguous span predicted as toxic, we remove any punctuation at its beginning and end. Additionally, to partially prevent our model from predicting the common targets of toxic comments as toxic spans, we exclude any predicted span that appears in our predefined list of targets (described in detail in the Appendices section). This list is based on the identity-targets list of toxic comments in the Civil Comments Dataset.
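The two post-processing rules can be sketched as follows. Spans are represented as (start, end) character pairs with an exclusive end; the case-insensitive match against the target list and the one-entry placeholder list are assumptions for illustration and do not reproduce the actual list.

```python
import string


def postprocess_spans(text, spans, targets):
    """Trim punctuation at both ends of each predicted span, then drop spans
    whose remaining text is in the target list (case-insensitive: an assumption)."""
    cleaned = []
    for start, end in spans:
        while start < end and text[start] in string.punctuation:
            start += 1
        while end > start and text[end - 1] in string.punctuation:
            end -= 1
        if start == end:
            continue  # the span contained only punctuation
        if text[start:end].lower() in targets:
            continue  # the span is a listed target, not a toxic expression
        cleaned.append((start, end))
    return cleaned


text = 'Those "idiots!" and liberals.'
predicted = [(6, 15), (20, 29)]  # '"idiots!"' and 'liberals.'
placeholder_targets = {"liberals"}  # hypothetical entry, not the real list
kept = postprocess_spans(text, predicted, placeholder_targets)
kept_texts = [text[s:e] for s, e in kept]
```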

Ensemble Learning
We combine our two approaches by taking the intersection (Intersection Ensemble) or the union (Union Ensemble) of the character offsets predicted by the best model from each method, to obtain the final offsets for each sentence:
$$S_I = S_1 \cap S_2, \qquad S_U = S_1 \cup S_2$$
where $S_I$, $S_U$, $S_1$, and $S_2$ are the intersection, union, Feature-based Learning, and Self-training offset predictions for one sentence, respectively. Figure 1 illustrates our composed framework: the two approaches, built and extended on top of the baseline, are combined for the final predictions.
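Since the predictions are sets of character offsets, both ensembles reduce to plain set operations. A minimal sketch, with hypothetical function names and toy offsets:

```python
def intersection_ensemble(offsets_1, offsets_2):
    """Keep only the character offsets predicted as toxic by both models."""
    return sorted(set(offsets_1) & set(offsets_2))


def union_ensemble(offsets_1, offsets_2):
    """Keep every character offset predicted as toxic by at least one model."""
    return sorted(set(offsets_1) | set(offsets_2))


feature_based = [0, 1, 2, 3, 10, 11]  # toy offsets for one sentence
self_training = [2, 3, 4, 10, 11]
s_i = intersection_ensemble(feature_based, self_training)
s_u = union_ensemble(feature_based, self_training)
```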

Dataset
The original dataset contains 7939 annotated samples for training and 2000 unlabeled samples for testing. We use a small trial dataset of 690 labeled samples, provided by the task organizers, as our development set. We train our models on the training set, use the development set to find the best hyper-parameters, and finally make our submission on the private test set.

Experiment setup
The hyper-parameter configurations of our two methods are given in the Appendices section.

System Configuration
Our experiments are conducted on a computer with an Intel Core i7 9700K (Turbo 4.9GHz), 32GB of RAM, a GeForce RTX 2080 Ti GPU, and a 1TB SSD.

Evaluation Metric
The evaluation metric of our system is defined by the task organizers (Pavlopoulos et al., 2021) as follows:
$$F_1^t(A_i, G) = \frac{2 \cdot P^t(A_i, G) \cdot R^t(A_i, G)}{P^t(A_i, G) + R^t(A_i, G)}$$
$$P^t(A_i, G) = \frac{|S^t_{A_i} \cap G^t|}{|S^t_{A_i}|}, \qquad R^t(A_i, G) = \frac{|S^t_{A_i} \cap G^t|}{|G^t|}$$
where:
• $S^t_{A_i}$: character offsets of toxic post $t$, output of system $A_i$
• $G^t$: ground-truth character offsets of toxic post $t$
• $P^t(A_i, G)$, $R^t(A_i, G)$: precision and recall of system $A_i$ with respect to the ground truth $G^t$ of post $t$
• $F_1^t(A_i, G)$: $F_1$ score of system $A_i$ with respect to the ground truth $G^t$ of post $t$

Table 1 shows the performance of our baseline model with two different backbones, RoBERTa (Liu et al., 2019) and Unbiased Toxic RoBERTa (Hanu and Unitary team, 2020), which is referred to as Toxic RoBERTa for the rest of the paper. The toxic domain-adaptive pre-trained language model outperforms general RoBERTa by a large margin (up to 0.68), which sheds light on the necessity of adapting universal representations to task-specific domains.

Table 2 shows the feature-based model's performance with different word embeddings and the gap in F1-score between the feature-based and fine-tuning models. Our findings are as follows: Toxic RoBERTa was the best feature extractor, achieving a competitive F1-score of 69.89 on its own. On the other hand, using only Flair results in slightly better performance than FastText with BPE (67.92 and 67.89 respectively).
Adding more features (learned by Flair or FastText with BPE embeddings) to those learned by Toxic RoBERTa improved the F1-score (69.99 and 69.95 respectively). Ultimately, combining all the word representations obtained the highest score, 70.26.
Fine-tuning RoBERTa dramatically decreased the performance (by up to 3-4).

Table 3 presents the results of the second method. Our choice of teacher was the best-performing baseline model, with a 69.30 F1-score. Post-processing enhanced this performance, resulting in a 69.44 F1-score. Self-training yields only a slightly better student, with an improvement of 0.1 over the post-processed teacher model. We suspect that this unimpressive increase is due to the teacher model's confirmation bias and the unsolved issue of noisy pseudo-labels (Pham et al., 2021).

Table 3: Performance of the teacher model with and without post-processing, and of the student model.

Table 4 illustrates the effectiveness of our ensemble methods. Intersection Ensemble results in a significant improvement of our system's predictions (0.51 and 1.23 compared to Feature-based Learning and Self-training respectively), while Union Ensemble leads to a substantial decrease in F1-score (-1.14 and -0.42 compared to methods 1 and 2 respectively). This finding indicates that Intersection Ensemble can rule out numerous false positive tokens of our two models, whereas Union Ensemble worsens the performance by integrating these false positives.
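The effect described above can be reproduced on a toy post using the per-post character-offset F1 of the task. The sketch below is illustrative only: the offsets are invented so that each model adds different false positives, and the handling of an empty ground truth (score 1 if the prediction is also empty, 0 otherwise) is stated here as an assumption about the task's convention.

```python
def span_f1(predicted, gold):
    """Per-post F1 between predicted and ground-truth toxic character offsets."""
    predicted, gold = set(predicted), set(gold)
    if not gold:
        # Assumed convention for posts without toxic spans.
        return 1.0 if not predicted else 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = set(range(10, 15))       # toy ground truth: offsets 10..14
model_1 = gold | {20, 21, 22}   # correct span plus three false positives
model_2 = gold | {30, 31}       # correct span plus two different false positives

f1_model_1 = span_f1(model_1, gold)
f1_model_2 = span_f1(model_2, gold)
f1_intersection = span_f1(model_1 & model_2, gold)  # false positives cancel out
f1_union = span_f1(model_1 | model_2, gold)         # false positives accumulate
```

When the two models make different false-positive errors, the intersection removes them while the union keeps all of them, which mirrors the direction of the changes reported in Table 4.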

Error Analysis
Carefully analyzing the errors made by our ensemble model on the test dataset sheds light on our system's weaknesses. The following terms are defined to aid the reading of this section:
• Toxic span: ground-truth toxic span.
• Long span: contains at least two words.
• True positive span: all words in the span are correctly predicted as toxic.
• False negative span: at least one word in the toxic span is predicted as non-toxic.

Unsuccessful detection of lengthy toxic spans
Among the false-negative spans in our system's predictions, 218 are short spans, which constitute a small portion of all short spans in the test set. In contrast, 158 are long spans, out of only 200 long spans in the test set. Our model therefore performs poorly on lengthy toxic spans, with a false-negative rate of 79%. Figure 2 further investigates the long-span failures. While 90% (38 of 42) of our correctly predicted long spans contain only two words, this share drops to 62% (98 of 158) for false-negative long spans: longer spans are harder to predict. Failing to detect spans of three or more words accounts for approximately 38% of the model's long-span false negatives. Table 5 shows several instances of this failure.
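The percentages in this paragraph follow directly from the reported counts, as the short check below shows. The variable names are ours; the counts are those given in the text.

```python
long_total = 200           # long spans in the test set
long_false_negative = 158  # long spans with at least one missed word
long_true_positive = long_total - long_false_negative

false_negative_rate = long_false_negative / long_total

two_word_true_positive = 38   # correctly predicted long spans with exactly two words
two_word_false_negative = 98  # false-negative long spans with exactly two words

share_two_word_tp = two_word_true_positive / long_true_positive
share_two_word_fn = two_word_false_negative / long_false_negative
share_longer_fn = 1 - share_two_word_fn  # false-negative long spans with three or more words

for name, value in [
    ("false-negative rate on long spans", false_negative_rate),
    ("two-word share among true positives", share_two_word_tp),
    ("two-word share among false negatives", share_two_word_fn),
    ("three-or-more-word share among false negatives", share_longer_fn),
]:
    print(f"{name}: {value:.0%}")
```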

False negatives due to post-processing
Although post-processing proved effective in our experiments, applying it after Self-training inevitably excluded contiguous predicted spans such as "NIGGERS" and "MUSLIMS", which are potentially part of ground-truth toxic spans.

Ground-truth | Prediction
Bunch of cowards | cowards
burn them at the stake | burn them
big, fat, orange buffoon | None
you're a fecking write-off | None
Table 5: Several instances of errors in detecting long spans

I wish this moron would have been shot to death by the US soldier instead of the other way around. I hope that 10 million is given to the soldier's family who actually deserve it.
Our ignorant, clueless legislators are working hard to ensure that Alaskans are as ignorant and clueless as the legislators themselves are. Stupid Alaskans are the only way these moronic legislators will continue to get re-elected.
Trash, and just not the regular bigoted flatulence, this crap you write is evil trash
Note: Underlines are the predictions of our models and bold text marks our manually annotated toxic spans.

Failure due to mis-annotated spans
We notice that our model predicted false positive tokens in 469 toxic comments, and most of them (308 comments) were annotated by humans as having no toxic spans. In our opinion, many of these texts are mis-annotated, which potentially lowers the precision of our system. Table 6 presents several examples of this issue. The underlines are our model's predictions, while the bold text spans are our opinion of what the toxic annotations should be for the given text. All these texts contain no toxic spans according to the dataset's annotators.

Conclusion
In this paper, we proposed a system to resolve SemEval-2021 Task 5: Toxic Spans Detection. Our method utilized a pre-trained language model adapted to the toxic domain and successfully combined two approaches, Self-training and Feature-based Learning, to achieve a high F1-score of 70.77. Finally, through careful error analysis, we provided insights into the failures of the system and into the task's potential issue of false-negative annotations.
Despite our success on the leaderboard, in future research we aim to improve our model as follows:
• Investigate a solution for the noisy pseudo-label issue to enhance the performance of the Self-training method.
• Combine Self-training with Feature-based Learning to learn a more robust toxic-span detection model.

A Appendices
To form our target list in 3.