UAntwerp at SemEval-2021 Task 5: Spans are Spans, stacking a binary word level approach to toxic span detection

This paper describes the system developed by the Antwerp Centre for Digital humanities and literary Criticism [UAntwerp] for toxic span detection. We used a stacked generalisation ensemble of five component models, with two distinct interpretations of the task. Two models attempted to predict binary word toxicity based on n-gram sequences, whilst three categorical span-based models were trained to predict toxic token labels based on complete token sequences. The five models' predictions were ensembled within an LSTM model. As well as describing the system, we perform error analysis to explore model performance in relation to textual features. The system described in this paper scored 0.6755 and ranked 26th.


Introduction
SemEval 2021 Task 5: Toxic Spans Detection was organised by John Pavlopoulos and colleagues, and described in detail in their task description paper (Pavlopoulos et al., 2021). Competing teams were asked to develop systems capable of detecting spans of toxic text. Predictions were evaluated using a pairwise F1-score of toxic character offset predictions, described in section 5.1.
Initial analysis of the development data revealed that toxic spans were varied in content and not limited to single words. Though most examples contained single toxic words or phrases, others contained longer spans and complete sentences. Figure 1 illustrates this phenomenon. With this in mind, we sought a strategy that combined longer span-based detection with binary word classification. Table 1 reveals that toxic spans were on average ~3 times longer in the development set, whilst stop words were ~4 times more frequent. Figures 8 and 9 show the frequency of these features in relation to model performance.
Strategy We combined models that used antithetical contexts, i.e. full sequences, and shorter n-gram sequences before and after a given word. This approach is based on the hypothesis that their predictions would have a low correlation, and in turn, that they would make ideal ensemble components.

Results
The system described in this paper scored 0.6755 and ranked 26th. We discovered that model correlation did affect the accuracy of an ensemble approach; however, much of this performance increase was lost in the transition to test data, where correlation increased on the most frequent type of examples. In section 5.3 we analyse model performance and correlation in relation to textual features.

Background
Toxic span detection is a development of binary toxicity detection, which has garnered recent attention in the form of shared tasks and datasets (Wulczyn et al., 2017; Zampieri et al., 2019).
Features Teams were supplied with development data consisting of 7939 text samples of varying lengths up to 1000 characters, and tested on 2000 text samples.
Target Span detection asks systems to detect which specific series of characters are toxic, irrespective of the text's overall toxicity. Figure 2 illustrates the target value for SemEval 2021 Task 5. Unlike Named Entity Recognition, systems were not scored on their performance at negative, beginning, middle, or end token detection. This target definition led to a focus on positive optimisation, where false positives were of more importance than true negatives. In section 5.3 on error analysis we compare model scores using a binary word level representation of toxicity that scores both positive and negative predictions.
Task Interpretations We used two types of component models, binary word level models and categorical span-based models, and combined these in an LSTM network (Hochreiter and Schmidhuber, 1997). We used two word-based models [GLOV, BERT] and three span-based models [ALBE, ROBE, ELEC]; the softmax outputs of all models were concatenated and supplied to an LSTM model [ENSE].
Motivation We intended for the word-based models to learn local features in the tokens nearest the target word, and for the span-based models to learn the overall features that affected sub-word and multi-word toxicity.

Baselines
To interpret the task we relied on the spaCy-implemented baseline shared by the organizers and described in the task description paper (Pavlopoulos et al., 2021; Honnibal et al., 2020). The approach retrained the RoBERTa-based en_core_web_trf model's ner, trf_wordpiecer, and trf_tok2vec components, producing F1-scores of 0.5630 on the development data and 0.6305 on the test data. To interpret the problem further, we implemented two simple baselines.
Lexical Lookup Using a subset of samples from the development data, we created a toxic words list from all words within toxic spans, except for stop words 1 . On the test data, we then classified words as toxic if they appeared within the aforementioned toxic words list. We then converted word offsets into character offsets. This approach achieved an F1-score of 0.4161 on the test data.
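The lexical-lookup baseline above can be sketched as follows; the helper names and the small stop-word set are illustrative assumptions, not the system's actual code.

```python
# Hypothetical sketch of the lexical-lookup baseline: build a toxic-word
# list from development spans, then mark matching words in new text.
import re

# Illustrative stop-word set; the system used a standard stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "you"}

def build_lexicon(texts, span_offsets):
    """Collect every non-stop-word appearing inside a toxic span."""
    lexicon = set()
    for text, offsets in zip(texts, span_offsets):
        toxic_chars = "".join(text[i] for i in offsets)
        for word in re.findall(r"\w+", toxic_chars.lower()):
            if word not in STOP_WORDS:
                lexicon.add(word)
    return lexicon

def predict_offsets(text, lexicon):
    """Return the character offsets of every word found in the lexicon."""
    offsets = []
    for match in re.finditer(r"\w+", text):
        if match.group().lower() in lexicon:
            offsets.extend(range(match.start(), match.end()))
    return offsets
```

The word-to-character conversion in `predict_offsets` mirrors the final step described above, where word matches are expanded into character offsets for scoring.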
SVM Using Term Frequency-Inverse Document Frequency (TF-IDF), we created two document vector representations of toxic and non-toxic spans. Using a Support Vector Machine, we predicted the probability that a word vector appeared within a toxic or non-toxic document (Salton and McGill, 1986; Wu et al.). We then used a binary threshold of 0.5 and class weights based on relative label frequency to predict whether a word was toxic. This approach achieved an F1-score of 0.5489 on the test data.
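A minimal sketch of this kind of TF-IDF + SVM word classifier using scikit-learn; the toy word lists, the character n-gram features, and the helper names are assumptions for illustration, not the system's actual configuration.

```python
# Hedged sketch: TF-IDF word vectors + SVM with probability output,
# a 0.5 threshold, and frequency-based class weights.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy examples standing in for words drawn from toxic/non-toxic spans.
toxic_words = ["idiot", "idiots", "stupid", "moron", "dumb"]
nontoxic_words = ["comment", "article", "people", "think", "reading"]

words = toxic_words + nontoxic_words
labels = [1] * len(toxic_words) + [0] * len(nontoxic_words)

# Character n-grams let the model score unseen words (an assumption;
# the paper does not specify the exact feature analyser).
vectoriser = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
features = vectoriser.fit_transform(words)

# class_weight="balanced" weights classes by relative label frequency.
model = SVC(probability=True, class_weight="balanced")
model.fit(features, labels)

def is_toxic(word, threshold=0.5):
    prob = model.predict_proba(vectoriser.transform([word]))[0][1]
    return bool(prob >= threshold)
```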

Span Prediction
Span prediction models used the complete sequence of words, up to a maximum length, to predict toxic character offsets. Sequences were represented as token reference indexes, described in section 4.1. The target sequence was processed from character offsets into categorical arrays for toxic, non-toxic, and padding tokens.
Transformer Models We selected three pretrained transformer models (ALBERT, RoBERTa, ELECTRA) and fine-tuned them for this task with extra linear layers. We performed separate hyperparameter optimisation for each model, detailed in section 4.2. ALBERT is a lightweight implementation of a BERT model (Lan et al., 2020; Devlin et al., 2019) that uses feature reduction to reduce training time. ELECTRA is a further development of the BERT model that pre-trains as a discriminator rather than a generator (Clark et al., 2020). RoBERTa develops the BERT approach for robustness (Liu et al., 2019). During development we found that these three transformer models achieved the highest F1-scores in relation to model correlation, compared to alternatives. All models used the Adam optimizer (Kingma and Ba, 2017).
The binary word level models treated the task as word toxicity prediction based on a sequence of words before and after the target word. Figure 5 illustrates this approach. The target word's toxicity was represented as a binary value. The sequence length before and after the target word was optimised for each model, as described in section 4.2.

Binary Word Prediction
Siamese-LSTM with GloVe Word Embeddings A Siamese LSTM model used two networks based on separate GloVe embeddings of the sequences of words before and after the target word.
Figure 6: Input features and target labels for an example sequence, comparing a BERT-specific token representation with the character offset representation defined by the organisers (Pavlopoulos et al., 2021).
LSTM Finetuning BERT-base An LSTM model was trained based on the output of a BERT-base model. The words before and after the target word were used as model features, and the target word toxicity was represented as a binary value (Devlin et al., 2019).

Ensemble Model
A Bidirectional LSTM model was used to predict token toxicity based on tokenised word features and component model predictions. The model used transformer style feature representations to predict a sequence of categorical representations for token toxicity, as described in section 4.1. The ensemble model relied on five fold cross validation, as described in section 4.2.

Component model Predictions
Component model predictions were concatenated together as categorical representations of labels (not toxic = 0, toxic = 1, padding = 2). Each model's 3-dimensional output (number of samples, sequence length, number of labels) was permuted into a 4-dimensional matrix (number of samples, sequence length, number of labels, number of models).
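The stacking step can be sketched with NumPy; the sizes below are illustrative, and the five per-model arrays stand in for the component models' softmax outputs.

```python
# Sketch of stacking five component models' softmax outputs, each of
# shape (n_samples, seq_len, n_labels), into the 4-D ensemble input
# (n_samples, seq_len, n_labels, n_models).
import numpy as np

n_samples, seq_len, n_labels, n_models = 4, 200, 3, 5

rng = np.random.default_rng(0)
preds = [rng.random((n_samples, seq_len, n_labels))
         for _ in range(n_models)]

# Stacking on a new trailing axis performs the permutation described
# above: the model dimension becomes the last axis.
stacked = np.stack(preds, axis=-1)
assert stacked.shape == (n_samples, seq_len, n_labels, n_models)
```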

Experimental Setup

Pre-Processing
Tokenisation Text sequences were tokenised into character sequences using a BERT tokenizer and excess characters were replaced with a # character, as shown in Figure 6 (Devlin et al., 2019). Sequences were padded and truncated for uniformity to a length of 200 tokens. Longer sequences were handled separately, and predictions were combined in post-processing, described in section 4.4.
Target Label Representation To best suit the component models, we used a target representation based on the character sequences from the BERT tokenizer. Each word-like sequence was given a label based on its word-id, and converted into categorical binary arrays, or one-hot vectors. This is illustrated in Figure 6.
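The label encoding can be sketched as follows; the helper name and the fixed sequence length of 200 follow the description above, while the per-token 0/1 input labels are an assumed intermediate representation.

```python
# Sketch of the categorical target representation: each token position
# receives a one-hot vector over (not toxic, toxic, padding).
import numpy as np

NOT_TOXIC, TOXIC, PAD = 0, 1, 2

def encode_labels(token_labels, seq_len=200, n_labels=3):
    """token_labels: 0/1 per real token; remaining positions are padding."""
    ids = token_labels + [PAD] * (seq_len - len(token_labels))
    # Indexing the identity matrix yields the one-hot rows.
    return np.eye(n_labels)[ids[:seq_len]]  # shape (seq_len, n_labels)
```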

Training and Optimisation
Cross Validation We used stratified k-fold validation of the development data to train all models. After optimisation, each component model's predictions on the test portion of fold k were added to the train portion of the other folds, producing unseen training features for the ensemble model. This process avoids overfitting in component models, and facilitates training an ensemble model on the complete development data (Fushiki, 2011; Pedregosa et al., 2011).
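The out-of-fold scheme above can be sketched with scikit-learn; the logistic regression on toy features is a stand-in for a component model, not the system's actual architecture.

```python
# Sketch of out-of-fold predictions for stacking: each sample's
# prediction comes from a model that never saw that sample in training.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 8))
y = rng.integers(0, 2, size=100)

oof = np.zeros(100)  # one held-out prediction per sample
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    oof[test_idx] = model.predict_proba(X[test_idx])[:, 1]

# `oof` now covers the full development set without leakage and can
# serve as a training feature for the ensemble model.
```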

Hyper-Parameter Optimisation
Model parameters were optimised for each fold of the development data and the best models were used by the ensemble model. Table 2 shows the optimum parameters for each model used on the test data. We used Bayesian optimization for each fold of the development data to find optimum parameters (Snoek et al.). Component models were selected based on their f1-score and prediction correlation to other models. The ensemble model was trained on the predictions of the optimum model for each fold of the development data, expanded on in Section 4.3.

Prediction
To predict spans for submission, a version of each component model optimised for each fold of the development data was supplied the test data and their outputs were averaged. The ensemble model was then supplied component model predictions and tokenised text sequences.

Post-processing
Model output was converted from 2-dimensional token-level categorical arrays (n tokens, n labels) into character offsets. The character offsets of each positively labelled token were then added to a list, as illustrated in Figure 6. The predictions for sequences that had been truncated during pre-processing were combined, and duplicates were removed.
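The conversion step can be sketched as follows; `token_spans`, mapping each token to its (start, end) character span, is an assumed input from the tokenizer rather than part of the described pipeline.

```python
# Sketch of post-processing: collapse a (n_tokens, n_labels) categorical
# array into a sorted, de-duplicated list of toxic character offsets.
import numpy as np

TOXIC = 1  # label index for toxic tokens

def to_char_offsets(pred, token_spans):
    offsets = set()  # a set removes duplicates from combined chunks
    for token_pred, (start, end) in zip(pred, token_spans):
        if int(np.argmax(token_pred)) == TOXIC:
            offsets.update(range(start, end))
    return sorted(offsets)
```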

Task Specific Evaluation Metrics
Systems are evaluated with an F1-score of character offsets (Pavlopoulos et al., 2021). In cases where predicted spans are empty, a score of 1 is given when the true spans are also empty, and 0 is given if there are any true spans.
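The per-sample metric can be written out directly from the definition above, including the empty-span special cases:

```python
# Per-sample character-offset F1, following the task definition
# (Pavlopoulos et al., 2021): sets of predicted vs. true offsets.
def char_f1(pred, gold):
    pred, gold = set(pred), set(gold)
    if not pred:
        return 1.0 if not gold else 0.0
    if not gold:
        return 0.0  # predicted offsets but the true span is empty
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The system-level score is then the mean of `char_f1` over all samples.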

Error Analysis
We performed error analysis to interpret the hypothesis that there are multiple annotation rationales: single toxic words, and longer offensive sentences, illustrated in Figure 1. Figure 8 reveals that the length of toxic spans had an impact on model performance. Models were less accurate at detecting longer spans on both development and test data. Furthermore, the impact of this effect on the test data was reduced, as longer toxic spans were less frequent there.

Stop Words in Toxic Spans
The frequency of stop words in toxic spans also affected model performance. Figure 9 reveals that, where present, spans with more stop words caused lower model accuracy.
Binary Token Level Evaluation By using token-level scoring we are able to reveal how the models perform on both positive and negative tokens. Here, the target labels are represented as binary arrays: 1 for toxic tokens and 0 for non-toxic. We cannot expect these calculations to align with character offsets, due to variance in tokenisation and parsing. Figure 10 shows binary token-level scores for precision, recall, and F1-score.
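The token-level scoring amounts to standard binary precision, recall, and F1 over 0/1 token labels; the toy label sequences below are illustrative.

```python
# Sketch of binary token-level evaluation with scikit-learn's
# standard precision/recall/F1 over flattened 0/1 token labels.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 1, 0, 0, 1]  # gold token toxicity
y_pred = [0, 1, 0, 0, 1, 1]  # predicted token toxicity

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
```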

Conclusion
Our initial hypothesis, that combining word-based and span-based approaches would yield a significant performance boost, was not borne out. We measured a 5% increase in F1-score on development data, but this gain did not transfer to the test data.
In future work, we would look to a strategy that incorporates model transferability into component model selection, drawing on recent work (Fortuna et al., 2021), with the intention of better handling fluctuations in annotation rationale.