Amsqr at SemEval-2020 Task 12: Offensive Language Detection Using Neural Networks and Anti-adversarial Features

This paper describes a method and system for detecting offensive language in social media using anti-adversarial features. Our submission to the SemEval-2020 Task 12 challenge was generated by a stacked ensemble of neural networks fine-tuned on the OLID dataset and additional external sources. For Subtask A (English), text normalisation filters were applied at both the graphical and lexical level. The normalisation step effectively mitigates not only the natural presence of lexical variants but also intentional attempts to bypass moderation by introducing out-of-vocabulary words. Our approach provides strong F1 scores for both the 2020 (0.9134) and 2019 (0.8258) challenges.


Introduction
Keeping social media platforms free from unwanted publications such as spam, scams, phishing, hate speech, targeted attacks and fake news remains an active research topic. This is due not only to the relatively low cost of creating fake accounts, bots (Albadi et al., 2019) and forged online identities, but also to the large amount of personal information available on the Internet, which makes targeting certain groups and individuals easier than ever. While some of these threats previously affected traditional messaging platforms such as email and SMS, the reach and adoption of social media applications have amplified their impact, requiring additional cost and effort to mitigate.
The use of offensive language as a vehicle to attack individuals and communities poses challenges not only for humans, who are prone to subjective and biased judgement (Sap et al., 2019), but also for automatic moderation systems. The inherent ambiguity of messages that are often short, written in mixed languages, full of informal words and frequently subject to adversarial modification renders naïve filtering approaches, such as word lists or re-purposed spam detection models, ineffective. Solving this problem therefore calls for more sophisticated approaches based on state-of-the-art natural language processing (NLP).
To track and measure progress in the area of offensive language detection, several English datasets with annotations for hate speech (Davidson et al., 2017; Waseem and Hovy, 2016), targeted attacks (Zampieri et al., 2019a) and personal attacks (Wulczyn et al., 2016) have been released over recent years. Likewise, public evaluations such as HatEval (Basile et al., 2019) and OffensEval (Zampieri et al., 2019b) were recently introduced, highlighting the need for stronger baselines against which to assess the performance of more complex systems.
This paper presents and evaluates the method and system submitted to SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media, Subtask A (English), based on a stacked ensemble of neural networks. The rest of the document is organised as follows: in Section 2 we review related work on detecting abusive language; in Section 3 we describe our layered model, including our anti-adversarial strategy based on text normalisation and stacking-based ensembling; in Section 4 we present the results obtained on the test and evaluation datasets; finally, in Section 5 we draw our conclusions and outline future work.

Related Work
Previous work on automatic hate speech and offensive language detection made use of linear models over word n-grams (Malmasi and Zampieri, 2017) and sentiment lexicons (Davidson et al., 2017). Most recent research, however, is dominated by neural network architectures: Liu et al. (2019) and Zhu et al. (2019) successfully applied bidirectional transformers (BERT) (Devlin et al., 2018), showing that pre-trained models fine-tuned for this task can outperform other approaches. In parallel, convolutional neural networks (CNNs) and bidirectional LSTMs (bi-LSTMs) have provided strong results (Mahata et al., 2019) when paired with pre-trained embeddings such as FastText (Bojanowski et al., 2017), GloVe (Pennington et al., 2014) or word2vec (Mikolov et al., 2013).
Although it adds complexity, combining several models can effectively reduce classification bias and variance: both voting ensembles (Seganti et al., 2019) and stacked generalisation (Malmasi and Zampieri, 2018) have shown good results when applied to this particular problem.

Methodology
The goal of Subtask A is to determine whether a tweet is offensive or not offensive, which conceptually translates to a binary classifier scored with macro-averaged F1. However, during exploratory data analysis of the training set we identified groups of instances where users intentionally crafted offensive messages to bypass profanity and moderation filters. For this reason, our design choices were made with an anti-adversarial strategy in mind.
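For reference, macro-averaged F1 gives equal weight to the OFF and NOT classes regardless of class imbalance. A minimal sketch of the scoring, using scikit-learn and illustrative labels (not data from the task), is:

```python
from sklearn.metrics import f1_score

# Binary labels: 1 = offensive (OFF), 0 = not offensive (NOT).
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0]

# Macro-F1 averages the per-class F1 scores, weighting OFF and NOT
# equally regardless of how imbalanced the dataset is.
print(f1_score(y_true, y_pred, average="macro"))
```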
The best performing models in previous benchmarks (Basile et al., 2019; Zampieri et al., 2019b) were based on popular pre-trained embeddings and architectures, either via transfer learning or by leveraging them directly. While this is quite convenient in terms of computing cost, it also introduces potential weaknesses that can be exploited in a black-box scenario: since there is only a reduced set of high-quality pre-trained models, an attacker who guesses the base architecture a model was built upon can launch more successful black-box attacks (Wang et al., 2018). These are usually performed via input perturbations such as introducing synonyms (Jin et al., 2019), flipping characters (Pruthi et al., 2019) or including targeted keywords and typos (Shi et al., 2020); more sophisticated attacks can even steal the whole model altogether (Krishna et al., 2019).
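As an illustration of the kind of perturbation involved (a hypothetical sketch, not the tooling used in the cited attacks), the snippet below shows how homoglyph substitutions and character flips turn in-vocabulary words into out-of-vocabulary tokens that a naive word-list filter misses; the block list is invented for the example:

```python
# Hypothetical block-list filter, trivially bypassed by perturbations.
BLOCKLIST = {"idiot", "stupid"}

def wordlist_filter(text: str) -> bool:
    """Return True if any token matches the block list exactly."""
    return any(tok.lower() in BLOCKLIST for tok in text.split())

print(wordlist_filter("you are an idiot"))       # True: caught
print(wordlist_filter("you are an id\u0456ot"))  # False: Cyrillic 'i' homoglyph
print(wordlist_filter("you are an idoit"))       # False: flipped characters
```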

Text Normalisation
Lexical normalisation techniques are particularly effective against black-box adversarial attacks (Alshemali and Kalita, 2019), while also increasing the performance of NLP tools and applications on informal text (Mosquera and Moreda, 2013).
For this reason, we applied a text normalisation filter to some of our inputs in order to reduce out-of-vocabulary (OOV) words. This is not only effective against some adversarial perturbations but also replaces common typos and the informal lexical variants typical of microblogs with their canonical versions. Normalisation is performed at two levels: lexical and graphical. At the lexical level we follow a modular architecture similar to TENOR (Mosquera et al., 2012), where a high-precision, low-recall normalisation dictionary is recursively combined with shortening/lengthening and re-casing rules (see Table 1). At the graphical level, unicode homographs and near-homographs are translated to their ASCII equivalents using a lookup table (see the last entry in Table 1).

Original | Normalised
Then these dumba$$es vote Democrat!?!!! | then these dumb asses vote democrat
@USER Again another b******* story no one is watching football because of this a****** | again another bullshit story no one is watching football because of this asshole
theyre abso shite quality tho | they are absolute shit quality though
Gets Period* You are the cause of my dysphoria | gets period you are the cause of my dysphoria

Table 1: Text fragments where a label flip was observed after normalisation during validation.
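A minimal sketch of this two-level filter is shown below; the dictionary entries and homoglyph table are small illustrative stand-ins, not the actual TENOR resources, and the recursion is simplified to a single per-token pass:

```python
import re
import unicodedata

# Illustrative stand-ins for the high-precision normalisation dictionary
# and the unicode homograph lookup table; the real resources are larger.
LEXICAL_DICT = {"theyre": "they are", "abso": "absolute",
                "shite": "shit", "tho": "though"}
HOMOGLYPHS = {"$": "s", "\u0456": "i"}  # e.g. "dumba$$es" -> "dumbasses"

def normalise(text: str) -> str:
    # Graphical level: fold unicode homographs/near-homographs to ASCII.
    text = unicodedata.normalize("NFKC", text)
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
    # Lexical level: shorten character lengthening ("soooo" -> "soo"),
    # re-case, and apply the normalisation dictionary per token.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    tokens = [LEXICAL_DICT.get(t.lower(), t.lower()) for t in text.split()]
    return " ".join(tokens)

print(normalise("theyre abso shite quality tho"))
# -> "they are absolute shit quality though"
```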

Ensembling
Aiming to minimise the impact of adversarial attacks targeting popular models, we designed a 2-level classifier based on stacked generalisation, as shown in Figure 1. The first level (L1) comprises 42 models trained over several lexical resources using the OLID (Zampieri et al., 2019a) dataset and labels. This effectively encapsulates different models and training datasets, increasing the chances of thwarting off-the-shelf attacks aimed at specific architectures. Details of the individual models and datasets for level 1 can be found in Table 2.
Among the L1 models, the capsule network (CapsNet) + GloVe deserves mention: CapsNets (Sabour et al., 2017) have been shown to be an alternative to convolutional neural networks (CNNs) that is more robust against white-box adversarial attacks (Frosst et al., 2018), and they have also been seen outperforming CNNs in offensive text identification. An emoji-based model was also included, where only the maximum offensive score per emoji is considered (in case more than one emoji appears in a message), trained on OffensEval 2020.

Model | Description | Dataset
charhasoc | Character n-gram (3-6) + logistic regression | HASOC (Mandl et al., 2019)
chartrac | Character n-gram (3-6) + logistic regression | TRAC (Kumar et al., 2018)
hateval.* | Word n-gram (1-3) + logistic regression over the 3 different labels, providing the wordhate, wordtarget and wordag inputs for hate speech, targeted attack and aggression respectively; the same models were also trained at character level (3-6), resulting in another 3 inputs (charhate, chartarget and charag) | HatEval (Basile et al., 2019)
charinsult | Character n-gram (3-6) + logistic regression | Kaggle insults

Table 2: Individual models and datasets for level 1.
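A reduced sketch of the stacking mechanics, using two L1 models in the style of the n-gram entries in Table 2 and a logistic regression meta-learner, is given below. Note this is only illustrative: the real system stacks 42 heterogeneous models (including the neural ones), each trained on a different dataset, which a single scikit-learn StackingClassifier does not express.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Two illustrative L1 models in the style of Table 2:
# word n-gram (1-3) and character n-gram (3-6) + logistic regression.
word_ngram = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
char_ngram = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 6)),
    LogisticRegression(max_iter=1000),
)

# L2 meta-learner trained on out-of-fold probability outputs of the
# L1 models (the essence of stacked generalisation).
stack = StackingClassifier(
    estimators=[("word_ngram", word_ngram), ("char_ngram", char_ngram)],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=5,
)

# X_train: list of raw tweets, y_train: OFF/NOT labels (placeholders).
# stack.fit(X_train, y_train); stack.predict(X_test)
```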

Results
Our offensive text classification system obtained strong results across the different datasets, which are summarised in Table 3.

Table 3: Results of individual models (L1) and the final ensemble (L2) versus the best public scoring approach for each task.
Interestingly, BERT models fine-tuned on the Kaggle toxic dataset showed a high correlation with this year's test set, even slightly improving the final ensemble results when compared against the identity hate and toxic classes. This correlation is not present in the previous year's test set, where models trained on OLID outperformed the rest by a considerable margin.
Another apparent trend reversal can be observed in the normalised models: on the 2019 test set, individual models with normalisation outperformed their non-normalised equivalents, while on the current test set results were comparable for normalised and non-normalised models alike.
Labelling shifts for certain keywords that caused the system to produce false positives may be worth further analysis: 79% of the tweets containing the pattern "sick|disgusting|sucks" were labelled as offensive in OLID, compared with 55% when considering the test set gold labels. Some examples of this disagreement can be found in Table 4.

Tweet | Dataset | Label
@USER That sucks {thumbs down} | OLID | OFF
@USER The game sucks | OLID | OFF
@USER man that sucks unreal | OLID | OFF
@USER Oh god, that sucks :/ | Test | NOT
ldr doesn't really works it sucks | Test | NOT
Honestly they're not even pretty and the music sucks.... What do people see?? | Test | NOT

Table 4: Similar tweets with different labels across OLID and the test set.
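The keyword analysis above can be reproduced with a simple pattern match over each dataset. The sketch below assumes pandas DataFrames with illustrative "text" and "label" columns; the variable and column names are ours, not the paper's:

```python
import pandas as pd

PATTERN = r"sick|disgusting|sucks"

def offensive_rate(df: pd.DataFrame) -> float:
    """Share of pattern-matching tweets labelled OFF in a dataset."""
    hits = df[df["text"].str.contains(PATTERN, case=False, regex=True)]
    return (hits["label"] == "OFF").mean()

# olid_df / test_df are assumed DataFrames holding OLID and the gold-labelled
# test set; in the paper these rates came out at 79% (OLID) vs. 55% (test).
# print(offensive_rate(olid_df), offensive_rate(test_df))
```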

Conclusion and Future Work
In this paper we described our system and method for detecting offensive tweets, built for SemEval-2020 Task 12, Subtask A. Our design choices had an adversarial environment in mind, and we therefore made use of anti-adversarial features such as text normalisation and ensemble learning, obtaining strong results on two evaluation datasets. In future work we would like to explore different attack and defence scenarios for this particular problem.