LIIR at SemEval-2020 Task 12: A Cross-Lingual Augmentation Approach for Multilingual Offensive Language Identification

This paper presents our system, 'LIIR', for SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2). We participated in Sub-task A for the English, Danish, Greek, Arabic, and Turkish languages. We adapt and fine-tune the BERT and Multilingual BERT (mBERT) models made available by Google AI for English and the non-English languages, respectively. For English, we use a combination of two fine-tuned BERT models. For the other languages, we propose a cross-lingual augmentation approach to enrich the training data and use mBERT to obtain sentence representations.


Introduction
Nowadays, with the exponential increase in the use of social media platforms such as Facebook and Twitter by people from different educational and cultural backgrounds, automatic methods for recognizing and filtering offensive language have become necessary (Chen et al., 2012; Nobata et al., 2016). Different types of offensive content, such as hate speech, aggression (Kumar et al., 2018), and cyberbullying (Dinakar et al., 2011), can be very harmful to users' mental health, especially for children and young people (Xu and Zhu, 2010).
The OffensEval 2019 competition (Zampieri et al., 2019b) was an attempt to build systems capable of recognizing offensive content in social networks for the English language. The OffensEval 2019 organizers defined three Subtasks: whether a message is offensive or not (Subtask A), what the type of the offensive message is (Subtask B), and who the target of the offensive message is (Subtask C). This year, they have extended the competition to several languages while the Subtasks remain the same as in OffensEval 2019. OffensEval 2020 features a multilingual dataset with five languages: English, Danish, Turkish, Greek, and Arabic.
This article presents our approaches to SemEval-2020 Task 12: OffensEval 2, Multilingual Offensive Language Identification in Social Media. We participated in Subtask A for all languages. The goal of Subtask A is recognizing whether a sentence is offensive or not. For the English language, we separately fine-tune two BERT models (Devlin et al., 2018) on two different datasets and use the combination of these two models to train our classifier. We also perform extensive preprocessing for the English language. For the other languages, we enrich the provided training dataset for each language using a cross-lingual augmentation approach, and then train a classifier by fine-tuning a multilingual BERT (mBERT) model (Devlin et al., 2018), with a linear classification layer on top, on the augmented dataset. Our augmentation approach, inspired by the work of Lample and Conneau (2019) and Singh et al. (2019), translates each training sample into three other languages and then adds the original training sample concatenated with each translation to the training set. LIIR achieved a performance competitive with the best methods proposed by the participants in the English track. Furthermore, empirical results show that our cross-lingual augmentation approach is effective in improving results.
The rest of the article is organized as follows. The next section reviews related work. Section 3 describes the methodology of our proposed models. Section 4 discusses the experiments and Section 5 presents the results. Finally, the last section concludes our work.

Offensive Language Identification
Earlier works addressing Offensive Language Identification relied on manually extracting different types of features (Schmidt and Wiegand, 2017), such as token and character n-grams, word clusters, sentiment analysis outcomes, lexical and linguistic features, knowledge-based features, and multimodal information (Warner and Hirschberg, 2012; Gitari et al., 2015; Dinakar et al., 2012; Hosseinmardi et al., 2015). The extracted features were used to train machine learning methods such as a support vector machine (SVM), naive Bayes, logistic regression, a random forest classifier, or a neural network.
With the success of transfer learning enabled by pre-trained language models such as BERT (Devlin et al., 2018), GPT (Radford et al., 2018), and ULMFiT (Howard and Ruder, 2018), researchers have resorted to using these methods for the Offensive Language Identification task. In the OffensEval 2019 competition (Zampieri et al., 2019b), among the top-10 teams that participated in Subtask A, seven used BERT with variations in the parameter settings and the preprocessing steps (Liu et al., 2019; Nikolov and Radivchev, 2019; Pelicon et al., 2019).

Multilingual Methods
There is a substantial body of work investigating how to leverage multilingual data to improve the performance of a monolingual model or even to enable zero-shot classification. The XLM model (Lample and Conneau, 2019) extended BERT to a cross-lingual setting in which, instead of monolingual text, concatenated parallel sentences are used in the pretraining procedure. This method achieved strong results in machine translation, language modeling, and cross-lingual Natural Language Inference. XLDA (Singh et al., 2019) is a cross-lingual data augmentation method that simply replaces a segment of the input text with its translation in another language. The authors observed that most languages are effective as cross-lingual augmenters in cross-lingual Natural Language Inference and Question Answering tasks. The MNCN model (Ghadery et al., 2019) used multilingual word embeddings as word representations and augmented the training data by combining training sets in different languages for aspect-based sentiment analysis. This method can classify sentences in a given language even when no labeled training data is available for it.

Methodology
In this section, we present the proposed methods in more detail. We participated in Subtask A, categorizing a given sentence as 'Offensive' or 'Not-offensive', for the English, Turkish, Arabic, Danish, and Greek languages. This year, the OffensEval organizers have provided labeled training data for all languages except English, for which they have only provided unlabeled training data. Therefore, we propose two different approaches in this paper: one for the English language and one for the other languages. For the English language, we fine-tune two BERT models separately on two different datasets and use the combination of these two models to train our classifier. For the other languages, a cross-lingual augmentation approach is used to enrich each language's training set, and we fine-tune an mBERT model to obtain sentence representations. In the following subsections, we describe our cross-lingual augmentation technique and detail the proposed models.

Cross-lingual Augmentation
Given a training set X = {(x_i, y_i)}_{i=1}^{n}, where x_i is a training sentence, y_i is the label corresponding to x_i, and n is the number of training sentences, we create the augmented training set X̃ in three steps as follows. First, each x_i is translated into English, French, and German using Google Translate. Second, given the obtained translations x_i^en, x_i^fr, and x_i^de (the English, French, and German translations, respectively), we generate three new samples:

(x_i ; x_i^en), (x_i ; x_i^fr), (x_i ; x_i^de),

where ; is the concatenation operand. Finally, we create the augmented training set X̃ by adding the original training samples and their three generated samples to X̃. We chose these three languages as translation candidates because they are the top three languages used in Wikipedia. Since the mBERT model is trained on a huge amount of Wikipedia text in different languages, translating each training sample into these three languages during fine-tuning makes the representation of a sentence more informative. In other words, to predict the target label, the model can leverage the translated context when the original context is not sufficient (Lample and Conneau, 2019). It is fair to say that the quality of the proposed cross-lingual augmentation depends on the quality of the translation.
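The augmentation step above can be sketched as follows. This is a minimal illustration, not the exact implementation: `translate` stands in for a call to a translation service (the paper uses Google Translate), and the `" ; "` separator is an assumption about how the concatenation is rendered as text.

```python
def augment(dataset, translate, langs=("en", "fr", "de")):
    """dataset: list of (sentence, label) pairs.
    translate(text, lang): hypothetical helper returning the translation
    of `text` into `lang`."""
    augmented = []
    for x, y in dataset:
        augmented.append((x, y))  # keep the original sample
        for lang in langs:
            x_t = translate(x, lang)  # x_en, x_fr, x_de
            # original concatenated with its translation, same label
            augmented.append((x + " ; " + x_t, y))
    return augmented
```

Each original sample thus yields four training samples: itself plus three concatenated variants, all sharing the original label.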

English
For the English language, we first automatically label the provided unlabeled dataset to obtain a weakly labeled training set. The OffensEval organizers provided a confidence score for each sentence instead of a gold label, where the score is the average confidence of the sentence belonging to the 'Offensive' class as produced by several learning methods. We investigated different threshold values on the confidence scores for weakly labeling sentences as 'Offensive' or 'Not-offensive'. In our experiments, we found that the precise threshold on the confidence score is not an important factor in performance; the important factor is the number of weakly labeled training samples. To reduce the number of noisy samples, and since precision is more important than recall in acquiring true training samples, we label sentences with a confidence score above 0.8 as 'Offensive' and sentences with a confidence score below 0.2 as 'Not-offensive'. We then randomly sample 300k 'Offensive' sentences and 300k 'Not-offensive' sentences as our final weakly labeled dataset. In the next step, we adapt and fine-tune two separate BERT models, one on the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019a) and one on our weakly labeled offensive dataset. We then train a feed-forward layer to classify a given sentence as 'Offensive' or 'Not-offensive', where the input of the classifier is the concatenation of the sentence representations extracted from the two fine-tuned BERT models. The representation of the '[CLS]' token, which is the first token of every input sequence, is used as the sentence representation.
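The weak-labeling rule can be sketched in a few lines; the function name and the 'OFF'/'NOT' tags are illustrative, but the thresholds (0.8 and 0.2) are those stated above.

```python
def weak_label(sentences, scores, hi=0.8, lo=0.2):
    """Keep only confidently scored sentences.
    `scores` are the organizer-provided average 'Offensive'
    confidences in [0, 1]."""
    labeled = []
    for sent, score in zip(sentences, scores):
        if score > hi:
            labeled.append((sent, "OFF"))  # confidently offensive
        elif score < lo:
            labeled.append((sent, "NOT"))  # confidently not offensive
        # scores in [lo, hi] are discarded as too uncertain
    return labeled
```

Discarding the middle band trades recall for precision, which matters here because noisy labels in the 600k weakly labeled samples directly hurt the fine-tuned model.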

Other Languages
For the other languages, we augment the training set using the proposed cross-lingual augmentation technique and then, using the augmented dataset, train a classifier by fine-tuning a pre-trained mBERT model topped with a feed-forward classification layer. The '[CLS]' token representation is fed to the classification layer as the sentence representation.
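The classification head itself is just a linear layer followed by a softmax over the two classes. The toy sketch below shows that computation on plain Python lists; in the actual system `cls_vec` would be mBERT's 768-dimensional '[CLS]' embedding and `W`, `b` the learned weights, so all values here are hypothetical.

```python
import math

def classify(cls_vec, W, b):
    """Linear layer + softmax over two classes
    ('Not-offensive', 'Offensive').
    cls_vec: sentence representation (list of floats)
    W: weight matrix (one row per class), b: bias vector."""
    logits = [sum(w * x for w, x in zip(row, cls_vec)) + bj
              for row, bj in zip(W, b)]
    exps = [math.exp(l) for l in logits]  # softmax normalization
    total = sum(exps)
    return [e / total for e in exps]
```

During fine-tuning, both this layer and the underlying mBERT weights are updated end-to-end on the augmented training set.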

Datasets
In this section, we present an overview of the datasets used in this article for training our models for the OffensEval 2020 competition. For the English language, we use the large unlabeled dataset provided by the organizers to create the weakly labeled dataset. We also use the OLID dataset (Zampieri et al., 2019a) for training the English model. For the other languages, we use the labeled datasets provided by the organizers for Turkish (Çöltekin, 2020), Danish (Sigurbergsson and Derczynski, 2020), Greek (Pitenis et al., 2020), and Arabic (Mubarak et al., 2020). The detailed statistics of the datasets are summarized in Table 1.

Language | Train OFF | Train NOT | Train Total | Test OFF | Test NOT | Test Total
English  | 300k      | 300k      | 600k        | 1080     | 2807     | 3887
Danish   | 307       | 2061      | 2368        | 41       | 288      | 329
Turkish  | 4837      | 20184     | 25021       | 716      | 2812     | 3528
Arabic   | 1371      | 5468      | 6839        | 402      | 1598     | 2000
Greek    | 1989      | 5005      | 6994        | 242      | 1302     | 1544

Table 1: Dataset statistics

Experimental Settings
For the English language, extensive preprocessing is conducted, including emoji-to-text projection, hashtag segmentation, replacing slang and abbreviations (Effrosynidis et al., 2017), replacing '@USER' with '<user>' and 'URL' with 'http', and removing numbers. As the evaluation set, we held out 20 percent of the training set for Danish, Greek, and Turkish. For the Arabic language, the evaluation set is provided by the organizers, and for the English language, the OffensEval 2019 test data is used as the evaluation set. The HuggingFace library (Wolf et al., 2019) is used to obtain the pre-trained BERT and mBERT models. All hyperparameters of our models are tuned on the validation data via grid search. We trained our models for 4 epochs with batch sizes of 8, 16, 24, 32, and 16 for English, Danish, Arabic, Greek, and Turkish, respectively. The Adam optimizer is used with learning rates of 2e-05, 1e-05, 3e-05, 2e-05, and 2e-05 for English, Danish, Arabic, Greek, and Turkish, respectively.
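The token-replacement part of this preprocessing can be sketched with simple string rules; this covers only the '@USER'/'URL' masking and number removal, since emoji projection and hashtag segmentation rely on external tools that are out of scope for a short sketch.

```python
import re

def preprocess(tweet):
    """Minimal sketch of the rule-based English preprocessing steps."""
    tweet = tweet.replace("@USER", "<user>")  # masked mentions
    tweet = tweet.replace("URL", "http")      # masked links
    tweet = re.sub(r"\d+", "", tweet)         # remove numbers
    return " ".join(tweet.split())            # collapse leftover whitespace
```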

Results
In this section, we present the results obtained by our methods on the test sets for Subtask A. Table 2 shows the results obtained by the submitted final models for each language on the test sets. All results are reported in terms of macro-F1. Furthermore, we provide the results of two baselines for the sake of comparison: the 'Majority' baseline and the 'Best system' baseline. In the 'Majority' baseline, the classifier simply predicts the majority class, and for the 'Best system' baseline, we report the best result obtained by the participating teams for each language. The best result for each language is marked in bold. The results show that LIIR achieved a performance competitive with the best teams in almost all languages. These results demonstrate that our models can effectively identify offensive content in a given tweet. We can also observe that our method shows weak performance on the Danish language, which we believe is due to problems of the mBERT model in processing some Danish characters (Strømberg-Derczynski et al., 2020). Note that the best systems mostly use either language-specific resources or ensembles of several models, while we only fine-tune an mBERT model using augmented training data for the non-English languages.

Ablation Analysis
In this part, we provide an ablation study of the models proposed for the different languages on the validation set. We show the effect of using two pre-trained models on two different datasets for classifying the English tweets. Furthermore, we examine how much the final results for the other languages were influenced by the cross-lingual augmentation technique. Table 3 shows the ablation study results for the English language. The first observation is that the model pre-trained on the OLID dataset contributes to better performance compared to the model pre-trained on our weakly labeled dataset. This result is what we expected, since the weakly labeled dataset unavoidably contains noisy samples that negatively affect model performance. The best result is obtained by combining the two models in the training procedure. In Table 4, we report the effect of the cross-lingual augmentation technique on the validation results for the other languages.

Conclusion
In this paper, we have presented our models for recognizing offensive content in SemEval-2020 Task 12, Subtask A, for all languages. We fine-tuned two BERT models on two different datasets for the English language. Moreover, we fine-tuned an mBERT model on training sets enriched by our cross-lingual augmentation approach for the other languages. The evaluation results show that the proposed systems are capable of effectively recognizing offensive content in multiple languages. As future work, we intend to investigate other augmentation techniques and to utilize language-specific resources to improve the performance of our method in languages other than English. Furthermore, we plan to address the problem of class imbalance in the training sets.

Acknowledgments
This research was funded by the DAME project from the KU Leuven with grant number C32/17/015.