Team Rouges at SemEval-2020 Task 12: Cross-lingual Inductive Transfer to Detect Offensive Language

With the growing use and availability of social media, many instances of offensive language have been observed across multiple languages and domains. This phenomenon has given rise to a growing need to detect offensive language in social media cross-lingually. In OffensEval 2020, the organizers released the multilingual Offensive Language Identification Dataset (mOLID), which contains tweets in five different languages, to detect offensive language. In this work, we introduce a cross-lingual inductive approach to identify offensive language in tweets using the contextual word embedding XLM-RoBERTa (XLM-R). We show that our model performs competitively on all five languages, obtaining the fourth position in the English task with an F1-score of 0.919 and the eighth position in the Turkish task with an F1-score of 0.781. Further experimentation shows that our model works competitively in a zero-shot learning environment and is extensible to other languages.


Introduction
The prevalence of social media has made public commentary a critical aspect in shaping public opinion. Although freedom of speech is often advocated, offensive language in social media is unacceptable. Nevertheless, social media platforms and online communities are laden with offensive comments. This phenomenon results in the need for computationally identifying offense, aggression, and hate-speech in user-generated content in multiple languages.
This paper addresses the challenge put forward in the Multilingual Offensive Language Identification in Social Media shared task organized at SemEval 2020. The goal is to identify offensive language in tweets in Arabic, Danish, English, Greek, and Turkish. The shared task is divided into three sub-tasks. The first sub-task consists of the identification of offensive tweets in a multilingual setting, whereas the other two consist of the categorization of offensive tweets and the identification of targets in English.
Transfer learning is a methodology for using the knowledge acquired from one or more tasks to solve other related tasks. It is ubiquitous in Natural Language Processing. It can be classified into multiple types, such as transductive transfer learning and inductive transfer learning. The former is used when the tasks are the same but the corresponding domains are different, as in the case of cross-lingual learning for similar tasks. The latter is used when the tasks are different but the domains are similar, as in the case of finetuning pretrained contextual word embeddings.
Cross-lingual inductive learning has been used for many downstream tasks, such as multilingual variants of question answering, text classification, and text generation (Artetxe and Schwenk, 2018; Lample and Conneau, 2019). The rationale is that a language with limited resources benefits from joint training over many languages. It also helps in performing zero-shot learning and in handling code-switched text, which are otherwise difficult to tackle. In this work, we propose the use of cross-lingual inductive learning to detect offensive language in the given five languages. We use pretrained XLM-R cross-lingual embeddings and train a single cross-lingual model for all five languages in the multilingual Offensive Language Identification Dataset.

Related Work
Xu et al. (2012) proposed the task of identifying bullying in social media using NLP. They presented benchmarking results for text classification among different NLP tasks using off-the-shelf solutions. Further, Nobata et al. (2016) proposed a machine learning-based method incorporating linguistic features, including n-gram, distributional semantic, and syntactic features, to detect abusive language in user-generated online content. They also released a first-of-its-kind corpus of user comments annotated for offensive language, sampled from different domains. Zampieri et al. (2019) proposed the shared task OffensEval 2019 for detecting offensive language in tweets sampled from Twitter. They released the Offensive Language Identification Dataset (OLID), consisting of 14,100 tweets in English annotated for offensive content using a fine-grained three-layer annotation scheme. OffensEval 2020 uses the same dataset for the English task and a similar data creation methodology for the other languages.
Artetxe and Schwenk (2018) introduced an architecture for learning joint multilingual sentence representations, using a single BiLSTM encoder with a BPE vocabulary shared by all the languages. They use 93 languages, spanning 30 different language families and 28 different scripts, for training the sentence representation embeddings. Experimental results on the XNLI dataset (Conneau et al., 2018) for cross-lingual natural language inference and on the MLDoc dataset for cross-lingual document classification show that the cross-lingual transfer of learned features helps improve the performance of classification models in a cross-lingual setting.
Chen and Cardie (2018) proposed a fully unsupervised framework for learning multilingual word embeddings using only monolingual corpora. Unlike prior work on multilingual word embeddings, these embeddings exploit the relations between all the language pairs. The authors map all monolingual embeddings into a shared multilingual embedding space via a two-stage algorithm consisting of Multilingual Adversarial Training and Multilingual Pseudo-Supervised Refinement.
Recent works on text classification exploit pretrained contextualized word representations rather than context-independent word representations. These pretrained contextualized word embeddings, like BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2020), have outperformed many existing techniques on most NLP tasks with minimal task-specific architectural changes and training resources. Contextualized word embeddings have been used for detecting subjective bias (Pant et al., 2020) and for detecting hate speech in tweets (Mozafari et al., 2019).
To train multilingual variants of the contextualized word embeddings effectively, Lample and Conneau (2019) proposed a cross-lingual language modeling methodology termed XLM. In their paper, they proposed two variants of their model: supervised and unsupervised multilingual language models. The supervised variant used parallel data with a new cross-lingual language model objective. XLM-R was subsequently proposed as an extension of XLM. XLM-R differs from XLM in that it allows pretraining of cross-lingual language models at scale, using larger datasets and better-tuned hyperparameters. Experiments with XLM-R on XNLI, MLQA, and NER show significant improvements over the previous state of the art. We further finetune XLM-R, pretrained on a 2.5 TB Common Crawl corpus spanning 100 languages.

Methodology
In this section, we describe the preliminaries for the work: XLM-R, a cross-lingual contextual word representation, and cross-lingual inductive transfer. We then illustrate our sequential approach for finetuning the models.

XLM-R
XLM-R is a transformer-based cross-lingual model pretrained using a multilingual masked language model objective on 2.5 TB of CommonCrawl data in 100 languages. XLM-R obtains state-of-the-art performance in cross-lingual classification, sequence labeling, and question answering. It obtains competitive results when compared with monolingual models on GLUE (Wang et al., 2018) and XNLI tasks, showing that it is possible to have a single large model for all languages without sacrificing per-language performance.
Figure 1: Architecture of our cross-lingual model.
XLM-R, like other multilingual contextual word embeddings, suffers from the curse of multilinguality. While low-resource language performance can be improved by adding high-resource languages during pretraining, the overall downstream task performance may suffer from capacity dilution (Arivazhagan et al., 2019). This degradation happens because model capacity is constrained by practical considerations, including memory and speed during training. Moreover, for a fixed-size model, the per-language capacity decreases as the number of languages increases. Despite these limitations, XLM-R provides competitive results on downstream tasks when compared with monolingual models.
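The masked language model objective mentioned in this section can be illustrated with a minimal, simplified sketch. It only shows the idea of hiding some input tokens and asking the model to recover them. The `MASK_TOKEN` symbol, the `mask_tokens` helper, and the 0.15 default masking rate are assumptions made for illustration; the actual XLM-R pretraining procedure operates on subword units and may differ in its details.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK_TOKEN = "<mask>"  # assumed placeholder symbol, for illustration only


def mask_tokens(
    tokens: Sequence[str],
    mask_prob: float = 0.15,  # assumed rate, not taken from this paper
    seed: Optional[int] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Hide each token independently with probability `mask_prob`.

    Returns the corrupted sequence and, per position, the original token
    the model should predict (None where nothing was masked).
    """
    rng = random.Random(seed)
    masked: List[str] = []
    targets: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(token)
        else:
            masked.append(token)
            targets.append(None)
    return masked, targets
```

A model trained with this objective is scored only on the positions whose target is not `None`.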

Cross-lingual inductive transfer
Multilingual contextual word embeddings retain a partial level of alignment, as observed by Cao et al. (2020). In XLM and XLM-based embeddings, multiple languages are trained together with a sentence encoder, leading to a higher degree of similarity in the alignment of corresponding words in different languages. Consequently, XLM word embeddings have been shown to perform competitively in the SemEval'17 cross-lingual word similarity task (Lample and Conneau, 2019). Thus, XLM achieves a higher degree of alignment while learning cross-lingual representations.
OffensEval 2020 provides a multilingual dataset for offensive language detection in five different languages, allowing one language's learning to aid another. Therefore, cross-lingual inductive transfer learning using XLM-R with relatively aligned embeddings allows for the inductive transfer of linguistic features among different languages. Figure 1 illustrates our chain-like model for detecting offensive language in all five languages, trained in a sequential manner. While fine-tuning on each language, we validate the model only on the language currently being fine-tuned.
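The chain-like procedure described above can be sketched as a short, framework-agnostic loop. This is an illustrative outline under stated assumptions, not the authors' implementation: `train_fn` and `eval_fn` are hypothetical callables standing in for the real fine-tuning and validation steps, and `split_train_validation` is a hypothetical helper. The language order and the 90-10 split are the ones reported in the Experimental Setting section.

```python
import random
from typing import Any, Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, int]  # (tweet text, label), with 1 = offensive

# Fine-tuning order as reported in the Experimental Setting section.
LANGUAGE_ORDER = ["English", "Turkish", "Greek", "Arabic", "Danish"]


def split_train_validation(
    examples: Sequence[Example], validation_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle one language's data and split it (90-10 by default)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def sequential_finetune(
    model: Any,
    data_by_language: Dict[str, Sequence[Example]],
    train_fn: Callable[[Any, List[Example]], Any],   # hypothetical trainer
    eval_fn: Callable[[Any, List[Example]], Any],    # hypothetical validator
    order: Sequence[str] = tuple(LANGUAGE_ORDER),
) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Fine-tune one shared model on each language in turn.

    The model state is carried from one language to the next, and
    validation uses only the language currently being fine-tuned.
    """
    history: List[Tuple[str, Any]] = []
    for language in order:
        train_set, val_set = split_train_validation(data_by_language[language])
        model = train_fn(model, train_set)
        history.append((language, eval_fn(model, val_set)))
    return model, history
```

Because the same model object is passed through every step, whatever is learned on an earlier language remains available when later languages are fine-tuned.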

Experiments and Results
In this section, we present details on the multilingual OffensEval dataset, experimental settings for the validation experiments, and specifications of the system runs.

Dataset
In this subsection, we provide a statistical analysis of the multilingual Offensive Language Identification Dataset (mOLID). It comprises tweets in five different languages: English, Turkish (Çöltekin, 2020), Greek (Pitenis et al., 2020), Arabic (Mubarak et al., 2020), and Danish (Sigurbergsson and. The tweets in English are annotated using a fine-grained three-layer annotation scheme, whereas the other four languages are annotated using a coarse-grained annotation scheme. For English, the three-level hierarchical annotation schema distinguishes whether the language is offensive or not, the type of offensive language, and the target.
Table 1 presents a statistical analysis of the dataset.
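The three-level hierarchical schema used for English can be pictured as a small data structure. The sketch below is illustrative only: the tag strings (`OFF`/`NOT`, `TIN`/`UNT`, `IND`/`GRP`/`OTH`) are assumptions recalled from the OLID description rather than taken from this paper, and the `HierarchicalLabel` class and its validation rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed tag sets, for illustration only.
LEVEL_A = {"OFF", "NOT"}         # offensive vs. not offensive
LEVEL_B = {"TIN", "UNT"}         # targeted vs. untargeted offense
LEVEL_C = {"IND", "GRP", "OTH"}  # individual, group, or other target


@dataclass(frozen=True)
class HierarchicalLabel:
    level_a: str
    level_b: Optional[str] = None  # only meaningful for offensive tweets
    level_c: Optional[str] = None  # only meaningful for targeted offenses

    def __post_init__(self) -> None:
        if self.level_a not in LEVEL_A:
            raise ValueError(f"unknown level A tag: {self.level_a}")
        if self.level_b is not None:
            if self.level_a != "OFF" or self.level_b not in LEVEL_B:
                raise ValueError("level B applies only to offensive tweets")
        if self.level_c is not None:
            if self.level_b != "TIN" or self.level_c not in LEVEL_C:
                raise ValueError("level C applies only to targeted offenses")

    def coarse(self) -> int:
        """Collapse to the binary label used for the multilingual sub-task."""
        return 1 if self.level_a == "OFF" else 0
```

Collapsing to `coarse()` puts the English annotations on the same footing as the coarse-grained labels of the other four languages.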

Experimental Setting
In this subsection, we outline the experimental setup for the task and present the results obtained on both the validation dataset and the blind test set. For our experiments, we used XLM-R having 16, 550M parameters and a 250K vocabulary size. For validation, we train and evaluate our model using a 90-10 train-validation split for the five languages in the following order: English, Turkish, Greek, Arabic, and Danish. For every language, we finetune XLM-R with a learning rate of 1 × 10^-5 for 2 epochs, with a maximum sequence length of 50 and a batch size of 32. We evaluate all the models on the following metrics for binary classification: F1, Precision, Recall, and Accuracy.
Table 3 shows the performance of our proposed model on the test dataset held out by the organisers for all five languages. The organisers used the Macro-F1 score to evaluate model performance on the test data. We observe that the F1-score varies from language to language, from 0.919 for English to 0.759 for Danish.
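For reference, the evaluation metrics named in this subsection can be computed from gold and predicted labels as in the following minimal sketch. The function names `binary_metrics` and `macro_f1` are hypothetical, and the code uses the standard textbook definitions; it is not the organisers' scoring script.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Precision, recall, F1 (for the `positive` class) and accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def macro_f1(
    y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int] = (0, 1)
) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = [binary_metrics(y_true, y_pred, positive=label)["f1"] for label in labels]
    return sum(scores) / len(scores)
```

Macro-F1 weights the offensive and non-offensive classes equally, so it is less sensitive to class imbalance than accuracy.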

Discussion
XLM-R is an unsupervised cross-lingual representation pretrained using transformers at a very large scale across 100 languages. It can be fine-tuned for different downstream tasks in multiple languages, allowing one to take advantage of cross-lingual transfer learning. Like other transformer-based contextual word embeddings, XLM-R is highly scalable, and fine-tuning it requires minimal training effort and resources.
Our proposed model, which extends XLM-R, was trained on five different languages for the detection of offensive language. Because XLM-R is pretrained on one hundred languages, our model can be extended directly to other commonly used languages. Our chain-like model exploits what it learns from one language to detect offensive language in another, unrelated language.
We further performed experiments on detecting offensive language in a language that was not among those we trained on, without any supervision in that language. This form of learning, known as zero-shot learning, has been demonstrated in recent related works (Artetxe and Schwenk, 2018; Lample and Conneau, 2019). To evaluate how well our model performs zero-shot learning, we detect offensive language in German text using our final cross-lingual model. We perform the experiments in the same experimental setting as Subtask 1 of the GermEval Shared Task on the Identification of Offensive Language (Wiegand et al., 2018), which entails coarse-grained binary classification of offensive language. Our experiments, reported in Table 4, show results competitive with the best-performing model by Paraschiv and Cercel (2019), which used supervised training data. Thus our model, despite not using supervised data in German, achieves competitive performance, with a difference of 1.3% in Accuracy and an increase in Precision of 4.06%.
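The zero-shot setting described above amounts to running the already fine-tuned model on a language it was never fine-tuned on, with no further parameter updates. A minimal sketch follows, assuming a hypothetical `predict_fn` that maps a text to a binary label; the `zero_shot_evaluate` helper is also hypothetical, and the precision it reports is for the offensive class only, which may differ from how the shared task aggregates its scores.

```python
from typing import Callable, Dict, FrozenSet, Sequence, Tuple

# Languages used for fine-tuning in this work.
TRAINED_LANGUAGES: FrozenSet[str] = frozenset(
    {"English", "Turkish", "Greek", "Arabic", "Danish"}
)


def zero_shot_evaluate(
    predict_fn: Callable[[str], int],        # hypothetical: text -> 0 or 1
    examples: Sequence[Tuple[str, int]],     # (text, gold label) pairs
    language: str,
    trained_languages: FrozenSet[str] = TRAINED_LANGUAGES,
) -> Dict[str, float]:
    """Score predictions on an unseen language; no training step is run."""
    if language in trained_languages:
        raise ValueError(f"{language} was used for fine-tuning; not zero-shot")
    if not examples:
        raise ValueError("no examples to evaluate")
    predictions = [predict_fn(text) for text, _ in examples]
    gold = [label for _, label in examples]
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    predicted_positive = sum(1 for p in predictions if p == 1)
    true_positive = sum(1 for p, g in zip(predictions, gold) if p == 1 and g == 1)
    return {
        "accuracy": correct / len(gold),
        "precision": true_positive / predicted_positive if predicted_positive else 0.0,
    }
```

The guard on `language` makes the defining property explicit: the target language must be absent from the set used for fine-tuning.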

Conclusion
This work presents a cross-lingual inductive transfer learning approach to detect offensive language in tweets across five languages: English, Turkish, Greek, Arabic, and Danish. Our proposed model performs competitively for all five languages. We further performed experiments to show that our model performs well in a zero-shot setting, which can be useful for low-resource languages. Our proposed model can be easily extended to other languages with minimal training effort and resources, performing competitively even with significantly imbalanced data. Future work may involve including multiple unrelated languages to enable a universal offensive language detection model and applying the proposed approach to other cross-lingual tasks.