HOMADOS at SemEval-2021 Task 6: Multi-Task Learning for Propaganda Detection

Among the tasks motivated by the proliferation of misinformation, propaganda detection is particularly challenging due to the deficit of fine-grained manual annotations required to train machine learning models. Here we show how data from other related tasks, including credibility assessment, can be leveraged in a multi-task learning (MTL) framework to accelerate the training process. To that end, we design a BERT-based model with multiple output layers, train it in several MTL scenarios and evaluate it against the SemEval gold standard.


Introduction
Fine-grained propaganda detection is a new approach to tackling online misinformation, highlighting instances of propaganda techniques on the word level. These techniques are used in textual communication in order to encourage certain beliefs, but instead of straightforward presentation of arguments, they rely on psychological manipulation, logical fallacies or emotion elicitation.
There are general-purpose natural language processing (NLP) methods that could be used for automatic detection of such text fragments. The challenge here is that they require large amounts of training data, which are laborious to produce. However, propaganda techniques are often related to other misinformation challenges, for which large datasets do exist, e.g. credibility assessment or fake news detection.
In the present study we investigate how this connection can be exploited in the multi-task learning (MTL) framework. We show how the performance of multi-label token-level propaganda detection within shared task 6 at SemEval-2021 can be improved by building neural architectures that are also trained to solve other tasks: single-label propaganda detection from SemEval-2020 and document-level credibility assessment based on a fake news corpus. We compare different MTL scenarios (parallel and sequential) and show which aspects of the model benefit the most from this approach.

Problem Statement
We participate in SemEval-2021 Task 6 ("Detection of Persuasion Techniques in Texts and Images"), subtask 2 (Dimitrov et al., 2021), where the goal is to identify all propaganda techniques within a provided fragment of text. Specifically, given a character sequence c_0, c_1, ..., c_N, we aim to produce a set of annotations {(b_0, e_0, t_0), (b_1, e_1, t_1), ..., (b_k, e_k, t_k)}, where each triple consists of the character offsets of the span it covers (0 ≤ b_i < e_i ≤ N) and an indication of which one of the set of 20 known techniques is detected there (t_i ∈ T). We can see this as a multi-label sequence classification task (Read et al., 2009), in which each character (or token) can be assigned from 0 to 20 labels.
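To make the annotation format concrete, the triple representation can be expanded into per-character label sets as in the following sketch (the text and spans below are invented for illustration and are not taken from the corpus):

```python
# Expand span annotations (begin, end, technique) into per-character
# label sets, illustrating the multi-label nature of the task:
# a character may be covered by several overlapping spans at once.

def spans_to_char_labels(text, spans):
    """Return, for each character, the set of techniques covering it."""
    labels = [set() for _ in text]
    for begin, end, technique in spans:  # 0 <= begin < end <= len(text)
        for i in range(begin, end):
            labels[i].add(technique)
    return labels

# Hypothetical example, not from the S21 data:
text = "Crooked politicians always lie."
spans = [(0, 7, "Loaded_Language"),         # "Crooked"
         (0, 19, "Name_Calling,Labeling"),  # "Crooked politicians"
         (20, 30, "Exaggeration")]          # "always lie"

labels = spans_to_char_labels(text, spans)
print(labels[0])   # first character carries two labels simultaneously
print(labels[19])  # the space between spans carries none
```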

Related Work
Propaganda has long been observed in text, but the problem of automatically detecting such techniques was posed only recently. Initially, a lack of word-level datasets confined the analysis to document-level classification, e.g. based on stylometric features (Rashkin et al., 2017; Barrón-Cedeño et al., 2019). Classification on the word level became possible with the dataset (Da San Martino et al., 2019b) released for the "Fine-Grained Propaganda Detection" shared task at the NLP4IF 2019 workshop (Da San Martino et al., 2019a). The corpus includes 550 news articles annotated with propaganda techniques on the word level. Among the submissions, the best performing models were based on word embeddings and pretrained language models, such as BERT (Devlin et al., 2018). To tackle the data sparsity problem, participants employed various over-sampling methods or trained their models on auxiliary data. Similar objectives were pursued at SemEval-2020 Task 11, consisting of two subtasks: a binary sequence tagging task and a multi-class classification task. The majority of the task's participants based their solutions on the Transformer architecture (Vaswani et al., 2017) and word embeddings, combining them with other neural architectures (e.g. CNNs or LSTMs) or models such as CRFs and logistic regression. Ensemble models were able to achieve satisfactory results when performing both tasks jointly.

Data Description
We make use of three datasets in English. In all the following approaches, the main focus is on the corpus released for SemEval-2021 Task 6 (Dimitrov et al., 2021) (S21). Additionally, we utilise the corpora from SemEval-2020 Task 11 (S20) (Da San Martino et al., 2020) and from news credibility research (FN) (Przybyła, 2020). S21 consists of the text of 870 memes (607, 63 and 200 in the training, development and test subsets, respectively), annotated with 1550 spans a few words long (40 characters on average), each assigned one of 20 propaganda techniques. The most commonly occurring techniques are Loaded language (35%), Name Calling/Labelling (19%) and Smears (12%). The S20 corpus consists of 446 press articles (371 and 75 in the training and development subsets, respectively), annotated with 14 propaganda categories on the word level. Among the 7192 annotated spans, Loaded language (34%), Name Calling/Labelling (17%) and Repetition (12%) are the most common categories. Given that very few spans overlap (8%), we represent the task as single-label classification by merging these spans according to their order in the corresponding label files. Finally, we exclude sentences that do not contain any propaganda annotations.
To obtain the FN data, from the original corpus of 103,219 news articles classified as either credible or non-credible based on their source, we randomly select a sample of fifty thousand sentences with a binary credibility label.

Multi-Task Architecture
Figure 1 shows the architecture designed to fulfil the MTL objectives. A text document (usually one sentence) is first processed by BERT, resulting in 768-dimensional vectors: h_i for the i-th token and h_0 for the whole document, obtained from the [CLS] token. These vectors are processed by classification modules D_x, each consisting of a linear dense layer and a softmax activation function. Three types of such operations are considered:
• the document-level representation is used to produce a 2-dimensional score vector (d_0), indicating class probabilities in binary single-label classification,
• the token-level representation is used to produce a k-dimensional score vector (s_i), indicating class probabilities in multi-class single-label classification,
• the token-level representation is used to produce an l×2-dimensional score vector (m_i), indicating class probabilities in multi-class multi-label classification.
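The three head types can be sketched in plain Python, as a dependency-free stand-in for the actual TensorFlow dense layers (the dimensions follow the paper, with k = 16 for S20 and l = 20 for S21; the weights here are random dummies, not trained parameters):

```python
import math
import random

random.seed(0)
HIDDEN = 768  # BERT-Base hidden size

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def dense(h, n_out):
    """Stand-in for a trained linear layer: random weights, zero bias."""
    w = [[random.gauss(0, 0.02) for _ in range(n_out)] for _ in h]
    return [sum(h[i] * w[i][j] for i in range(len(h))) for j in range(n_out)]

h0 = [random.gauss(0, 1) for _ in range(HIDDEN)]  # [CLS] document vector
hi = [random.gauss(0, 1) for _ in range(HIDDEN)]  # one token vector

d0 = softmax(dense(h0, 2))    # document-level binary scores (FN)
si = softmax(dense(hi, 16))   # token-level single-label scores (S20)
mi = [softmax(dense(hi, 2)) for _ in range(20)]  # 20 binary rows (S21)
```

Each of d_0, s_i and every row of m_i is a probability distribution, since the softmax is applied per head (and, for m_i, per technique independently).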
The following subsections describe several scenarios of using these three output types to improve the accuracy of propaganda detection.

Single Task
In the primary method we use BERT-Base-Uncased with the token-level multi-label classification layer, trained using only the S21 data (SINGLE S21). The output for the i-th token, denoted by m_i, is a 20 × 2 matrix, in which the j-th row reflects the probability of the j-th propaganda technique being present in this token. If the token does not participate in any propaganda technique, the first column of the matrix is filled with ones and the second with zeros. Since the S21 corpus is annotated at the character level, during preprocessing we map the initial annotations onto tokens obtained via WordPiece tokenisation.
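A minimal sketch of this character-to-token mapping is given below. For simplicity it uses a naive whitespace tokenizer with character offsets as a stand-in for WordPiece; the real preprocessing would use the BERT tokenizer's subword offsets, but the overlap logic is the same:

```python
def tokenize_with_offsets(text):
    """Naive whitespace tokenizer returning (token, begin, end) triples;
    a simplified stand-in for WordPiece tokenisation with offset tracking."""
    tokens, i = [], 0
    for tok in text.split():
        begin = text.index(tok, i)
        end = begin + len(tok)
        tokens.append((tok, begin, end))
        i = end
    return tokens

def char_spans_to_token_labels(text, spans, techniques):
    """Build one {0, 1} entry per technique for every token: entry j is 1
    if the token's character range overlaps any span of technique j."""
    tokens = tokenize_with_offsets(text)
    labels = []
    for _, begin, end in tokens:
        row = [0] * len(techniques)
        for b, e, t in spans:
            if b < end and begin < e:  # character-range overlap
                row[techniques.index(t)] = 1
        labels.append(row)
    return tokens, labels

# Hypothetical example, not from the corpus:
techniques = ["Loaded_Language", "Smears"]
text = "They smeared honest critics"
spans = [(5, 12, "Smears")]  # the word "smeared"
toks, labs = char_spans_to_token_labels(text, spans, techniques)
```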

Sequential Multi-Task Learning
In the sequential MTL case, the main training described in the previous section is preceded by training on one of two auxiliary tasks:
• the single-label classification task on the S20 corpus (MULTI-S S20-S21),
• the document-level classification task on the FN corpus (MULTI-S FN-S21).
For MULTI-S S20-S21, we employ the token-level single-label classification layer to produce a 16-dimensional vector s_i. This allows us to classify each token in the S20 corpus into one of 16 categories (14 propaganda techniques + non-propaganda + padding). MULTI-S FN-S21 uses the document-level single-label classification output layer (d_0) to classify sentences from the FN corpus as coming from credible (d_0 = (0, 1)) or non-credible (d_0 = (1, 0)) articles.
For each model the learning procedure is the same: first, during the auxiliary task, the additional classification layer is trained using cross-entropy loss and the auxiliary data. In the second phase, the training continues as the regular task on the S21 data, as described in the previous section. The weights of all trainable variables are updated in both phases.
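The two-phase procedure can be summarised by the following control flow. The model, data and optimiser steps are stubbed out with a log of (phase, batch) pairs; only the phase ordering is the point of the sketch:

```python
def sequential_mtl(aux_batches, main_batches, aux_epochs, main_epochs, log):
    """Phase 1: train on the auxiliary task (S20 or FN head);
    phase 2: continue training on the S21 multi-label head.
    In the real model, all trainable weights (including BERT's)
    are updated in both phases."""
    for _ in range(aux_epochs):
        for batch in aux_batches:
            log.append(("aux", batch))   # cross-entropy on the auxiliary head
    for _ in range(main_epochs):
        for batch in main_batches:
            log.append(("main", batch))  # cross-entropy on the S21 head

log = []
sequential_mtl(["s20_b1", "s20_b2"], ["s21_b1"],
               aux_epochs=2, main_epochs=3, log=log)
```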

Parallel Multi-Task Learning
In the parallel MTL setting, the auxiliary task and the target task are learnt jointly. Similarly to sequential MTL, we devise two models, each consisting of BERT with two separate classification layers on top:
• single-label and multi-label classification on the S20 and S21 corpora (MULTI-P S20-S21),
• document-level and multi-label classification on the FN and S21 corpora (MULTI-P FN-S21).
Every batch consists of sentences coming from both datasets: four sentences from S20 or FN and four sentences from S21 are sent through their corresponding classification layers to produce outputs, and the weights are then updated based on the respective losses.
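The batch composition can be sketched as follows (a simplified illustration with dummy sentence identifiers; in the real setup each tagged half is routed to its own classification head and the per-head cross-entropies are combined):

```python
def mixed_batches(aux_sents, main_sents, n_aux=4, n_main=4):
    """Build parallel-MTL batches: n_aux auxiliary (S20 or FN) sentences
    plus n_main S21 sentences, each tagged with the head that should
    process it. Leftover sentences that do not fill a batch are dropped."""
    n = min(len(aux_sents) // n_aux, len(main_sents) // n_main)
    batches = []
    for i in range(n):
        aux = aux_sents[i * n_aux:(i + 1) * n_aux]
        main = main_sents[i * n_main:(i + 1) * n_main]
        batches.append([("aux", s) for s in aux] +
                       [("main", s) for s in main])
    return batches

# Dummy identifiers standing in for real sentences:
batches = mixed_batches([f"s20_{i}" for i in range(8)],
                        [f"s21_{i}" for i in range(8)])
```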

Experimental setup
We train our models according to the multi-task scenarios and use the development subset of S21 to choose the optimal number of training epochs in the final phase. The model trained up to this point is applied to the test data to produce the final predictions. In the case of the sequential MTL scenarios, this is preceded by training on the additional corpora: on S20 for 10 epochs or on FN for 1 epoch. In the case of the parallel multi-task scenarios, the difference in training set sizes requires a special approach. For S20-S21, one epoch of training covers the whole of S21 and 1/9 of S20. For FN-S21, we choose a balanced subsample of 18 thousand sentences, and each training epoch covers the whole of S21 and 1/30 of this subsample.
We use cross-entropy as the loss function and compute it only for non-padding tokens. For all experiments we use a maximum sequence length of 210 tokens and the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3 × 10^-5 and a batch size of four sentences. We use L1 regularisation (Ng, 2004) with α = 0.01. During fine-tuning of the model, the weights of all trainable variables, including those in BERT, are updated. During inference, we translate token-level labels back to character-level labels, including the spaces and punctuation marks between adjacent tokens with identical labels. All experiments are conducted within the TensorFlow framework.
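The inference-time conversion back to character spans can be sketched as follows (a simplified version of the merging rule: runs of adjacent tokens carrying the same technique are fused into one span, absorbing the characters between them):

```python
def tokens_to_char_spans(tokens, labels):
    """Merge runs of adjacent tokens carrying the same technique into
    character-level spans, absorbing the spaces and punctuation between
    them. `tokens` is a list of (begin, end) character offsets; `labels`
    gives the set of predicted techniques for each token."""
    spans = []
    for t in sorted({l for s in labels for l in s}):
        start = None
        for (b, e), labs in zip(tokens, labels):
            if t in labs:
                if start is None:
                    start = b
                end = e
            elif start is not None:
                spans.append((start, end, t))
                start = None
        if start is not None:
            spans.append((start, end, t))
    return spans

# Hypothetical token offsets and predictions: two adjacent tokens
# labelled "Smears" are merged across the space at characters 4-5.
tokens = [(0, 4), (5, 12), (13, 19)]
labels = [{"Smears"}, {"Smears"}, set()]
print(tokens_to_char_spans(tokens, labels))  # [(0, 12, 'Smears')]
```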

Evaluation measures
To evaluate our results we use the character-level F1 measure prepared for the shared task (Dimitrov et al., 2021). It compares the model's output with the gold annotations, accounting for the imbalance of categories and for partial overlaps between fragments with the same label.

Results
Table 1 shows the performance of the considered approaches on the development and test sets. The highest F1 score on the development set was obtained by the MULTI-P S20-S21 model; hence, this model was used to generate the predictions on the test set submitted to the shared task. However, the single-task approach is not far behind on the development set and actually provides the best performance on the test set. The differences between the approaches are relatively modest and no single model outperforms the others on every set and metric. This is mostly due to the small size of the propaganda datasets. Specifically, choosing the approach and the number of training epochs based on the development set, which contains just 63 documents, may lead to overfitting.
In order to better understand how the introduction of MTL influences the models, we perform additional experiments. Firstly, in Table 2 we show the F1 score for the recognition of each technique in the single-task and sequential MTL scenarios, using both auxiliary datasets. One could expect the use of S20, annotated with a similar set of propaganda techniques, to improve performance on the overlapping labels, but the data do not confirm this. For example, the performance on the relatively large (12.7%) Smears (Smr) category improves noticeably, even though it was not present in S20. At the same time, we see an F1 drop for some techniques present in both datasets, such as Appeal to authority (AtA) or Slogans (Slg). Clearly, the language constructions covered by these labels in press articles and in memes are too different for MTL to offer a clear advantage. The relationship with fake news detection is even weaker, resulting in many techniques not being recognised at all.

Table 2: Per-technique F1 score on the test set for different auxiliary datasets: S20 propaganda (S20) and fake news (FN), used in the sequential multi-task scenario (techniques with no performance differences are omitted for brevity).

Secondly, in Figure 2 we show how the F1 score on the test set changes during training on S21 for the single-task configuration and the two scenarios based on S20 data: sequential and parallel. As expected, pre-training allows the model to reach good performance much faster, e.g. F1 = 0.4 after 7 epochs instead of 14. After longer training, however, the single-task approach catches up and, beyond the 20th epoch, when all versions reach stable results, it outperforms the MTL variants.

Figure 2: F1 scores during training for the single-task approach and for multi-task learning using S20 data.

Conclusion
In this work we explore how the detection of propaganda techniques in the text of memes can be facilitated using external data in a multi-task learning framework. The results show that the auxiliary tasks indeed influence the results, both by accelerating the learning process and by changing the set of recognised techniques. Nevertheless, these modifications do not offer clear advantages over the basic BERT-based solution.
We hypothesise that this is because the link between the main and auxiliary tasks is not strong enough to deliver benefits through multi-task learning. Additionally, propaganda is rarely a straightforward phenomenon, and different techniques may require tailored approaches. We treat this effort as a preliminary study and aim to further investigate the relevance of MTL to propaganda detection by extending the portfolio of auxiliary tasks with corpora reflecting other related issues, such as hate speech or hyperpartisan language.