FPAI at SemEval-2021 Task 6: BERT-MRC for Propaganda Techniques Detection

The objective of subtask 2 of SemEval-2021 Task 6 is to identify the propaganda techniques used in a text together with the span(s) of text covered by each technique. This paper describes the system and models we developed for the task. We first propose a pipeline system that identifies spans and then classifies the technique for each span, but it struggles to handle overlapping and nested spans. We therefore formulate the task as question answering within a machine reading comprehension (MRC) framework, which achieves a better result than the pipeline method. Moreover, we explore data augmentation and loss design techniques to alleviate the problems of data sparsity and imbalance. Finally, we attain 3rd place in the final evaluation phase.


Introduction
Fake news detection has attracted the attention of many researchers, but most conventional fake news detection methods focus on long-form journalism, and little attention has been paid to propaganda techniques. Memes have become one of the most popular types of content on social media platforms, and they can easily mislead the audience into agreeing with the speaker through propaganda techniques. SemEval-2021 Task 6 focuses on detecting the use of rhetorical and psychological techniques in memes without (subtask 1, subtask 2) and with (subtask 3) visual content.
We first adopted a model that identifies the span and techniques sequentially and made many attempts to optimize it, but the results were not satisfactory. We then tried an end-to-end method, the MRC framework. We found that the pipeline model cannot handle the span overlapping issue effectively, whereas the MRC framework with an informative query performs well in this scenario. In addition, data augmentation and a carefully designed loss can alleviate the impact of data sparsity and imbalance. We attain an F1 score of 0.3974 and 3rd place in the final evaluation phase.
* These authors contributed equally to this work. This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

Related Work
There was also a related previous shared task on fine-grained propaganda detection (Da San Martino et al., 2019), where the participants used Transformer-style models, LSTMs, and ensembles (Fadel et al., 2019; Hou and Chen, 2019; Hua, 2019). Some approaches further used non-contextualized word embeddings, e.g., based on FastText and GloVe (Gupta et al., 2019; Al-Omari et al., 2019), or handcrafted features such as LIWC, quotes, and questions (Alhindi et al., 2019). Moreover, Martino et al. (2020) analysed computational propaganda detection from both a text and a network perspective, and argued for the need for combined efforts blending Natural Language Processing, Network Analysis, and Machine Learning.

Pipeline model for Span and Technique Detection
Inspired by the method proposed for the NER task by Chernyavskiy et al. (2020), we construct a pipeline model with RoBERTa (Liu et al., 2019) as the backbone to identify spans and techniques in the input sequence. Figure 1 depicts our proposed pipeline model.

Span Identification
We treat span identification as a binary sequence tagging task. The span identification model is fed with chunks of sentences encoded with Byte-Pair Encoding (BPE); a Conditional Random Field (CRF) layer (Lafferty et al., 2001) receives the logits for each input token and makes a BIO prediction for the entire input sequence, from which the spans in the input sequence are finally extracted.
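The last step, turning the CRF's BIO predictions into spans, can be sketched as follows. This is an illustrative helper, not the authors' code; it assumes plain "B"/"I"/"O" tags and returns token-level spans with exclusive end indices.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end) token spans (end exclusive)."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                 # a new span begins
            if start is not None:      # close any span still open
                spans.append((start, i))
            start = i
        elif tag == "O":               # outside any span
            if start is not None:
                spans.append((start, i))
                start = None
        # tag == "I": the current span continues
    if start is not None:              # close a span running to the end
        spans.append((start, len(tags)))
    return spans
```

For example, `bio_to_spans(["O", "B", "I", "O", "B"])` yields the spans `(1, 3)` and `(4, 5)`.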

Technique Classification
We take the result of the span identification model as part of the input, where <context> is the sentence from which the span is extracted. The input of the softmax layer includes three parts: (i) the context embedding, extracted from the last two layers of the RoBERTa model; (ii) the span embedding, the average of the span token embeddings; and (iii) the span length embedding, constructed from the length of the span, as different propaganda techniques differ significantly in span length. Finally, the model outputs the technique category for the given input sequence.
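Assembling the three-part classifier input described above can be sketched as below. The function and the lookup-table representation of the length embedding are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def build_classifier_input(context_emb, token_embs, span, length_table):
    """Concatenate the three features fed to the softmax classifier:
    (i) context embedding, (ii) mean of span token embeddings,
    (iii) a learned embedding indexed by span length (clipped to the table)."""
    i, j = span                                    # token indices, end exclusive
    span_emb = token_embs[i:j].mean(axis=0)        # (ii) average over span tokens
    length_emb = length_table[min(j - i, len(length_table) - 1)]  # (iii)
    return np.concatenate([context_emb, span_emb, length_emb])    # (i)+(ii)+(iii)
```

With a 4-dimensional context embedding, 4-dimensional token embeddings, and a 2-dimensional length embedding, the classifier receives a 10-dimensional vector.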
As the pipeline model is incapable of handling overlapping and nested spans, and strongly inspired by Li et al. (2019), we propose to utilize the MRC framework to identify the spans and their corresponding techniques.

Span Detection as MRC
We formulate the task as a question answering task. Each technique is characterized by a query, and spans are extracted by answering these queries given the context. For example, the task of assigning Name calling/Labeling to "CALM DOWN [LITTLE TRUMP HATER]\nI FOUND YOUR BINKY\n" is formalized as answering the question "Find the words and phrases with strong emotional implications (either positive or negative) that influence an audience". This strategy naturally tackles the span overlapping issue in nested spans: detecting different overlapping spans amounts to answering different, independent questions.
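Constructing the model input from a technique query and a context can be sketched as follows. Only the Name calling/Labeling query is quoted from above; the mapping for the remaining techniques and the exact input template are assumptions (in practice a BERT tokenizer inserts the special tokens).

```python
# Hypothetical technique -> query mapping; one entry per technique.
QUERIES = {
    "Name calling/Labeling": (
        "Find the words and phrases with strong emotional implications "
        "(either positive or negative) that influence an audience"
    ),
    # ... analogous queries for the other techniques ...
}

def build_mrc_input(technique, context):
    """Concatenate the informative query and the context with special tokens,
    mirroring the [CLS] query [SEP] context [SEP] layout fed to BERT."""
    query = QUERIES[technique]
    return f"[CLS] {query} [SEP] {context} [SEP]"
```

Answering this query against the meme text then yields the span(s) labeled with that technique.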

BERT Encoder
Given a query q_y, we need to extract the text spans x_{start,end} from the input sequence X = {x_1, x_2, ..., x_n} using the MRC framework. We use BERT as the encoder: the informative query q_y and the input sequence X are concatenated, with the special tokens [CLS] and [SEP] added, and BERT outputs the representation matrix H for the whole input.

START-Predictor
The START-Predictor outputs the probability of each token being the start token of a span, where E_start is the parameter to learn:

P_start = softmax(H · E_start) ∈ R^{n×2}

END-Predictor
The END-Predictor outputs the probability of each token being the end token of a span, where E_end is the parameter to learn:

P_end = softmax(H · E_end) ∈ R^{n×2}

SPAN-Predictor
As there can be more than one span in a given input sequence, we need to predict multiple start and end tokens and decide which start-end pairs enclose a valid continuous token sequence. We obtain the candidate start indexes I_start and end indexes I_end by applying argmax to each row of P_start and P_end. Each i in I_start and j in I_end, subject to the constraint i ≤ j, defines a candidate token sequence x_{i,j}. A binary classifier is used to compute the probability of x_{i,j} being a valid span, where E_span is the parameter to learn:

P_span(i, j) = sigmoid(concat(H_i, H_j) · E_span)
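The start-end matching step can be sketched as below. Here `span_classifier(i, j)` stands in for the binary classifier over the pair of token representations; the function name and threshold are illustrative assumptions.

```python
import numpy as np

def match_spans(p_start, p_end, span_classifier, threshold=0.5):
    """Pair candidate start/end indexes and keep the pairs accepted by the
    binary span classifier. p_start and p_end are (n, 2) probability rows."""
    starts = np.nonzero(np.argmax(p_start, axis=-1))[0]   # I_start: rows where label 1 wins
    ends = np.nonzero(np.argmax(p_end, axis=-1))[0]       # I_end
    spans = []
    for i in starts:
        for j in ends:
            if i <= j and span_classifier(i, j) > threshold:  # constraint i <= j
                spans.append((int(i), int(j)))
    return spans
```

With starts at positions {1, 3} and ends at {1, 2}, only the pairs (1, 1) and (1, 2) satisfy i ≤ j and are passed to the classifier.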

Train
During the training stage, each input sequence X is paired with two label sequences Y_start and Y_end, representing the ground-truth label of each token x_i being a start index or an end index, and Y_{start,end} marks the ground-truth spans. The loss for each predictor is the cross-entropy between its predictions and the corresponding labels:

L_start = CE(P_start, Y_start), L_end = CE(P_end, Y_end), L_span = CE(P_span, Y_{start,end})

The model is fine-tuned end-to-end by minimizing the weighted sum of the three losses:

L = α L_start + β L_end + γ L_span

Predict
For every example to be predicted, we duplicate the example 20 times and pair each copy with one technique description as model input. The model outputs the spans detected for each copy, and the techniques can be read off from the copies for which spans were detected.
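The prediction routine can be sketched as follows. `detect_spans` is a stand-in for running the trained BERT-MRC model on one (query, example) pair; the function names and the output format are assumptions for illustration.

```python
def predict(example, techniques, detect_spans):
    """Pair the example with every technique query and collect the
    (technique, span) pairs for queries that yield at least one span."""
    predictions = []
    for technique in techniques:            # one duplicated copy per technique
        spans = detect_spans(technique, example)
        for span in spans:                  # map detected spans back to the technique
            predictions.append({"technique": technique, "span": span})
    return predictions
```

A copy whose query yields no span simply contributes nothing, so examples without any technique produce an empty prediction list.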

Experiment and Result
We use the provided training dataset for training and evaluate the model on the development dataset during the evaluation phase. We conducted several experiments to explore the effect of different methods; the details of each experiment follow.

Data augmentation
As there are only 290 records in the training dataset, we explored two data augmentation methods, both based on the PTC corpus 1 , to increase the amount of relevant data. Enlarging the training dataset achieved the better result.

Two-Stage Fine-Tune
We first train a best-performing model with BERT as the backbone on the PTC corpus, then fine-tune the trained model on the training dataset. Table 1 shows the result of this method.

Enlarge Training Dataset
We first train a best-performing model with BERT as the backbone on the training dataset and use it to annotate the PTC corpus automatically. We then train a second model on the training dataset together with the PTC samples filtered according to the annotated labels. Table 1 shows the result of this method.

Focal Loss
By setting γ ≥ 0 in Eq. 12, the contribution of easy samples is down-weighted in the loss function, enabling the model to focus more on harder samples during training (Lin et al., 2018).
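A minimal binary form of the focal loss can be sketched as below; the helper is illustrative, not the exact loss used in our system, which applies the same modulating factor to the predictor outputs.

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss (Lin et al., 2018). With gamma = 0 it reduces to
    standard cross-entropy; larger gamma down-weights easy samples."""
    p_t = p if y == 1 else 1.0 - p        # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

For an easy positive (p = 0.9), setting gamma = 2 scales the cross-entropy term by (1 − 0.9)² = 0.01, so the sample contributes far less to the total loss.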

Asymmetric Loss
The asymmetric loss can perform hard thresholding of very easy samples, i.e., it fully discards negative samples when their probability is low enough (Ben-Baruch et al., 2020): the shifted probability p_m = max(p − m, 0) replaces p for negative samples, where the probability margin m ≥ 0 is a tunable hyperparameter.
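A minimal binary form of the asymmetric loss can be sketched as below; the default focusing parameters are the illustrative values from Ben-Baruch et al. (2020), not necessarily our tuned setting.

```python
import math

def asymmetric_loss(p, y, gamma_pos=1.0, gamma_neg=4.0, m=0.05):
    """Asymmetric loss (Ben-Baruch et al., 2020). The probability margin m
    shifts negative predictions down, so very easy negatives (p <= m)
    contribute exactly zero loss."""
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    p_m = max(p - m, 0.0)                 # hard-threshold easy negatives
    return -(p_m ** gamma_neg) * math.log(1.0 - p_m)
```

A negative sample predicted at p = 0.03 falls below the margin m = 0.05, so its loss is exactly zero rather than merely small.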

BERT-MRC Span Detection
In the training dataset, some examples do not contain any technique. We duplicate each example 20 times and pair the copies with different technique descriptions as model input. Table 3 shows the results of different sampling strategies. When sampling 4 negative examples, the model BERT-MRC neg=4 DA achieves the best result. The performance decreases when more negative examples are sampled, which may be caused by introducing too much noise into the model. The model is trained with a learning rate of 5e-5, batch size 8, max sequence length 128, and bert-base-cased as the backbone.
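The negative sampling strategy can be sketched as below: every technique actually present in an example is kept as a positive query, and n_neg of the remaining techniques are sampled as negative queries. The helper is an illustrative reconstruction, not our training code.

```python
import random

def sample_queries(example_techniques, all_techniques, n_neg=4, rng=random):
    """Keep every positive technique query and sample n_neg negative
    queries per training example (the neg=4 setting performed best)."""
    positives = [t for t in all_techniques if t in example_techniques]
    negatives = [t for t in all_techniques if t not in example_techniques]
    return positives + rng.sample(negatives, min(n_neg, len(negatives)))
```

With one positive technique and n_neg = 4, each example contributes 5 (query, example) training pairs instead of 20.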

Conclusion
We created a pipeline model and an end-to-end MRC model to identify spans and techniques. We attain an F1 score of 0.3974 and 3rd place in the final evaluation phase with the BERT-MRC neg=4 DA model. Our experiments show that 1) reformulating the task as an MRC task is effective for detecting overlapping spans; 2) data augmentation improves model generalization; and 3) careful loss design can alleviate the effect of data imbalance.