LeCun at SemEval-2021 Task 6: Detecting Persuasion Techniques in Text Using Ensembled Pretrained Transformers and Data Augmentation

We developed a system for Task 6 subtask 1, detecting propaganda techniques in the textual content of memes. An external dataset and an augmented dataset were used to extend the official competition dataset; data augmentation techniques were applied to the external and competition datasets to produce the augmented dataset. We trained five transformers (one DeBERTa and four RoBERTa models) and ensembled them to make the prediction. One RoBERTa model was initially trained on the augmented dataset for a few epochs and then fine-tuned on the competition dataset, which improved the micro-F1 score by up to 0.1. Another RoBERTa model was initially trained on the external dataset merged with the augmented dataset for a few epochs and then fine-tuned on the competition dataset. Furthermore, we ensembled the initial models with the models obtained after fine-tuning. For the final model in the ensemble, we trained a DeBERTa model on the augmented dataset without fine-tuning it on the competition dataset. Finally, we averaged the outputs of all models in the ensemble to make the prediction.


Introduction
The definition of memes has been changing constantly since the term was first conceived, but memes eventually received an academic definition as the "Internet Meme". As Davison (2012) puts it, an Internet Meme can roughly be defined as "a piece of culture, typically a joke, which gains influence through online transmission". What makes Internet memes unique is the speed of their transmission and the fidelity of their form. The Internet meme can therefore act as a powerful medium for persuasion techniques that preach an ideology or way of thinking (Moody-Ramirez and Church, 2019). On the other hand, the term "propaganda" is defined as a form of communication that employs persuasive strategies and attempts to achieve a response that furthers the desired intent of the propagandist (Jowett and O'donnell, 2018). With the rise of social media, a new form of propaganda arose, called "computational propaganda". The authors in (Woolley and Howard, 2017) defined computational propaganda as "the use of algorithms, automation, and human curation to purposefully distribute misleading information over social media networks".
Task 6 at SemEval-2021 (Dimitrov et al., 2021), detection of persuasion techniques in texts and images, defined three subtasks. The first two subtasks deal with the textual content of memes and ask the participants to identify which of 20 propaganda techniques appear in the text, while the third subtask asks the participants to determine which of 22 techniques appear in the meme's textual and visual content. This paper proposes a solution for subtask 1 that uses pre-trained language models to detect propaganda and identify the persuasion strategies that a propaganda sample employs.
The rest of the paper is organized as follows. Section 2 discusses work related to the task of propaganda identification. Section 3 describes the data and the pre-processing techniques used. Section 4 describes the proposed system and its architecture. Section 5 presents the system analysis. Finally, the conclusion and future work are provided in Section 6.

Related Work
There have been efforts in persuasion technique identification and classification using machine learning and deep learning approaches. The authors in (Al-Omari et al., 2019) used word embeddings with BERT (Devlin et al., 2019) and a BiLSTM (Schuster and Paliwal, 1997) for binary detection of propaganda spans. The authors in (Altiti et al., 2020) experimented with a CNN (LeCun et al., 1999), a BiLSTM, and BERT, and showed BERT to have the best accuracy in classifying persuasion techniques in propaganda spans. The authors in (Jurkiewicz et al., 2020) used a RoBERTa model (Liu et al., 2019) with a class-dependent re-weighting method and a semi-supervised self-training technique, and demonstrated the effects of these techniques in an ablation study. A group of researchers (Morio et al., 2020) experimented with a variety of pre-trained language models (PLMs), including BERT, GPT-2 (Radford et al., 2019), RoBERTa, XLM-RoBERTa, XLNet (Yang et al., 2019), and XLM (Conneau and Lample, 2019), and demonstrated that RoBERTa and XLNet generally perform better for propaganda detection.

Data Description
In this section, we describe the data, the task, and the preprocessing steps.

Data
The dataset used in our experiments was provided by SemEval-2021 Task 6 (Dimitrov et al., 2021). This dataset, the "Competition dataset", consists of short text samples extracted from memes. We also used an external dataset (Da San Martino et al., 2019) comprised of news articles with annotated propaganda spans, the "External dataset". To use the External dataset effectively, we needed to cut the news articles down to lengths closer to the texts in the Competition dataset, keeping only the text fragment that contains the propaganda span and the corresponding label representing the propaganda technique in that fragment.
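The fragment-extraction step can be sketched as follows. This is a minimal illustration, assuming span annotations of the form (start, end, technique); the field names, the padding policy, and the example span are hypothetical, not the exact format of the external corpus.

```python
# Sketch of cutting a news article down to meme-length labelled fragments,
# given character-offset span annotations. The `context` padding policy
# is an illustrative assumption, not the paper's exact procedure.

def extract_fragments(article, spans, context=20):
    """Return short {text, label} fragments around each annotated span."""
    fragments = []
    for start, end, technique in spans:
        lo = max(0, start - context)          # keep a little surrounding context
        hi = min(len(article), end + context)
        fragments.append({"text": article[lo:hi], "label": technique})
    return fragments

article = "He said it plainly: they are the enemy of the people and must be stopped."
spans = [(33, 52, "Name calling/Labeling")]
print(extract_fragments(article, spans, context=10))
```

Each fragment is then a short, meme-like training sample paired with a single technique label.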

Data Preprocessing
Our data preprocessing pipeline consists of two components, 1) Data cleaning 2) Data augmentation. In this section, we will describe the techniques we used in each component.

Data Cleaning
To increase performance, several data pre-processing techniques were tested. We experimented with typical pre-processing techniques such as stop-word removal, which removes commonly used words (such as "the", "a", "an", "in") to eliminate noise that may otherwise hinder the model's ability to learn from and predict sequences. We also experimented with stemming, the process of reducing inflected words (e.g., connect, connected, connection) to their root form (e.g., connect). The specific stemming algorithm used is Porter's algorithm (Porter, 1980). (See https://github.com/jasonwei20/eda_nlp for the data augmentation code.)
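A minimal sketch of the two cleaning steps is shown below. The stop-word list is an illustrative subset, and the suffix stripper is a toy stand-in for Porter's algorithm (in practice one would use an existing implementation such as `nltk.stem.PorterStemmer`).

```python
# Toy sketch of stop-word removal and stemming. STOP_WORDS is a tiny
# illustrative subset; toy_stem is NOT Porter's algorithm, only a crude
# suffix stripper standing in for it.

STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is"}

def remove_stop_words(text):
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def toy_stem(word):
    # Strips a few common suffixes; real Porter stemming has many more rules.
    for suffix in ("ation", "ion", "ed", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = "The model is detecting propaganda in the memes"
cleaned = remove_stop_words(text)
stemmed = " ".join(toy_stem(w) for w in cleaned.split())
print(cleaned)   # model detecting propaganda memes
print(stemmed)   # model detect propaganda meme
```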

Data Augmentation
We also experimented with data augmentation (Wei and Zou, 2019), the process of using the original data to produce more data and thereby increase the dataset size. Data augmentation has proven useful when dealing with small datasets. Although the technique is more prevalent in computer vision tasks, there are versions specifically tailored to text data, as described in (Wei and Zou, 2019). These techniques include Synonym Replacement, Random Insertion, Random Swap, Random Deletion, and Back-translation. Table 1 shows examples of data generated using these augmentations, produced with the "Easy Data Augmentation" (EDA) library (Wei and Zou, 2019).
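Three of the EDA operations can be sketched as follows. This is a simplified illustration: the tiny synonym table is made up for the example, whereas EDA itself draws synonyms from WordNet.

```python
import random

# Sketch of three EDA operations (Wei and Zou, 2019): random swap,
# random deletion, and synonym replacement. The SYNONYMS table is
# invented for illustration; EDA uses WordNet.

SYNONYMS = {"paper": ["theme"], "system": ["arrangement"]}

def random_swap(words, n, rng):
    words = words[:]
    for _ in range(n):                        # swap two random positions, n times
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]        # never return an empty sentence

def synonym_replacement(words, rng):
    return [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]

rng = random.Random(0)
words = "This paper will describe our system".split()
print(random_swap(words, 2, rng))
print(random_deletion(words, 0.1, rng))
print(synonym_replacement(words, rng))
```

With a rate parameter of 0.1, roughly one word in ten is altered or deleted per operation, which matches the kind of lightly perturbed samples shown in Table 1.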
Back-translation was applied only to the Competition dataset, and the other four techniques only to the External dataset. For each sentence in the External dataset, Synonym Replacement, Random Insertion, Random Swap, and Random Deletion were each applied at a rate of 0.1, and this was done nine times per sample. For Back-translation, the AWS translation API was used to translate the text from English to Dutch and back to English, and also from English to Dutch to Russian and back to English, generating two additional samples per sample. The Competition dataset has a size of 487. After merging the External dataset and the Competition dataset, we ended up with a dataset of size 18,571, referred to as the "Competition + External dataset". After applying data augmentation to the Competition + External dataset, we ended up with a dataset of size 52,966, referred to as the "Augmented dataset".

System Description
Different model architectures were experimented with using different pre-processing techniques. The final system ensembles five models, each trained with a different approach and pre-processing.

Original: This paper will describe our system in detecting propaganda in memes
Synonym Replacement: This theme will describe our arrangement in detecting propaganda in memes
Random Insertion: This key out paper will describe our system meme in detecting propaganda in memes
Random Swap: This paper in describe our system in detecting propaganda will memes
Random Deletion: This paper will describe system in detecting propaganda in memes
Back-translation: This document describes our Memorandum Propaganda Detection System

Table 1: Generated samples using data augmentation

We used two training approaches. The first is typical fine-tuning. The second consists of two iterations: in the first iteration, the model is trained on the pre-processed dataset; in the second iteration, the model from the first iteration is fine-tuned exclusively on the Competition dataset. Figure 1 demonstrates the second approach.
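The two training approaches can be sketched as the following schedule. Here `train` is only a stand-in for an actual fine-tuning loop (e.g., with the Hugging Face `Trainer`), and the epoch counts are illustrative.

```python
# Sketch of the two training approaches. `train` is a placeholder for a
# real fine-tuning loop; here it only records which dataset the model
# saw and for how many epochs, to make the schedule explicit.

def train(model, dataset_name, epochs):
    model["history"].append((dataset_name, epochs))
    return model

def approach_one(model):
    # Typical fine-tuning: train directly on one pre-processed dataset.
    return train(model, "competition", epochs=4)

def approach_two(model):
    # Iteration 1: train on the pre-processed (augmented) dataset.
    model = train(model, "augmented", epochs=4)
    # Iteration 2: fine-tune that model exclusively on the Competition dataset.
    return train(model, "competition", epochs=2)

model = approach_two({"history": []})
print(model["history"])   # [('augmented', 4), ('competition', 2)]
```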

Proposed System
The system is an ensemble of five classifiers: one DeBERTa-large classifier and four RoBERTa-large classifiers, each trained with a different approach and pre-processing. For the DeBERTa-large classifier, the Augmented dataset was used with stop words removed and text lowercased; we trained it with the first approach for six epochs, and it achieved a micro-F1 of 0.554 on the development set. Of the four RoBERTa-large classifiers, the first was trained on the Competition and External datasets without augmentation; we dropped samples that have no propaganda technique and trained the model with the first approach for four epochs, achieving a micro-F1 of 0.550. The second RoBERTa classifier uses the same pre-processing as the first but is trained with the second approach; it achieved a micro-F1 of 0.602 on the development set. The third RoBERTa classifier was trained on the Augmented dataset with stop words retained, using the first approach for four epochs; it achieved a micro-F1 of 0.54. The fourth RoBERTa classifier is the same as the third but fine-tuned on the Competition dataset, achieving a micro-F1 of 0.62. Table 2 summarizes the performance of the classifiers in LeCun's ensemble model.
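The final averaging step can be sketched as follows: each model emits per-technique probabilities (multi-label, e.g., after a sigmoid), the probabilities are averaged, and the average is thresholded. The shapes, threshold, and toy numbers are illustrative assumptions.

```python
import numpy as np

# Sketch of the averaging ensemble over the five classifiers' outputs.
# The 0.5 threshold and toy probability matrices are illustrative.

def ensemble_predict(prob_matrices, threshold=0.5):
    """prob_matrices: list of (n_samples, n_techniques) probability arrays."""
    avg = np.mean(prob_matrices, axis=0)      # average over models
    return (avg >= threshold).astype(int)     # multi-label decision per technique

# Toy outputs from three "models" for 2 samples x 3 techniques.
p1 = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]])
p2 = np.array([[0.8, 0.1, 0.4], [0.2, 0.7, 0.5]])
p3 = np.array([[0.7, 0.3, 0.2], [0.3, 0.9, 0.3]])
print(ensemble_predict([p1, p2, p3]))   # [[1 0 0]
                                        #  [0 1 0]]
```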

Ensemble Analysis
This section analyzes the different combinations of models that led to the final proposed system. In the second approach, we noticed that fine-tuning the classifiers from the first iteration on the Competition dataset always boosted performance by up to 0.1 additional micro-F1 on the development set. However, when it came to the ensemble, it turned out that an ensemble built only from classifiers trained with the second approach does not increase the overall performance, and sometimes decreases it. The ensemble experiments also showed that dropping classifier (5) isn't optimal: in the final ensemble model (G), we noticed that the overall score would decrease if we removed any one of its models. For example, in ensemble (F), classifier (5) was dropped, and both micro- and macro-F1 decreased.

Error Analysis
This section examines the ensemble model's weaknesses to give insight into how to improve its performance. We generated the confusion matrix and per-class scores for the test set (see Appendix). In the confusion matrices, the "None" class indicates that either the model predicted an incorrect class (not in the ground-truth label set) or it failed to predict a correct class (in the ground-truth label set but not in the predicted label set). It is worth noting that correctly classified "None" cases (not in the predicted label set and not in the ground-truth label set) are not included in the evaluation confusion matrix.
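The "None" convention described above can be tallied from the predicted and ground-truth label sets as in the sketch below (rows are true labels, columns are predicted labels). The class names in the example are illustrative.

```python
from collections import Counter

# Sketch of building the multi-label confusion entries with a "None"
# class: a gold label the model missed counts as (label, "None"), and a
# spurious prediction counts as ("None", label). Class names are
# illustrative examples, not the task's full label set.

def confusion_with_none(gold_sets, pred_sets):
    counts = Counter()
    for gold, pred in zip(gold_sets, pred_sets):
        for label in gold & pred:
            counts[(label, label)] += 1      # correctly predicted label
        for label in gold - pred:
            counts[(label, "None")] += 1     # missed gold label
        for label in pred - gold:
            counts[("None", label)] += 1     # spurious prediction
    return counts

gold = [{"Smears", "Loaded Language"}, {"Slogans"}]
pred = [{"Smears"}, {"Slogans", "Loaded Language"}]
print(confusion_with_none(gold, pred))
```

True negatives (a label absent from both sets) are deliberately not counted, matching the evaluation confusion matrix described above.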
We noticed that the model performs poorly at detecting the classes in the last column of the test-set confusion matrix (see Figure A2). One possible explanation is that the model is trained on data containing many samples without propaganda (label count = 0; see Figure A1). In addition, the label matrix is sparse (zero is the dominant value in each one-hot vector). One possible solution is to remove samples with zero labels and rely on the sparsity of the output vectors to detect samples without propaganda. Another possible solution is to train a two-stage model, where the first stage filters out non-propaganda samples and the second stage classifies the propaganda samples.
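The proposed two-stage design can be sketched as follows. Both stages are stand-ins (simple callables) for actual fine-tuned transformer models; the toy filter rule and label are invented for the example.

```python
# Sketch of the proposed two-stage model: a binary propaganda filter
# followed by a multi-label technique classifier. The stand-in models
# below are toy rules, not trained classifiers.

def two_stage_predict(text, is_propaganda, classify_techniques):
    # Stage 1: filter out non-propaganda samples early.
    if not is_propaganda(text):
        return set()
    # Stage 2: classify techniques only for samples that pass the filter.
    return classify_techniques(text)

filter_model = lambda t: "enemy" in t                    # toy stage-1 rule
technique_model = lambda t: {"Name calling/Labeling"}    # toy stage-2 output

print(two_stage_predict("they are the enemy", filter_model, technique_model))
print(two_stage_predict("nice weather today", filter_model, technique_model))
```

The benefit of this split is that the stage-2 classifier never sees zero-label samples, so its label matrix is far less sparse.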

Conclusion and Future Work
In this paper, we presented our proposed system, LeCun, for detecting propaganda in textual content. We used an external dataset from a previous SemEval competition and performed data augmentation on it to expand the dataset size. We also investigated different ensemble combinations of state-of-the-art pre-trained language models. Throughout our participation in this competition, several questions arose that we are curious to investigate: What is the influence of data augmentation on model performance? What is the influence of using an external dataset? How can the model's weaknesses be addressed? How can span identification help improve the score of technique classification? For future work, we will work on answering these questions. We plan to experiment in more depth with different augmentation techniques and different model architectures. We will also investigate the influence of the external dataset by training models on the competition dataset and the external dataset separately and comparing the results.

Appendix
Table 1: Classification report of the submitted system on the test set.
Figure A1: Label-count-per-sample distribution of the Competition + External dataset.
Figure A2: Confusion matrix of the submitted system on the test set; the entry in the i-th row and j-th column indicates the number of samples whose true label is the i-th class and whose predicted label is the j-th class.