SkoltechNLP at SemEval-2020 Task 11: Exploring Unsupervised Text Augmentation for Propaganda Detection

This paper presents a solution for the Span Identification (SI) task in the "Detection of Propaganda Techniques in News Articles" competition at SemEval-2020. The goal of the SI task is to identify specific fragments of each article that contain the use of at least one propaganda technique. This is a binary sequence tagging task. We tested several approaches, eventually selecting a fine-tuned BERT model as our baseline. Our main contribution is an investigation of several unsupervised data augmentation techniques based on distributional semantics that expand the original small training dataset, applied to this BERT-based sequence tagger. We explore various expansion strategies and show that they can substantially shift the balance between precision and recall while maintaining comparable levels of the F1 score.

Related Work
Yoosuf and Yang (2019) proposed a solution for the Fragment Level Classification (FLC) task in the Fine-Grained Propaganda Detection competition at the NLP4IF'19 workshop. The participants had a task similar to that in the "Detection of Propaganda Techniques in News Articles" competition of SemEval-2020: to detect text fragments with propaganda. The difference is that the markup was at the level of whole sentences. As a result, the authors solved the problem of assigning each sentence to one of 19 classes (no propaganda or one of 18 types of propaganda). To solve this problem, they used a model based on the BERT language model with a linear classification head for token classification. They also tried several techniques to overcome the lack of data and the class imbalance: 1) weighting rarer classes higher; 2) sampling propaganda sentences with a higher probability than non-propaganda sentences. Ek and Ghanimifard (2019) describe a solution for the same competition at the NLP4IF'19 workshop. As a classification model they use a BiLSTM. In addition to the model development, the authors investigate different augmentation techniques for balancing the classes. They used the synthetic minority over-sampling algorithm (Chawla et al., 2002) to generate token embeddings for the minority classes in the dataset. They used three models for contextual embeddings: ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and GROVER (Zellers et al., 2019). Of these models, ELMo showed the best overall F1 score across classes in the FLC task; however, for individual classes, the best model varies.
Since in the previous competition the participants with successful solutions focused on pre-trained contextualized models, we also decided to focus our approach on such models, BERT in particular. Moreover, as data augmentation has been applied in previous work on propaganda detection (Ek and Ghanimifard, 2019) and has also shown significant results in other fields such as computer vision (Krizhevsky et al., 2012), it seems promising to continue research in this direction.

Method
We approach the problem as a Named Entity Recognition (NER) problem with two classes: inside and outside of a propaganda span. Since models for such a task usually perform token classification while the markup was provided at the character level, we first apply a preprocessing step that converts the character-level markup into token-level markup. At the post-processing step, we apply the reverse transformation of the markup. The pipeline of our final solution is presented in Figure 1.

Model
During the competition, we conducted experiments with different models. The first one was the BiLSTM-CNN-CRF model (Ma and Hovy, 2016) as implemented by Chernodub et al. (2019). This is a commonly used approach for NER and sequence labelling tasks: it uses both word-level and character-level embeddings that are fed to a BiLSTM-CNN-CRF module. The second one, denoted BERT-Linear, relies on a linear classifier on top of BERT-based token representations. The implementation of this sequence tagger is based on the BertForTokenClassification class from the transformers library 2, as done by Shelmanov et al. (2019). Finally, we also tried a BERT-CRF model 3: after the BERT classifier, a Viterbi decoder is used for better approximation of the tag sequence. Our final submission is based on the second model architecture, as it yielded overall better results, as described below.
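To make the BERT-Linear architecture concrete, the following minimal sketch shows a linear head mapping per-token hidden states to inside/outside tags. The random vectors stand in for BERT token representations, and the dimensions and label indices are illustrative assumptions, not the actual trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for BERT outputs: one 768-dim hidden state per token.
seq_len, hidden, num_labels = 6, 768, 2  # labels: 0 = Outside, 1 = Inside
hidden_states = rng.normal(size=(seq_len, hidden))

# The linear classification head: a single affine layer applied per token.
W = rng.normal(scale=0.02, size=(hidden, num_labels))
b = np.zeros(num_labels)

logits = hidden_states @ W + b          # shape: (seq_len, num_labels)
predictions = logits.argmax(axis=-1)    # one inside/outside tag per token

print(predictions.shape)  # (6,)
```

In the actual system this head sits on top of the transformer and is fine-tuned jointly with it via BertForTokenClassification.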

Preprocessing
Our solution performs token-level classification, while the data labels are at the character level. Standard tokenization libraries did not work for us: we noticed that during the reverse transformation from token-level to character-level markup, an index shift occurred. Therefore, we developed our own method for conversion to the correct markup. Notable features of our preprocessing: we did not exclude stop words or special characters (such as quotes), because they are quite often part of propaganda spans; we tried input contexts of different sizes, and a context of 3 sentences turned out to be the most beneficial for our solution.
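An offset-preserving conversion of this kind can be sketched as follows. The whitespace tokenizer and the example sentence with its span are simplifying assumptions (the actual system tokenizes for BERT); the key idea is that keeping each token's character offsets makes the mapping reversible without index shift:

```python
import re

def char_spans_to_token_labels(text, spans):
    """Convert character-level propaganda spans to per-token 0/1 labels,
    keeping each token's character offsets so the mapping is reversible."""
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\S+", text)]
    labels = [int(any(s < end and start < e for s, e in spans))
              for _, start, end in tokens]
    return tokens, labels

def token_labels_to_char_spans(tokens, labels):
    """Reverse transformation: merge consecutive positively-tagged tokens
    back into character-level spans."""
    spans, current = [], None
    for (tok, start, end), lab in zip(tokens, labels):
        if lab:
            current = [start, end] if current is None else [current[0], end]
        elif current is not None:
            spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans

text = "Obama is catastrophically weak on crime"
spans = [(9, 30)]  # hypothetical span: "catastrophically weak"
tokens, labels = char_spans_to_token_labels(text, spans)
print(labels)                                      # [0, 0, 1, 1, 0, 0]
print(token_labels_to_char_spans(tokens, labels))  # [(9, 30)]
```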

Implementation
We trained our models on Nvidia RTX 2080 Ti graphics cards. Our best solution was based on the BERT-Base, Cased model. We used the Adam optimizer with a learning rate of 3 · 10 −5 . We tuned such hyperparameters as the number of epochs, batch size, and maximum sequence length with the Facebook Ax 4 library. For our best solution we chose 7 epochs, a batch size of 16, and a sequence length of 120.
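The tuning loop can be sketched as a simple random search over the same three hyperparameters. The search ranges and the mock objective below are illustrative assumptions; the actual tuning used the Ax library against dev-set F1:

```python
import random

random.seed(0)

# Search space for the three tuned hyperparameters (ranges are assumptions).
space = {
    "epochs": [3, 5, 7, 10],
    "batch_size": [8, 16, 32],
    "max_seq_len": [64, 120, 256],
}

def evaluate(config):
    """Mock objective standing in for dev-set F1 of a fine-tuned tagger."""
    return random.random()

best_config, best_f1 = None, -1.0
for _ in range(20):  # 20 random trials
    config = {name: random.choice(values) for name, values in space.items()}
    f1 = evaluate(config)
    if f1 > best_f1:
        best_config, best_f1 = config, f1

print(best_config)
```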
As we were only provided with the train set, we trained our models with 3-fold cross-validation.

Results
Submission results are presented in Table 1. Unfortunately, the amount of data was not enough for the BiLSTM-CNN-CRF model: although the solution based on this model overcame the baseline, it showed the worst result. BERT-CRF showed the best Precision but lost a few points in Recall. BERT-Linear achieved the best Recall as well as the best F1 score, outperforming the baseline by a large margin. The disadvantage of this model was that it did not combine neighboring words that could have formed a single propaganda-related phrase, which, in theory, BERT-CRF should have done. One of the approaches we considered was to artificially merge nearby words into a single span. However, the best solution came without such post-processing.
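Such a merging step could look like the sketch below; the gap threshold of 2 characters is an illustrative assumption, and, as noted above, our best submission did not use this post-processing:

```python
def merge_nearby_spans(spans, max_gap=2):
    """Merge character-level spans separated by at most `max_gap` characters."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend the previous span
        else:
            merged.append([start, end])
    return [tuple(s) for s in merged]

print(merge_nearby_spans([(10, 15), (16, 22), (40, 45)]))
# → [(10, 22), (40, 45)]
```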

Data Augmentation
We hypothesized that the relatively low results obtained by the baseline model could be due to two reasons: (i) on the one hand, the semantics of the phenomenon at hand is complex, and (ii) on the other hand, the training dataset is too small even for fine-tuning. Therefore, we decided to perform automatic augmentation of the provided dataset to try to reach better generalization and more stable training of the baseline model.

Hypothesis
The hypothesis tested in our experiments was the following: increasing the number of articles with different data augmentation techniques will help to achieve better generalization of the model due to more diverse training examples.

Method
We focused on the model that gave us the best F1 score on the dev set leaderboard. We tried several strategies for dataset expansion, illustrated in Table 2:

Original: Even though the number of those infected has dropped in recent weeks, the plague will never truly be gone.

GloVe
  n: Even though the several of those infected has dropped in recent ago, the outbreak will never truly be gone.
  n, adj: Even though the one of those infected has dropped in earlier last, the pneumonic will never truly be gone.
  n, adj, adv: Even though the other of those infected has dropped in last month, the cholera will ever indeed be gone.
  n, adj, adv, v: Even though the some of those hiv has slipped in earlier days, the bubonic will once quite be nothing.

fastText
  n: Even though the total of those infected has dropped in recent days, the pestilence will never truly be gone.
  n, adj: Even though the amount of those infected has dropped in latest month, the scourge will never truly be gone.
  n, adj, adv: Even though the size of those infected has dropped in last years, the bubonic will seldom hardly be gone.
  n, adj, adv, v: Even though the quantity of those infested has fell in previous hours, the epidemic will rarely fully be went.

BERT
  n: Even though the majority of those infected has dropped in recent years, the disease will never truly be gone.
  n, adj: Even though the fate of those infected has dropped in three decades, the infection will never truly be gone.
  n, adj, adv: Even though the population of those infected has dropped in two months, the virus will now really be gone.
  n, adj, adv, v: Even though the percentage of those dead has been in these times, the epidemic will soon fully be over.

Table 2: Examples of augmented sentences varying by: (1) the word vector model used for synonym search, e.g. GloVe; (2) the word POS used for that substitution, e.g. "n" for noun expansion. Red color denotes replaced nouns, green adjectives, violet adverbs, and blue replaced verbs. Finally, the yellow box denotes the target propaganda span annotation.
• Substitution model. In order to find a replacement for a word, we search for its nearest word vector representations. In this research we decided to investigate GloVe, fastText, and BERT-based substitutions.

• Choice of words to replace. We chose candidates for replacement based on their parts of speech (POS). At the same time, we did not replace stop words or words with a high frequency of occurrence in the language. This was done so as not to replace pronouns, common nouns (everything, nothing), numerals, common adverbs (very), etc. As a result, we combine several strategies for substitution: only nouns (n); nouns and adjectives (n, adj); nouns, adjectives, and adverbs (n, adj, adv); nouns, adjectives, adverbs, and verbs (n, adj, adv, v).
• Classes. We also varied the classes from which we chose words to substitute: only from the propaganda class, only from the neutral class, or from both.
• The dataset increase ratio. Another tested parameter is the output size of the final augmented dataset. We ran experiments with two-fold (x2), five-fold (x5), and ten-fold (x10) augmentation. The increase of the dataset was done as follows: 1) the substitution algorithm was run for a sentence the corresponding number of times; 2) at each run, 70% of all candidate words in the sentence were randomly selected for substitution 5; 3) for each selected word, a replacement was randomly chosen from the top-5 list of its synonyms. These manipulations allowed us to obtain various combinations of substitutions and, accordingly, more diverse contexts in the data.

Table 3: Results of our augmentation strategies on the development set, varying by the following parameters: (1) the number of times the dataset was increased, e.g. "x5" for five-fold expansion; (2) the class that took part in substitution: one can expand words either from the "Propaganda" class, the "Neutral" class, or both of them; (3) the word vector model for synonym search, e.g. GloVe; (4) the POS used for that substitution, e.g. "n" for noun expansion or "adj" for adjectives. The top row shows the result of the baseline BERT-based sequence tagger trained on the original dataset. Bold font denotes improvements with respect to this baseline, while underlined text denotes the best results that outperformed the baseline.
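The three-step expansion procedure above can be sketched as follows. The toy top-5 synonym lists and the noun-only candidate selection are illustrative assumptions; in the actual system, candidates come from nearest-neighbour lookups in GloVe, fastText, or BERT representations:

```python
import random

random.seed(42)

# Toy top-5 synonym lists standing in for nearest-neighbour lookups
# in a word vector model (illustrative assumption).
SYNONYMS = {
    "plague": ["cholera", "epidemic", "outbreak", "pestilence", "disease"],
    "weeks": ["days", "months", "years", "ago", "hours"],
    "number": ["several", "total", "majority", "amount", "quantity"],
}

def augment(tokens, candidates, n_copies=2, ratio=0.7, top_k=5):
    """Produce `n_copies` augmented versions of a token list: at each run,
    `ratio` of the candidate positions are replaced by a random choice
    from the top-`top_k` synonyms of the original word."""
    copies = []
    for _ in range(n_copies):
        chosen = random.sample(candidates, k=round(len(candidates) * ratio))
        new = list(tokens)
        for i in chosen:
            new[i] = random.choice(SYNONYMS[tokens[i]][:top_k])
        copies.append(new)
    return copies

sent = ("Even though the number of those infected has dropped in recent "
        "weeks , the plague will never truly be gone .").split()
candidates = [i for i, tok in enumerate(sent) if tok in SYNONYMS]  # noun positions
for copy in augment(sent, candidates, n_copies=2):
    print(" ".join(copy))
```

Running the substitution several times with random candidate subsets and random top-5 picks is what yields the varied contexts shown in Table 2.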

Results
An example of our sentence augmentation method is presented in Table 2. We can see several quite successful replacements: for instance, the word plague was substituted with the synonyms cholera and epidemic.
Although substitutions are not always accurate in context, in general, the meaning of the sentence is preserved.
As the BERT-Linear model showed the best results on the dev set, we decided to focus on this model in our experiments. The results of dev-set submissions for the BERT-Linear model trained on augmented datasets are presented in Table 3. Expansion of the neutral class allows us to boost Recall, in some cases even without a loss of Precision (e.g. GloVe, x2). Using this strategy we obtained a result better than the selected baseline (F1 = 0.3683). An increase in Recall is also observed when using fastText with all parts of speech replaced. Thus, it is disadvantageous to restrict expansions to nouns only: the majority of improvements occurred with POS combinations. In the case of the Propaganda and Both classes, augmentation improves the Precision of the model, especially when a large number (x5, x10) of expansions is performed. BERT-based expansions show worse results than GloVe and fastText; a possible reason is that sufficient information about word representations is already available in the language model itself.
Therefore, the following conclusion can be made: increasing the dataset with the tested augmentation strategies, unfortunately, did not give a strong improvement in overall model performance. However, some of the applied data expansion methods gave a significant improvement in the Recall metric.

Conclusion
We presented the solution of the "SkoltechNLP" team for the Span Identification task of the SemEval-2020 Task 11 competition. Our final solution is based on the BERT masked language model fine-tuned for the NER-style tagging task, which showed strong performance out of the box with respect to the baseline. In addition, we investigated various strategies for dataset augmentation, evaluating our best model trained on the expanded datasets on the public set. Unfortunately, this approach did not give a significant increase in the F1 score. However, it was shown that the proposed strategies can substantially improve precision if words from the target "Propaganda" class are expanded, and improve recall if substitutions in the neutral class are used to generate new training examples. Therefore, the developed expansion methods could be useful for shifting the "sweet spot" of a classification model between precision and recall while maintaining a similar F1 level.
As future work, a more careful search over augmentation hyperparameters can be considered. For example, in this work we manually selected the dataset increase ratio (x2, x5, x10); however, this parameter could be searched over the natural numbers. The same can be done with the ratio of words selected for substitution: we chose a 70% ratio based on our own observations of the diversity of the resulting contexts, but this number can also vary. Another aspect not covered in this work is class imbalance. Here, expanding the data to balance examples in both classes, as well as redistributing class weights, could be useful for obtaining a more stable model.