3218IR at SemEval-2020 Task 11: Conv1D and Word Embedding in Propaganda Span Identification at News Articles

In this paper, we present the results of our experiments with varying the hyper-parameter values of a one-dimensional Convolutional Neural Network (Conv1D). We describe the system entered by the Information Retrieval Lab team of Universitas Indonesia (3218IR) in SemEval 2020 Task 11 Subtask 1 on propaganda span identification in news articles. Our system uses a combination of Conv1D and GloVe word embeddings to detect propaganda at the text fragment level. The best model obtained an F1 score of 0.24 on the development set and 0.23 on the test set. We show that there is potential for further performance improvement through the use of models with appropriate hyper-parameters.


Introduction
News articles can take many forms of presentation. A news article often tends to influence the reader toward a particular view that follows the author's agenda. The method usually used is to add some bias to the article being written (Baeza-Yates, 2018). On a different level, news outlets that lack neutrality sometimes shape information by purposefully emphasizing positive or negative aspects (Jowett and O'Donnell, 2006). Identifying bias is not easy. Often the writer does not realize that he has included his own bias in his writing. In other cases, the writer deliberately introduces bias into his writing, intending to influence his readers. The latter condition is known as propaganda. Propaganda is a form of opinion or action by individuals or groups, deliberately designed to change the perspectives or actions of other individuals or groups with respect to predetermined ends (Jackall, 1995).
Da San Martino et al. (2020) organized SemEval 2020 Task 11: Detection of Propaganda Techniques in News Articles. Here, propaganda refers to information that is purposefully shaped to foster a predetermined agenda. This research focuses on Subtask 1, span identification, which locates the text fragments of propaganda in news articles.
Propaganda span identification is the problem of detecting text fragments in a sentence or paragraph that contain propaganda techniques. The purpose of this task is to detect the presence of words or phrases that deliberately introduce bias into information, i.e., propaganda. This study proposes a deep learning approach, specifically a one-dimensional Convolutional Neural Network (Conv1D) combined with word embeddings (GloVe), to detect text fragments related to propaganda. Conv1D is a variant of the Convolutional Neural Network (CNN), one of the most widely used deep learning approaches. Convolutional Neural Networks are widely used in computer vision, especially in image classification, where data is processed in two dimensions (Hussain et al., 2018). In text processing, the one-dimensional Conv1D is used instead because text has only one dimension, the word sequence. Conv1D was chosen because it is known to perform well on natural language processing (NLP) tasks such as sentiment analysis and text classification (Kim, 2014; Kuttala et al., 2018), especially the categorization of text fragments (Hughes et al., 2017) and classification at the word and character level (Mo et al., 2018). Word embedding is a language modelling and feature learning technique in NLP that maps words into a low-dimensional continuous space (Li et al., 2015). Word embeddings as input representations are known to increase the performance of NLP tasks: mapping words to low-dimensional vectors allows a more expressive and efficient representation that maintains the contextual similarity between words (Naili et al., 2017). GloVe, or global vectors, is one type of word embedding known to perform very well. GloVe combines global matrix factorization and local context window methods.
Rather than training on the entire sparse matrix or on individual context windows in the corpus, the GloVe model efficiently captures statistical information by training only on the nonzero elements of a word-word co-occurrence matrix, which addresses the problems of computational cost and context constraints (Pennington et al., 2014).
In the related work section, we explain the motivation for using Conv1D and word embeddings (GloVe). In the system overview section, we describe the system used in our submission. In the experimental setup section, we then describe the Conv1D model variants we used to reach the best performance of our classification model. We also report our results on the development and test data, along with a short analysis, in the experiment result and analysis section.

Related Work
Some research has been done on detecting propaganda in news articles. Previous research on the language of news media, in the context of political fact-checking and fake news detection, compared the language of real news with that of satire, hoaxes, and propaganda to find linguistic characteristics of untrustworthy text (Rashkin et al., 2017). Other research proposed a model to automatically assess the level of propagandistic content in an article based on different representations, from writing style and readability level to the presence of specific keywords (Barrón-Cedeño et al., 2019). Yet another study utilized BERT to classify propaganda techniques in news articles (Hua, 2019).
However, propaganda detection in this task tends to be challenging because it has to be done at the level of a text fragment, and such a task was first held as a shared competition at SemEval. In other words, this task still leaves opportunities for performance improvement and for analysis of approaches other than those previously used. Conv1D has been shown in several previous studies to perform very well in the classification of texts at the word and character level (Mo et al., 2018) and also at the text fragment level (Hughes et al., 2017). This is achieved through convolution and pooling at the word, character and text fragment level, which pick out salient features such as tokens or token sequences regardless of their position in the input sequence (Goldberg, 2016). The use of word embeddings as word vector representations has also been shown to perform better than using one-hot vectors directly (Zhang and Wallace, 2017). These facts motivate us to explore the potential of Conv1D combined with word embeddings for text classification, especially at the text fragment level, by utilizing automatic feature extraction in news articles related to propaganda.

Dataset
The dataset used in this task is provided by SemEval 2020 Task 11. It is a corpus of about 550 English news articles in which fragments containing one of 18 propaganda techniques have been annotated. The dataset consists of pairs of files: the first file contains the article text, with each sentence on a separate line, while the second file contains tab-separated labels consisting of the article id and the begin and end offsets that mark the location of word fragments indicating propaganda, as shown in Figure 1.

Methodology
We use a text fragment classification approach as a solution for SemEval 2020 Subtask 1, span identification. The classification involves two classes: text fragments that indicate propaganda and text fragments that do not. Our workflow starts with data preparation, followed by model architecture determination, training and testing. We conducted a series of experiments on the model by changing the hyper-parameter values and then comparing the results to find out which scenario produced the best-performing model. Figure 2 shows the overall process of this research.

Data preparation transforms the dataset into a new form that the classification model can understand as input. For this research, we used several preparation steps, including data loading and data separation. Data loading combines articles with their labels. Data separation splits the text of each article into sentences, each accompanied by a binary label indicating the location of the word fragments related to propaganda. The box in the sentence part shows the words that indicate propaganda in the sentence; the positions of these words are captured in the binary label. This data format is then fed into the model during training to extract features related to propaganda in each sentence. Figure 3 shows a sample of output data from the data preparation process.

Figure 3: Example of training data after preparation
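The conversion from character-offset annotations to per-token binary labels can be sketched as follows. This is a simplified illustration with a hypothetical helper name, not the exact code used in our system:

```python
def make_binary_labels(sentence, sent_start, spans):
    """Convert character-offset propaganda spans into per-token binary labels.

    sentence   -- the sentence text
    sent_start -- character offset of the sentence within the article
    spans      -- list of (begin, end) character offsets of propaganda fragments
    """
    labels = []
    offset = sent_start
    for token in sentence.split():
        # locate the token's absolute character position in the article
        begin = sentence.find(token, offset - sent_start) + sent_start
        end = begin + len(token)
        # label 1 if the token overlaps any annotated propaganda span
        labels.append(int(any(b < end and begin < e for b, e in spans)))
        offset = end
    return labels
```

For example, for a sentence starting at article offset 100 whose annotated span covers its third and fourth words, the function yields a label sequence such as `[0, 0, 1, 1, 0]`.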
In model determination, we modify each layer that we use to determine the best-performing model. We base this model on the Conv1D sequence classification example from the Keras documentation as a baseline. We made a slight modification to this model by adding an embedding layer as the first layer to place the word embeddings as weights in the model. The arrangement of layers is shown in Figure 4.
A series of experiments on hyper-parameter values is conducted to determine the model with the best performance. This model is then used in the testing step.
In this research, we use GloVe as the weights of the embedding layer. The GloVe model we use is pre-trained on a Wikipedia corpus with 6 billion tokens and comes in several dimensional variants, where the dimension is the length of the vector used to represent each word. We use the variant with 200 dimensions, on the assumption that this dimension is already large enough while having a lighter computational cost than the largest 300-dimensional variant.

The convolution layer used in this research is Conv1D. Conv1D creates a convolution kernel that is convolved with the input over a single spatial dimension to produce the output. The number of convolution layers, as well as the filter values and kernel sizes, are tested in the hyper-parameter tuning phase of the experiments. The initial values for these hyper-parameters follow the values of the example application used as the baseline model.

The pooling layer used in this research is MaxPooling1D. This layer performs a pooling operation that takes the maximum value over a single spatial dimension of the data. The pooling layer is usually applied in pairs with the convolution layer. MaxPooling1D has several parameters, such as pool size and strides; different pool size values are used as one of the hyper-parameters.

The dropout layer is used to counter the overfitting that often occurs with deep learning approaches. It randomly drops neuron units in the network to reduce connections at each iteration of the training process. The dense layer functions as a classifier and contains a single neuron with a sigmoid activation function. Classification distinguishes the text fragments that are related to propaganda from those that are not.
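A minimal sketch of this layer arrangement, assuming TensorFlow's Keras API; the vocabulary size, sequence length and layer hyper-parameters are illustrative, and a random matrix stands in for the pre-trained GloVe vectors:

```python
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                     Flatten, MaxPooling1D)

vocab_size, embed_dim, max_len = 10000, 200, 50

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(vocab_size, embed_dim),   # weights replaced by GloVe below
    Conv1D(filters=32, kernel_size=3, activation="relu"),
    MaxPooling1D(pool_size=2),
    Dropout(0.5),                       # reduce overfitting
    Flatten(),
    Dense(1, activation="sigmoid"),     # binary classifier
])

# In practice this matrix holds the pre-trained 200-dimensional GloVe
# vectors; random values stand in for them in this sketch.
glove_matrix = np.random.rand(vocab_size, embed_dim).astype("float32")
model.layers[0].set_weights([glove_matrix])
model.layers[0].trainable = False       # keep the GloVe vectors fixed

model.compile(optimizer="adam", loss="binary_crossentropy")
```

The sigmoid output gives the probability that an input fragment is propagandistic.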
We use macro-averaged F1-scores as the evaluation measure in this set of experiments, following the evaluation measure used by the organizers of this task.
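As a generic illustration of the measure (a plain-Python sketch, not the organizers' official scorer), a macro-averaged F1 over the two classes computes the F1 per class and takes the unweighted mean:

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

Because both classes are weighted equally, this measure penalizes a model that does well only on the majority (non-propaganda) class.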

Experiment Setup and Analysis
This section describes the setup and analysis for the approaches used in this research, including the data, the fixed parameter settings, hyper-parameter tuning, and analysis of the experimental results.

Experiment Setup
The data we use in the training and testing process has been through the loading and preparation steps and is ready to be processed by the model. There are 16,670 sentence-label pairs, which are then split into 13,338 pairs for training and 3,335 pairs for validation.
In this research, we focus our hyper-parameter tuning on the convolution layer. Several hyper-parameters are tested, with values given in a range slightly above the values of the baseline model. These hyper-parameters include the number of convolution layers (convolution and pooling layers as a pair), the number of filters, the kernel size and the pool size. The other parameters are fixed: a learning rate of 0.001, a dropout value of 0.5, a batch size of 32 and 5 epochs, with binary cross-entropy as the loss function and Adam as the optimizer. All scenarios in this experiment were run on the development dataset to determine which scenario produced the best performance. The scenario with the best performance is then used to process the test dataset and produce the output for submission.
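The scenario enumeration can be sketched as a simple grid over the tuned hyper-parameters. The value ranges below are illustrative, not the exact grid used in our experiments:

```python
from itertools import product

# Convolution-related hyper-parameters to tune (illustrative ranges).
n_conv_layers = [1, 2, 3, 4]   # convolution + pooling layers as pairs
n_filters = [32, 64, 128]
kernel_sizes = [3, 5, 7]
pool_sizes = [2, 4]

scenarios = [
    {"layers": l, "filters": f, "kernel": k, "pool": p}
    for l, f, k, p in product(n_conv_layers, n_filters, kernel_sizes, pool_sizes)
]

# Every scenario is trained with the fixed settings: learning rate 0.001,
# dropout 0.5, batch size 32, 5 epochs, binary cross-entropy loss, Adam.
```

Each scenario dictionary would parameterize one model build, and the scenario with the best development-set F1 is kept.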

Experiment Result and Analysis
Based on the experiments, the scenario with the best performance is shown in Table 1, which also contains the results of several other scenarios, including the baseline.
Sce 16, the best-performing scenario, uses 3 convolution layers with the same number of filters in each layer and variations in kernel size and pool size. This scenario shows an increasing pattern in the kernel size and a decreasing pattern in the pool size from the initial layer to the next. The subsequent scenario variations, which use 3 convolution layers with different numbers of filters, show a decrease in the F1-score achieved. The same happens in scenarios where the number of convolution layers exceeds 3. When the scenarios are grouped by the number of convolution layers used, the best-performing scenarios almost all share the same pattern: an increasing number of filters, an increasing kernel size and a decreasing pool size for each additional layer. Because this pattern does not hold for Sce 16, with its 3 convolution layers, we conclude that the pattern used in Sce 16 is the most optimal one within the experimental arrangements of this research. Finally, the best-performing scenario on the development set is applied to the test set, whose outputs are used in our submission to SemEval 2020 Subtask 1. From this submission, the F1-score obtained on the test set is 0.2347, ranking 32nd out of 36 participants; the highest F1-score was 0.5155.

Conclusions
Based on the results obtained in this research, we conclude that Conv1D combined with word embeddings can be used as a model for the propaganda span identification problem. The best results are produced by scenarios that use several convolution layers with the same number of filters but different kernel sizes and pool sizes in each layer. The best scenario also shows an increasing pattern in the kernel size and a decreasing pattern in the pool size from the initial layer to the next. This pattern can serve as a consideration for hyper-parameter tuning of the convolution layer.
Furthermore, the limited combination of hyper-parameter values used in this research leaves open the possibility of achieving better performance with combinations that have not yet been tested. The parameters whose values were held fixed may likewise be hiding potential for improved performance.