CASE 2021 Task 2 Socio-political Fine-grained Event Classification using Fine-tuned RoBERTa Document Embeddings

We present our submission to Task 2 of the Socio-political and Crisis Events Detection Shared Task at the CASE @ ACL-IJCNLP 2021 workshop. The task targets the fine-grained classification of socio-political events. Our best model was a fine-tuned RoBERTa transformer model using document embeddings. The corpus consisted of a balanced selection of sub-events extracted from the ACLED event dataset. We achieved a macro F-score of 0.923 and a micro F-score of 0.932 during our preliminary experiments on a held-out test set. The same model also performed best on the shared task test data (weighted F-score = 0.83). To analyze the results, we calculated the topic compactness of the commonly misclassified events and conducted an error analysis.


Introduction
Event detection and classification as Natural Language Processing (NLP) tasks can be used to analyze data gathered in the information space. The findings of this analysis can then be connected to events in the physical world and contribute to situational awareness, particularly when they are related to socio-political events. The sheer amount of data generated and stored in the information space every day means that strategies need to be developed to process this data efficiently and effectively. Given these large amounts of data, deep learning strategies are often preferred; however, time and computational constraints may play a role in deciding how to extract and analyze the data.
Task 2 in the Socio-political and Crisis Events Detection Shared Task at the CASE @ ACL-IJCNLP 2021 workshop aims at the fine-grained classification of events (Haneczok et al., 2021). The task is based on data extracted from the Armed Conflict Location & Event Data (ACLED) database (Raleigh et al., 2010). It consists of socio-political events that have been annotated based on the ACLED event taxonomy, which includes 6 event types and 25 event subtypes. The aim of the task is to label event snippets using a model trained on data from the ACLED dataset, in order to see how robust models are when presented with data that is not directly covered by ACLED or contains unseen event classes. The results presented in this paper pertain only to subtask 1, where the task is the classification of 25 different event subtypes with ACLED-compliant labels; in other words, all the classes are seen classes from the ACLED dataset. The second and third subtasks are zero-shot learning tasks that contain unseen classes. This paper proceeds by first describing the data collection process in section 3. Section 4 contains the system description, and the following section contains the experimental results. Section 6 provides an overview of the results based on the test data provided by the organizers. Finally, in section 7 the error analysis provides insight into the system results and the data.

Related Work
Previous research in event detection and classification shows that there are numerous approaches to solve the problem of detecting events in texts. Xiang and Wang (2019) give a coherent overview of suitable strategies, starting with earlier approaches like pattern matching, and describing methods of machine learning as well as deep learning. There have been a number of shared tasks that have taken place in previous years that contribute to research conducted in this area. Specifically, the shared tasks CLEF 2019 Protest News (Hürriyetoglu et al., 2019), AESPEN 2020 (Hürriyetoglu et al., 2020), and CASE 2021 (task 1) (Hürriyetoglu et al., 2021) focus on event detection at both the sentence and document level, as well as event co-reference resolution.
Currently, not much research has been conducted that further analyzes event data once the events have been identified; there are only a handful of studies across different domains. Peng et al. (2019) achieve state-of-the-art results in detecting and classifying social event data with a Pairwise Popularity Graph Convolutional Network (PP-GCN) and an external knowledge base. Nugent et al. (2017) compare different supervised classification methods for detecting a range of different events, and achieve good results with Support Vector Machines (SVM) and Convolutional Neural Networks (CNN). A benchmark corpus for fine-grained political event classification was created by the organizers of this task, and an initial exploration and classification of the data is reported on in  and . Those findings show that BERT transformer models achieved micro F1 scores of 0.943-0.949 and macro F1 scores of 0.860-0.889; simpler TF-IDF-weighted character n-gram models also achieved good results. A large dataset of 600,000 annotated ACLED event snippets was used as training data.

Data collection
For copyright reasons, the data used in this paper was collected directly from the ACLED website (https://acleddata.com/data-export-tool/). To create the corpus, all data from each available region was downloaded and then filtered using the following steps.
Firstly, all events with fewer than 25 or more than 1000 tokens were removed. The next step was to balance the corpus across the 25 different fine-grained event classes. Originally, the largest class accounted for 36.69% of the events, compared to 0.001% for the smallest. To create a more balanced version of the corpus, we extracted a sample of events per class, keeping the smallest classes in full and extracting only a percentage of the largest classes. Note that it was not possible to fully balance all of the classes, as there was only a very small amount of data for classes such as CHEM WEAP. A random sample of this balanced corpus was then split into a train (n=94000), development (n=9000), and test (n=2500) corpus, each preserving the balanced class distribution. We observed that randomizing the order of events was crucial to avoid introducing a bias based on the different ACLED regions. Figure 2 illustrates the distribution of the corpus. A more detailed table can be found in appendix A.
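The capping and splitting procedure described above can be sketched as follows. This is an illustrative reconstruction, not our exact pipeline: the per-class cap and the split fractions below are hypothetical parameters, and events are assumed to be dicts with a subtype field.

```python
import random
from collections import defaultdict

def balance_and_split(events, cap=5000, seed=42):
    """Cap each subtype at `cap` events, shuffle globally to avoid
    region-order bias, and split into train/dev/test portions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ev in events:
        by_class[ev["subtype"]].append(ev)
    balanced = []
    for subtype, items in by_class.items():
        rng.shuffle(items)
        balanced.extend(items[:cap])  # small classes are kept in full
    rng.shuffle(balanced)  # crucial: randomize event order across regions
    n = len(balanced)
    train = balanced[: int(0.89 * n)]
    dev = balanced[int(0.89 * n): int(0.975 * n)]
    test = balanced[int(0.975 * n):]
    return train, dev, test
```

Fully balancing is impossible when a class has fewer than `cap` examples, which is why rare classes such as CHEM WEAP remain underrepresented.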
In a further step, we created three different versions of the original corpus. The first version, referred to as ACLED N, contains the original text from the ACLED download, an example of which can be found below.
{text: CPI(M) activists attacked a BJP rally in Hrishyamukh on 18 January 2018., subtype: FORCE_AGAINST_PROTEST}
Based on the results presented by , where the BERT transformer model performed slightly better on the corpus with less pre-processing, we decided to include a version with little to no pre-processing. In ACLED L, we replaced all locations in the text with the generic token 'LOC', using the Flair Named Entity Recognition (NER) tagger (Akbik et al., 2018). The third version, ACLED T, contains a pre-processed version of the original text without any time stamps: all dates and times were removed and replaced with 'TIME'. These two alternative versions of the corpus were created to analyze whether information specific to one particular event or set of events would transfer to the classification of other events.
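The masking step for ACLED L can be sketched as a span-replacement function. The sketch below operates on predicted character spans (such as those a NER tagger like Flair produces via sentence.get_spans('ner')) rather than calling the tagger itself; the function name and span format are our own illustration.

```python
def mask_spans(text, spans, label="LOC", token="LOC"):
    """Replace every predicted span of type `label` with a generic token.
    `spans` is a list of (start_char, end_char, tag) triples, e.g. derived
    from the character offsets of a NER tagger's predicted entity spans."""
    out = []
    cursor = 0
    for start, end, tag in sorted(spans):
        if tag != label:
            continue  # leave non-location entities untouched
        out.append(text[cursor:start])
        out.append(token)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

Applied to the example above with the span covering "Hrishyamukh", this yields "... a BJP rally in LOC on 18 January 2018."; the same logic with date/time spans and the token 'TIME' produces ACLED T.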

System Description
We submitted five system runs for evaluation. The systems differ slightly from each other, either in the model or the way the used data was pre-processed. The general approach for all submitted systems was to use fine-tuned pre-trained transformer document embeddings. All experiments were conducted using the Flair framework (Akbik et al., 2019).

System 1 - RoBERTa ACLED L
For system 1, we fine-tuned the RoBERTa base model (Liu et al., 2019), training the embeddings with a learning rate of 3e-5 and a batch size of 16. We trained the model for 2 epochs, as our experiments showed that it overfits when trained for longer. After each epoch the training data was shuffled; this was also done in the subsequent systems. Additionally, we assigned weights to the different event classes. This was done to smooth out any remaining imbalance in the class distribution.
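The class-weighting step can be sketched as follows. The exact weighting scheme is not specified above, so the sketch assumes a common choice: inverse-frequency weights rescaled so the average per-sample weight is 1.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: rare classes get larger weights.
    Rescaled by n / k so the weighted mean over all samples equals 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

Such weights are typically passed to the loss function during fine-tuning so that errors on underrepresented subtypes (e.g. CHEM WEAP) are penalized more heavily.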

System 2 - RoBERTa ACLED N
System 2 again uses the RoBERTa base model (Liu et al., 2019) with the previously mentioned learning rate (3e-5), batch size (16), and number of epochs (2). The difference from system 1 is that the text used during fine-tuning was not pre-processed: the text snippets obtained from the ACLED (Donnay et al., 2019) database were fed into the system in their original state, so all information included in the text was kept.

System 3 - BERT ACLED L
For system 3, we used the pre-trained BERT base-cased model (Devlin et al., 2019) with a learning rate of 3e-5, a batch size of 16, and 2 training epochs. As in system 1, we used the ACLED L corpus.

System 4 - BERT ACLED N
System 4 used the same settings as system 3: the pre-trained BERT base-cased model (Devlin et al., 2019), a learning rate of 3e-5, a batch size of 16, and 2 training epochs. The input data for system 4 consisted of the original text from ACLED N.

System 5 - BERT ACLED T
Our last system, system 5, also used the pre-trained BERT base-cased model (Devlin et al., 2019), with the learning rate set to 3e-5 and the batch size to 32; it was trained for 2 epochs. As text input we used ACLED T, from which all time and date stamps had been removed.

Preliminary experiments
Preliminary model evaluations on 10 held-out test sets show that all systems performed comparably well. The RoBERTa model with the normal ACLED text as input performed slightly better than the other systems. Table 1 below shows the range of macro and micro F1 scores across the 10 test sets; model performance varied slightly depending on the samples in the individual test sets. The results also illustrate that removing the location or time mentions from the event snippets does not greatly influence system performance. Rather, the preliminary tests indicate that the fine-tuned RoBERTa embeddings benefit from the inclusion of the more detailed, ACLED-specific information.
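For reference, macro F1 averages the per-class F1 scores equally (so rare classes count as much as frequent ones), while micro F1 pools all decisions; in single-label multi-class evaluation the latter reduces to accuracy. A minimal sketch of both metrics:

```python
from collections import Counter

def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    classes = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # wrongly predicted as p
            fn[g] += 1  # missed instance of g
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(f1s) / len(f1s)
    micro = sum(tp.values()) / len(gold)  # equals accuracy in this setting
    return macro, micro
```

The gap between the two scores reported above reflects exactly this difference: a few weak classes pull the macro average down more than the micro average.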
An analysis of the results for the individual classes shows that each of the 25 subtypes achieves an F1-score of over 0.800. The two lowest scoring classes are HQ ESTABLISHED and .

Results
Table 7 shows the results of our five system submissions. The systems were tested on a test set provided by the organizers, consisting of 829 samples for subtask 1. We find that system 2, using the RoBERTa base model (Liu et al., 2019) and ACLED N as input, performs best with an average weighted F-score of 0.83, an average macro F-score of 0.794, and an average micro F-score of 0.829. A more detailed overview can be found in appendix A.
Additionally, the second model that uses the original ACLED text, system 4, achieves the second best result. As in our preliminary experiments, we see that the inclusion of specific location and time stamps in the training data does not greatly influence the systems' ability to predict the different classes correctly.

Error Analysis
To get a better insight into the workings of our systems, we conducted an error analysis on the test data provided by the organizers for all five submissions. To investigate the misclassifications made by the models, we looked at system performance on the individual classes.

Analysis of Word Frequencies
As can be seen in Table 3, all models achieve low F-scores for the class OTHER, the class PROPERTY DISTRUCT, or both. The results obtained for these classes substantially lower the models' overall average F-scores. All models achieve the highest scores for the classes SUIC BOMB, GRENADE, and CHEM WEAP. We therefore looked at the word distributions of these classes in our training data, as can be seen in figures 2 and 3.
Considering these distributions, a specific vocabulary, as found in the class SUIC BOMB, appears advantageous for correct classification, while a heterogeneous vocabulary, as found in the class OTHER, is disadvantageous. While the by far most frequent word in texts of the event type SUIC BOMB, "suicide", is clearly indicative of the class, the most frequent words for the event type OTHER, "activity", "violent", "area" and "force", are rather generic. Furthermore, these words also occur frequently in texts of other classes (e.g., "area" in AIR STRIKE and CHANGE TO GROUP ACT, "force" in NON STATE ACTOR OVERTAKES TER and NON VIOL TERRIT TRANSFER). This does not hold true for the word "suicide".
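The per-class distributions behind this analysis can be computed with a simple frequency count. The sketch below is a simplification (whitespace tokenisation and a tiny illustrative stop-word list, not our actual preprocessing):

```python
from collections import Counter

STOPWORDS = frozenset({"the", "a", "an", "in", "on", "of", "and", "to"})

def top_words(snippets_by_class, k=5):
    """Most frequent lower-cased content tokens per event class."""
    result = {}
    for cls, texts in snippets_by_class.items():
        counts = Counter(
            tok for text in texts
            for tok in text.lower().split()
            if tok not in STOPWORDS
        )
        result[cls] = [w for w, _ in counts.most_common(k)]
    return result
```

A class whose top words are shared with many other classes (as with OTHER) offers the classifier far weaker lexical cues than one dominated by a distinctive term such as "suicide".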

Frequent Errors
Looking further at the errors, we see that 65 samples of the test data were classified incorrectly by all five models, which makes up between 37% and 45% of the errors of the respective systems. It is noticeable that all five models frequently predict the class MOB VIOL for sentences gold-labeled as PROPERTY DISTRUCT (between 5 and 9 times for the respective systems). No other two classes are confused this often, so to investigate further we analyzed these two classes with regard to their topic compactness: we calculated the topic distances of the sentences to the per-class topic centroids in the training data. We see that both classes, MOB VIOL and PROPERTY DISTRUCT, are quite compact; there are some outliers, but most of the document vectors are clustered close to each other and to the topic centroid. However, if we combine the two classes into one topic and again analyze the distribution of document vectors around the topic centroid, we find that there are also very few outliers, as can be seen in figure 6. This means that the examples for MOB VIOL and PROPERTY DISTRUCT in our training data are similar to each other, which may explain why our models consistently confuse these two classes on the test data provided by the organizers.
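The compactness measure can be sketched as the mean cosine distance of each document vector to its class centroid (assuming documents are already embedded as equal-length vectors; the distance metric is our assumption, as the text above does not fix one):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def compactness(vectors):
    """Mean cosine distance to the topic centroid; lower = more compact."""
    c = centroid(vectors)
    return sum(cosine_distance(v, c) for v in vectors) / len(vectors)
```

Computing this for MOB VIOL, for PROPERTY DISTRUCT, and for their union then shows whether merging the two classes introduces new outliers; a merged topic that is nearly as compact as its parts indicates overlapping vocabulary.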
Looking at the test samples, we further find that, due to the large number of classes, it is in some cases also difficult for human annotators to distinguish between the different classes. An example is the following:
{text: Police said two groups from different communities in Chhabra town of Rajasthan's Baran district pelted stones on each other and torched vehicles parked around after putting six shops afire, guess: MOB_VIOL, gold: PROPERTY_DISTRUCT}
All our models consistently predict the event class MOB VIOL for this example; the gold standard annotation, however, is PROPERTY DISTRUCT. It can be argued that the example actually includes both event classes: the first part of the sentence, "Police said two groups from different communities in Chhabra town of Rajasthan's Baran district pelted stones on each other", is an instance of MOB VIOL, while the second part, "and torched vehicles parked around after putting six shops afire", belongs to the class PROPERTY DISTRUCT. Test instances like this pose a challenge for the models.

Conclusion
In this study we proposed the use of fine-tuned RoBERTa transformer document embeddings for the fine-grained classification of socio-political events. We balanced the corpus to ensure that the 25 subtypes were represented as equally as possible. Compared to the results achieved during the preliminary experiments, we observed a drop in performance on the test set provided by the organizers. However, compared to the baseline figures provided by the organizers in , we achieved very similar results with less training data. This suggests that balancing the training data had a positive effect on model performance.
Our analysis of the results on both test sets, the set created for the preliminary experiments and the set provided by the organizers for system evaluation, shows a clear difference in performance across the various classes. It also highlighted the issue of events that could be assigned to more than one subtype, and the challenge such events pose for fine-grained classification. Depending on the given use case, parts of our system could already be implemented in a real-world setting to analyze the flow of data in the information space and support situational awareness in the physical world, as clear-cut classes like CHEM WEAP and GRENADE are identified reliably. In a military setting, for example, these classes are far more relevant than occurrences of PROPERTY DISTRUCT.
In future work, it would be interesting to evaluate whether the use of more training data, while still aiming for an even class distribution, would further increase performance, particularly for the classes that currently do not perform as well. A more thorough class analysis, aimed at understanding why there appear to be systematic errors in specific classes, could provide insight into answering this question.