DAAI at CASE 2021 Task 1: Transformer-based Multilingual Socio-political and Crisis Event Detection

Automatic socio-political and crisis event detection has been a challenge for natural language processing as well as social and political science communities, due to the diversity and nuance in such events and high accuracy requirements. In this paper, we propose an approach which can handle both document and cross-sentence level event detection in a multilingual setting using pretrained transformer models. Our approach became the winning solution in document level predictions and secured 3rd place in cross-sentence level predictions for the English language. We also achieved competitive results for other languages, demonstrating the effectiveness and universality of our approach.


Introduction
With technological advancements, we now have access to a vast amount of data related to social and political factors. These data may contain information on a wide range of events such as political violence, environmental catastrophes and economic crises, which is important for preventing or resolving conflicts, improving the quality of life and protecting citizens. However, with the increasing data volume, manual event detection has become too expensive, making automated and accurate methods crucial (Hürriyetoglu et al., 2020).
Considering this timely requirement, CASE 2021 Task 1: Multilingual protest news detection was designed (Hürriyetoglu et al., 2021). This task is composed of four subtasks targeting different data levels. Subtask 1 is to identify documents which contain event information. Similarly, subtask 2 is to identify sentences which describe events. Subtask 3 targets the cross-sentence level to group sentences which describe the same event. The final subtask is to identify the event trigger and its arguments at the entity level. Since a news article can contain one or more events and a single event can be described together with some previous or relevant details, it is important to focus on different data levels to obtain more accurate and complete information.
This paper describes our approach for document and cross-sentence level event detection, including an experimental study. Our approach is mainly based on pretrained transformer models. We use improved model architectures, different learning strategies and unsupervised algorithms to make effective predictions. To facilitate generalisation across languages, we do not use any language-specific processing or additional resources. Our submissions achieved 1st place in document level predictions and 3rd place in cross-sentence level predictions for the English language. Demonstrating the universality of our approach, we obtained competitive results for other languages too.
The remainder of this paper is organised as follows. Section 2 describes the related work done in the field of socio-political event detection. Details of the task and datasets are provided in Section 3. Section 4 describes the proposed approaches. The experimental setup is described in Section 5 followed by results and evaluation in Section 6. Finally, Section 7 concludes the paper. Additionally, we make our code freely available to everyone interested in working in this area using the same methodology.

Related Work
In early work, the majority of event detection approaches were data-driven and knowledge-driven (Hogenboom et al., 2011). Since the data-driven approaches are based only on the statistics of the underlying corpus, they missed important semantic relationships. The knowledge-driven or rule-based approaches were proposed to tackle this limitation, but they rely heavily on the targeted domains or languages (Danilova and Popova, 2014).
Later, the focus shifted towards traditional machine learning-based models (e.g. support vector machines, decision trees) combined with different feature extraction techniques (e.g. natural language parsing, word vectorisation) (Schrodt et al., 2014; Sonmez et al., 2016). There was also a tendency to apply deep learning-based approaches (e.g. CNN, FFNN), following their success in many information retrieval and natural language processing (NLP) tasks (Lee et al., 2017; Ahmad et al., 2020). However, these approaches are difficult to extend to low-resource languages, due to the lack of training data to fine-tune the models.
Targeting this major limitation, in this paper we propose an approach which is based on pretrained transformer models. Due to the usage of general knowledge available with the pretrained models and their multilingual capabilities, our approach can easily support event detection in multiple languages including low-resource languages.

Subtasks and Data
CASE 2021 Task 1: Multilingual protest news detection is composed of four subtasks targeting event information at document, sentence, cross-sentence and token levels (Hürriyetoglu et al., 2021). The task mainly focuses on socio-political and crisis events which are in the scope of contentious politics and characterised by riots and social movements. Among these subtasks, we participated in subtask 1 and subtask 3, which are further described below.
Subtask 1: Document Classification
Subtask 1 is designed as a document classification task. Participants need to predict a binary label of '1' if the news article contains information about a past or ongoing event and '0' otherwise. To preserve the multilinguality of the task, four different languages (English, Spanish, Portuguese and Hindi) have been considered for data preparation. More training instances were provided for English than for Spanish and Portuguese. No training data were provided for the Hindi language. For final evaluations, test data were provided without labels. The data split sizes in each language are summarised in Table 1.

Methodology
The main motivation behind the proposed approaches for event document identification and event sentence coreference identification is the recent success of transformer-based architectures in various NLP and information retrieval tasks such as language detection (Jauhiainen et al., 2021), question answering (Yang et al., 2019) and offensive language detection (Husain and Uzuner, 2021). Apart from providing strong results compared to RNN-based architectures, transformer models like BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) provide pretrained language models that support more than 100 languages, which is a huge benefit when it comes to multilingual research. The available models have been trained on general tasks like language modelling and can then be fine-tuned for downstream tasks like text classification (Sun et al., 2019). Depending on the nature of the targeted subtask, we used different transformer models along with different learning strategies to extract event information, as described below.

Subtask 1: Document Classification
Document classification can be considered as a sequence classification problem. According to recent literature, transformer architectures have shown promising results in this area (Ranasinghe et al., 2019b; Hettiarachchi and Ranasinghe, 2020). Transformer models take a sequence as input and output representations of the sequence. The input sequence can contain one or two segments separated by a special token [SEP]. In this approach, we considered a whole document or news article as a single sequence, so no [SEP] token is used. Another special token, [CLS], is used as the first token of the sequence; it returns a special embedding corresponding to the whole sequence, which is used for text classification tasks (Sun et al., 2019). A simple softmax classifier is added on top of the transformer model to predict the probability of a class. The architecture of the transformer-based sequence classifier is shown in Figure 1. Unfortunately, the majority of transformer models such as BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) fail to process documents with a sequence length higher than 512. This limitation is due to the self-attention operation used by these architectures, which scales quadratically with the sequence length (Beltagy et al., 2020). Therefore, we specifically focused on improved transformer models targeting long documents: Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020). Longformer utilises an attention mechanism that scales linearly with sequence length and BigBird utilises a sparse attention mechanism to handle long sequences.
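To make the scaling argument concrete, the following sketch counts the query-key scores computed by full self-attention and by a sliding-window pattern of the kind Longformer uses. It is only an illustration: the function names and the window size are assumptions, and global attention tokens as well as BigBird's random and global blocks are ignored.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Full self-attention: every token attends to every token (n * n scores)."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Sliding-window attention: each token attends to itself and up to
    window // 2 tokens on each side, clipped at the sequence boundaries."""
    half = window // 2
    total = 0
    for i in range(seq_len):
        lo = max(0, i - half)
        hi = min(seq_len - 1, i + half)
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    # Assumed window of 512 tokens; the sequence lengths are illustrative.
    for n in (512, 1024, 2048, 4096):
        print(n, full_attention_pairs(n), sliding_window_pairs(n, window=512))
```

Doubling the sequence length quadruples the first count but only roughly doubles the second, which is why windowed attention remains tractable for long documents.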

Data Preprocessing:
We applied a few preprocessing techniques to the data before feeding them into the models. All the selected techniques are language-independent to support multilingual experiments. Our analysis of the datasets showed that some documents had a very low sequence length (< 5), and these were removed. Further, URLs were removed and symbols repeated more than three times (e.g. =====) were replaced by three occurrences (e.g. ===) because the additional repetitions are uninformative.
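The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than our exact implementation: the regular expressions, the use of whitespace tokens to measure sequence length, and the choice to filter after cleaning are all assumptions.

```python
import re

# Assumed URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
# A non-alphanumeric, non-space symbol repeated four or more times in a row.
REPEAT_PATTERN = re.compile(r"([^\w\s])\1{3,}")
MIN_TOKENS = 5  # documents with fewer whitespace tokens are dropped (assumption)


def clean_text(text: str) -> str:
    """Remove URLs and collapse long symbol runs to three occurrences."""
    text = URL_PATTERN.sub(" ", text)
    text = REPEAT_PATTERN.sub(r"\1\1\1", text)
    return " ".join(text.split())  # normalise whitespace


def preprocess(documents):
    """Clean every document and drop those that are too short."""
    cleaned = (clean_text(doc) for doc in documents)
    return [doc for doc in cleaned if len(doc.split()) >= MIN_TOKENS]
```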

Subtask 3: ESCI
Event Sentence Coreference Identification (ESCI) can be considered as a clustering problem. If a set of sentences is assigned to clusters based on semantic similarity, each cluster will represent a separate event. To perform clustering, each sentence needs to be mapped to an embedding which preserves its semantic details.

Sentence Embeddings
Previous research proposed different approaches to obtain sentence embeddings. Based on word embedding models such as GloVe (Pennington et al., 2014), the average of word embeddings over a sentence was used. Later, more advanced architectures were developed, such as InferSent (Conneau et al., 2017), which is based on a siamese BiLSTM network with max pooling, and Universal Sentence Encoder (Cer et al., 2018), which is based on a transformer network and augmented unsupervised learning. With the improved performance of transformers on NLP tasks, there was also a tendency to input sentences into models like BERT and take the output of the first token ([CLS]) or the average of the output layer as a sentence embedding (May et al., 2019; Qiao et al., 2019). However, these approaches were found to perform worse than average GloVe embeddings, because the BERT architecture was designed for classification or regression tasks (Reimers et al., 2019).
Considering these limitations and characteristics of transformer-based models, Reimers et al. (2019) proposed a new architecture named Sentence Transformer (STransformer), a modification to the transformers to derive semantically meaningful sentence embeddings. According to the experimental studies, STransformers outperformed average GloVe embeddings, specialised models like InferSent and Universal Sentence Encoder, and BERT embeddings (Reimers et al., 2019). Considering these facts, we adopt STransformers to generate sentence embeddings in our approach.
STransformer creates a siamese network using transformer models like BERT and fine-tunes it to produce effective sentence embeddings. A pooling layer is added on top of the transformer model to generate fixed-size embeddings for sentences. The siamese network takes a sentence pair as input and passes both sentences through the network to generate embeddings (Ranasinghe et al., 2019a). The similarity between the embeddings is then computed using cosine similarity and compared with the gold score to fine-tune the network. The architecture of STransformer is shown in Figure 2.
Data Formatting: To facilitate STransformer fine-tuning or training, we formatted the given sentences into pairs and assigned a similarity of '1' if both sentences belong to the same cluster and '0' otherwise. The order of sentences within a pair is not considered. Thus, for n sentences, (n × (n − 1))/2 pairs were generated. For example, the sentence pairs and labels generated for the data sample given in Listing 1 are shown in Table 3.
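The pair generation used for data formatting can be illustrated with a short sketch. The function name and the input layout (parallel lists of sentences and cluster identifiers) are assumptions made for this example.

```python
from itertools import combinations


def make_pairs(sentences, cluster_ids):
    """Return (sentence_a, sentence_b, label) for every unordered sentence pair.

    The label is 1 when both sentences share a cluster id and 0 otherwise.
    """
    pairs = []
    for (s1, c1), (s2, c2) in combinations(zip(sentences, cluster_ids), 2):
        pairs.append((s1, s2, 1 if c1 == c2 else 0))
    return pairs


if __name__ == "__main__":
    # Four sentences in two clusters yield 4 * 3 / 2 = 6 labelled pairs.
    for pair in make_pairs(["s1", "s2", "s3", "s4"], [0, 0, 1, 1]):
        print(pair)
```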

Clustering
As clustering methods, we focused on hierarchical clustering and the pairwise prediction-based clustering approach proposed by Örs et al. (2020). Hierarchical clustering is widely preferred over flat clustering in event detection approaches, because flat clustering algorithms (e.g. K-means) require the number of clusters as an input, which is unpredictable. We also focused on the pairwise prediction-based clustering approach, considering the availability of training data and its recent successful applications.
Hierarchical Clustering: For the hierarchical clustering algorithm, we used Hierarchical Agglomerative Clustering (HAC). Each sentence is converted into an embedding which is input to the clustering algorithm. HAC considers all data points as separate clusters at the beginning and then merges them based on cluster distance using a linkage method. The tree-like diagram generated by this process is known as a dendrogram, and a particular distance threshold is used to cut it into clusters (Manning et al., 2008). For the distance metric, cosine distance is used, because it has proved effective for measurements in textual data (Mikolov et al., 2013; Antoniak and Mimno, 2018) and a variant of it is used with STransformer models. For the linkage method, single, complete and average schemes were considered in initial experiments, and the average scheme was selected because it outperformed the others. We picked the optimal distance threshold automatically using the training data. If the training data is further split into training and validation sets for use with STransformers, only the validation set is used to pick the cluster threshold, because the rest of the data is already known to the embedding-generating model.
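The HAC procedure with cosine distance, average linkage and a distance threshold can be sketched in plain Python as below. This is a simplified illustration that favours readability over efficiency; it assumes non-zero embedding vectors, and the threshold passed in is a hypothetical value rather than the one selected in our experiments. A library implementation would normally be preferable in practice.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage(cluster_a, cluster_b, dist):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hac(embeddings, threshold):
    """Merge the closest clusters until the smallest linkage distance
    exceeds the threshold; returns clusters as lists of sentence indices."""
    n = len(embeddings)
    dist = [[cosine_distance(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]  # every sentence starts alone
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # cutting the dendrogram at the threshold
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

Stopping the merges once the smallest linkage distance exceeds the threshold is equivalent to cutting the dendrogram at that height, so the number of clusters does not have to be known in advance.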

Pairwise Prediction-based Clustering:
We used the pairwise prediction-based clustering algorithm proposed by Örs et al. (2020), which became the winning solution of the ESCI task in the AESPEN-2020 workshop (Hürriyetoglu et al., 2020). Originally, this algorithm used the BERT model to predict whether a certain sentence pair belongs to the same event or not. In this research, we used STransformers to make those predictions instead of general transformers. Since an STransformer model is designed to produce embeddings, we derived labels (i.e. '1' if the sentence pair belongs to the same event and '0' if not) from them using cosine similarity with a threshold. The optimal value computed during the model evaluation process is used as the threshold.
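Deriving pair labels from embeddings can be sketched as follows. The example covers only the thresholding step, not the grouping procedure of Örs et al. (2020). Selecting the threshold by accuracy on validation pairs is an assumption made for illustration, and all function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_same_event(emb_a, emb_b, threshold):
    """Label a sentence pair 1 (same event) or 0 (different events)."""
    return 1 if cosine_similarity(emb_a, emb_b) >= threshold else 0


def best_threshold(similarities, gold_labels):
    """Try each observed similarity as a threshold and keep the one with the
    highest accuracy on the validation pairs (assumed selection criterion)."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(similarities)):
        preds = [1 if s >= t else 0 for s in similarities]
        acc = sum(p == g for p, g in zip(preds, gold_labels)) / len(gold_labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```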

Experimental Setup
This section describes the learning configurations, transformer models and hyper-parameters used for the experiments.

Learning Configurations
We focused on different learning configurations depending on data and model availability and on the multilingual setting. Considering the availability of data and models, we used the following configurations for the experiments.
Pretrained (No Learning): Pretrained models are used without making any modifications to them to make the predictions. In this case, models pretrained using a similar objective to the target objective need to be selected.
Fine-tuning: Under fine-tuning, we retrain an available model either on a downstream task or on the same task for which the model was already trained. This allows the model to become familiar with the targeted data.
From-scratch Learning: Models are built from scratch using the targeted data. This procedure helps to mitigate the unnecessary biases made by the data used to train available models.
Language Modelling (LM): In LM, we retrain the transformer model on the targeted dataset using the model's initial training objective before fine-tuning it for the downstream task. This step helps increase the model understanding of data (Hettiarachchi and Ranasinghe, 2020).
For multilingual data, the following configurations are considered to support both high- and low-resource languages.
Monolingual Learning: In monolingual learning, we build the model using training data from the targeted language only.
Multilingual Learning: In multilingual learning, we concatenate available training data from all languages and build a single model.

Zero-shot Learning: In zero-shot learning, we use models fine-tuned for the same task on training data from other language(s) to make the predictions. The multilingual and cross-lingual nature of the transformer models makes this possible (Hettiarachchi and Ranasinghe, 2021).
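The three multilingual configurations differ only in which training data reaches the model, which the sketch below illustrates. The function name, the mode strings and the dictionary layout are hypothetical and chosen purely for this example.

```python
def build_training_set(data_by_language, target_language, mode):
    """Select training examples for a target language.

    data_by_language maps a language code to a list of labelled examples.
    """
    if mode == "monolingual":
        # Only data from the targeted language.
        return list(data_by_language[target_language])
    if mode == "multilingual":
        # Concatenate the data of all available languages.
        return [ex for lang in sorted(data_by_language)
                for ex in data_by_language[lang]]
    if mode == "zero-shot":
        # Train on every language except the target, which may have no data.
        return [ex for lang in sorted(data_by_language)
                if lang != target_language
                for ex in data_by_language[lang]]
    raise ValueError(f"unknown mode: {mode}")
```

For a language without any training data, such as Hindi in subtask 1, only the zero-shot branch is applicable.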

Transformers
We used monolingual and multilingual general transformers as well as pretrained STransformers for our experiments. [...] et al., 2020) models which are variants of the BERT model were considered. As multilingual models, the multilingual version of BERT and XLM-R (Conneau et al., 2020) were used. Among these models, sequence lengths above 512 are supported only by the BigBird and Longformer models, which are available for English. We used HuggingFace's Transformers library (Wolf et al., 2020) to obtain the models.
Sentence Transformers: STransformers provide pretrained models for different tasks 2 . Among them, we selected the best-performing models trained for semantic textual similarity (STS) and duplicate question identification, because these tasks are related to predicting whether two sentences describe the same event.

Hyper-parameter Configurations
We used an Nvidia Tesla K80 GPU to train the models. Each input dataset is divided into a training set and a validation set using a 0.9:0.1 split. We predominantly tuned the learning rate and the number of epochs manually to obtain the best results on the validation set. For document classification, we obtained 1e-5 as the best value for the learning rate and 3 as the best value for the number of epochs. The same learning rate was found to be the best value for STransformers, with 5 epochs. For the sequence length, we experimented with different values for document classification, and they are further discussed in Section 6.1. A fixed sequence length of 136 was used for ESCI considering its data.
2 Sentence Transformer pretrained models are available on https://www.sbert.net/docs/pretrained_models.html
To improve the performance of document classification, we used the majority-class self-ensemble approach described by Hettiarachchi and Ranasinghe (2020). We trained three models with different random seeds and considered the majority class returned by the models as the final prediction.
To train STransformers, we selected the online contrastive loss, an improved version of the contrastive loss function. The contrastive loss function learns the parameters by reducing the distance between neighbours or semantically similar embeddings and increasing the distance between non-neighbours or semantically dissimilar embeddings (Hadsell et al., 2006). The online version automatically detects the hard cases in a batch (i.e. negative pairs with a lower distance than the largest distance of positive pairs and positive pairs with a higher distance than the lowest distance of negative pairs) and calculates the loss only for them.
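The majority-class self-ensemble can be sketched as a simple vote over the predictions of the individually trained models. The function name and input layout are assumptions; with three models and binary labels, ties cannot occur.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists (all of equal length) by majority class."""
    final = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest vote count.
        final.append(Counter(votes).most_common(1)[0][0])
    return final


if __name__ == "__main__":
    # Hypothetical predictions of three models trained with different seeds.
    seed_1 = [1, 0, 1]
    seed_2 = [1, 1, 0]
    seed_3 = [0, 0, 1]
    print(majority_vote([seed_1, seed_2, seed_3]))
```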
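The hard-case selection of the online contrastive loss can be illustrated on precomputed pair distances. This plain-Python sketch follows the description above rather than any particular library implementation; the margin value and the handling of batches that contain only one class are assumptions.

```python
def online_contrastive_loss(distances, labels, margin=0.5):
    """Contrastive loss restricted to the hard pairs in a batch.

    distances: embedding distances per pair (e.g. 1 - cosine similarity).
    labels: 1 for same-event (positive) pairs, 0 for negative pairs.
    """
    pos = [d for d, y in zip(distances, labels) if y == 1]
    neg = [d for d, y in zip(distances, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # assumption: no hard cases without both classes

    max_pos, min_neg = max(pos), min(neg)
    # Hard negatives lie closer than the farthest positive pair.
    hard_neg = [d for d in neg if d < max_pos]
    # Hard positives lie farther than the closest negative pair.
    hard_pos = [d for d in pos if d > min_neg]

    # Pull hard positives together, push hard negatives beyond the margin.
    positive_loss = sum(d ** 2 for d in hard_pos)
    negative_loss = sum(max(0.0, margin - d) ** 2 for d in hard_neg)
    return positive_loss + negative_loss
```

When positive and negative pairs are already well separated, no pair qualifies as hard and the batch contributes no loss, which concentrates training on the difficult examples.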

Results and Evaluation
In this section, we report the conducted experiments and their results.

Subtask 1: Document Classification
Task organisers used Macro F1 as the evaluation metric for subtask 1. Since only the training data were released, we separated a dev set from each training dataset to evaluate our approach. Depending on the data size, 20% of the English training data and 10% of the training data in other languages were separated as dev data. Initially, we analysed the performance of fine-tuned document classifiers for English using BERT and improved transformer models for long documents, along with varying sequence lengths. Considering the sequence length distribution in the data, we picked the lengths of 256, 512 and 700 for these experiments. The obtained results are summarised in Table 4. Even though we targeted large versions of the models (e.g. BigBird-roberta-large), due to resource limitations, we had to use base versions (e.g. BigBird-roberta-base) for some experiments. According to the results, the F1 of BERT models improves when we increase the sequence length. In contrast, both BigBird and Longformer models have a higher F1 with lower sequence lengths.
For predictions on Spanish and Portuguese documents, we fine-tuned the models using both monolingual and multilingual learning approaches. Since transformers with a maximum sequence length of 512 are used, we fixed the sequence length to 512 based on the findings of the English experiments. The obtained results and training configurations are summarised in Table 5. For the high-resource language (i.e. English), multilingual learning returns a lower F1 than monolingual learning. However, low-resource languages show a clear improvement in F1 with multilingual learning. Since there were no training data for the Hindi language, the best multilingual models were picked to apply the zero-shot learning approach.
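For reference, Macro F1 for the binary labels of subtask 1 can be computed as in the sketch below. It assumes the standard definition (the unweighted mean of per-class F1 scores) and is not the organisers' evaluation script.

```python
def f1_for_class(gold, pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)
```

Because each class contributes equally, the metric penalises systems that ignore the minority class.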
We report the results we obtained for test data in Table 6. According to the results, our approach which used the BigBird model became the best system for the English language. For other languages, multilingual learning performed best. Among models, XLM-R outperformed the BERT-multilingual model. Compared to the best systems submitted, our approach has very competitive results for these languages too.

Subtask 3: ESCI
For the English language, we experimented with the clustering approaches using the embeddings generated by different STransformer models. Initially, we focused on pretrained models and their versions fine-tuned on task data. Later, we built STransformers from scratch using general transformer models and further integrated LM too. The obtained results and corresponding model details are summarised in Table 7. According to the results, STransformers built from scratch outperformed the pretrained and fine-tuned models. LM did not improve the results, which can happen when the data is insufficient for language modelling. Among the clustering algorithms, HAC showed the best results.
We could not train any STransformer for other languages because the organisers provided a limited number of labelled instances for those languages. We used pretrained multilingual models and, adhering to zero-shot learning, fine-tuned them using English data. Further, English data were used to build STransformers from scratch too. All the evaluations were also done on English data and the best-performing systems were chosen to make predictions for other languages. The obtained results are summarised in Table 8. Similar to the English monolingual scenario, from-scratch multilingual models performed best.
We report the results for test data in Table 9. According to the results, for all languages, we obtained competitive results compared to those of the best submitted system. Since our approach can be easily extended to different languages with very few training instances, we believe the results are at a satisfactory level.

Conclusions
In this paper, we presented our approach for the document and cross-sentence level subtasks of CASE 2021 Task 1: Multilingual protest news detection. We mainly used pretrained transformer models, including improved architectures for long document processing and sentence embedding generation. Further, we used different learning strategies (monolingual, multilingual and zero-shot) together with classification and clustering approaches. For document level predictions, our approach achieved 1st place for the English language while being within the top 4 solutions for other languages. For cross-sentence level predictions, we secured 3rd place for the English language with competitive results for other languages. In addition, our approach can support multiple languages with low or no training resources.
As future work, we hope to further improve semantically meaningful sentence embedding generation using improved architectures, learning strategies and ensemble methods. Also, we would like to analyse the impact of different clustering approaches on cross-sentence level predictions.