PROTEST-ER: Retraining BERT for Protest Event Extraction

We analyze the effect of further retraining BERT with different domain-specific data as an unsupervised domain adaptation strategy for event extraction. Portability of event extraction models is particularly challenging, with large performance drops even across datasets of the same text genre (e.g., news). We present PROTEST-ER, a retrained BERT model for protest event extraction. PROTEST-ER outperforms a corresponding generic BERT on out-of-domain data by 8.1 points. Our best performing models reach 51.91 and 46.39 F1 on the two domains.


Introduction and Problem Statement
Events, i.e., things that happen in the world or states that hold true, play a central role in human lives. It is not a simplification to claim that our lives are nothing but a constant sequence of events. Nevertheless, not all events are equally relevant, especially when the focus of attention and analysis moves away from individuals and touches upon societies. In this broader context, socio-political events are of particular interest since they directly impact and affect the lives of multiple individuals at the same time. Different actors (e.g., governments, multilateral organizations, NGOs, social movements) have various interests in collecting information and conducting analyses on this type of event. This, however, is a challenging task. The increasing availability and amount of data, thanks to the growth of the Web, calls for the development of automatic solutions based on Natural Language Processing (NLP).
Despite the good level of maturity reached by NLP systems in many areas, numerous challenges remain open. Portability of systems, i.e., the reuse of previously trained systems for a specific task on different datasets, is one of them, and it is far from being solved (Daumé III, 2007; Plank and van Noord, 2011; Axelrod et al., 2011; Ganin and Lempitsky, 2015; Alam et al., 2018; Xie et al., 2018; Zhao et al., 2019; Ben-David et al., 2020). As such, portability is a domain adaptation problem. Following Ramponi and Plank (2020), we consider a domain to be a variety where each corpus, or dataset, can be described as a multidimensional region including notions such as topics, genres, writing styles, years of publication, socio-demographic aspects, and annotation bias, among other unknown factors. Every dataset belonging to a different variety poses a domain adaptation challenge.
Unsupervised domain adaptation has a long tradition in NLP (Blitzer et al., 2006; McClosky et al., 2006; Moore and Lewis, 2010; Ganin et al., 2016; Ruder and Plank, 2017; Guo et al., 2018; Miller, 2019; Nishida et al., 2020). The availability of large pre-trained transformer-based language models (TLMs), e.g., BERT (Devlin et al., 2019), has inspired a new trend in domain adaptation, namely domain adaptive retraining (DAR) (Han and Eisenstein, 2019; Rietzler et al., 2020; Gururangan et al., 2020). The idea behind DAR is as simple as it is effective: first, additional textual material matching the target domain is selected; then the masked language modeling (MLM) objective is used to further train an existing TLM. The outcome is a new TLM whose representations are shifted to better suit the target domain. Fine-tuning domain-adapted TLMs results in improved performance.
This contribution applies this approach to develop a portable system for protest event extraction. Our unsupervised domain adaptation setting investigates two related aspects. The first concerns the impact of the data used to adapt a generic TLM to a target domain (i.e., protest events). The second targets the portability, in a zero-shot scenario, of domain-adapted TLMs across protest event datasets. Our experimental results provide additional evidence that further pretraining a TLM on domain-related data is a "cheap" and successful method in single-source single-target unsupervised domain adaptation settings. Furthermore, we show that fine-tuning retrained TLMs results in models with better portability.

Task and Data
We focus on the protest event detection task following the 2019 CLEF ProtestNews Lab (Hürriyetoglu et al., 2019). Protest events are defined as politically motivated collective actions that lie outside the official mechanisms of political participation of the country in which the action takes place.
The lab is organised around three non-overlapping subtasks: (a.) document classification; (b.) sentence classification; and (c.) event extraction. Tasks (a.) and (b.) are text classification tasks, requiring systems to distinguish whether a document/sentence refers to a protest event. The event extraction task is a sequence tagging problem requiring systems to identify event triggers and their corresponding arguments, similarly to other event extraction tasks, e.g., ACE (Linguistic Data Consortium, 2005).
The lab is designed to challenge models' portability in an unsupervised setting: systems receive training and development data belonging to one variety and are asked to test against both a dataset from the same variety and one from a different variety. We report in Table 1 the distribution of the markables (event triggers and arguments) for event extraction across the two varieties. We refer to the same-variety (or source) distributions as India and to the different-variety (or target) ones as China. The data are good examples of differences across factors characterising language varieties. For instance, although they belong to the same text genre (news articles), they describe protest events from two countries that have historical and cultural differences concerning what is worth protesting (e.g., caste protests are specific to India) and the type of protests (e.g., riots vs. petitions). Differences in the political systems entail differences in the actors of the protest events, which is mirrored in the named entities describing person or organization names. Language is a further challenge: both datasets are in English, but they present dialectal and stylistic differences.
We quantified differences and similarities by comparing the training data (India-train) against the two test sets (India-test and China-test) using the Jensen-Shannon (J-S) divergence and the out-of-vocabulary rate (OOV), which previous work has shown to be particularly useful for this purpose (Ruder and Plank, 2017). The figures in Table 2 show how these data distributions occupy different regions in the variety space, with India-test being closer to the training data than China-test. Tackling these similarities and differences is at the heart of our domain adaptation problem for event extraction.
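For illustration, the two measures can be computed over unigram distributions as in the following sketch (pure Python, assuming whitespace-tokenized corpora; the paper's exact tokenization and smoothing choices may differ):

```python
import math
from collections import Counter

def _kl(p, q):
    """KL divergence (base 2) between two aligned probability lists; 0 log 0 = 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(tokens_a, tokens_b):
    """Jensen-Shannon divergence between the unigram distributions of two corpora."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(ca) | set(cb))
    na, nb = sum(ca.values()), sum(cb.values())
    p = [ca[w] / na for w in vocab]
    q = [cb[w] / nb for w in vocab]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def oov_rate(train_tokens, test_tokens):
    """Fraction of test vocabulary types unseen in the training vocabulary.

    Note: OOV is sometimes computed over tokens rather than types;
    here we use types as one plausible variant.
    """
    train_vocab, test_vocab = set(train_tokens), set(test_tokens)
    return len(test_vocab - train_vocab) / len(test_vocab)
```

The J-S divergence is symmetric and, with base-2 logarithms, bounded in [0, 1], so values for the two test sets can be compared directly: identical corpora give 0.0, and a larger value indicates a more distant variety.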
A further challenge is posed by the limited amount of training material. A comparison against the training portion of ACE shows that ProtestNews has 5 times fewer triggers and 4 times fewer arguments. Unlike ACE, event triggers are not further classified into subtypes. However, seven argument types are annotated, namely participant, organiser, target, etime (event time), place, fname (facility name), and loc (location). The role set is inspired by the ACE Attack and Demonstrate event types but is more fine-grained. The markables are encoded in a BIO scheme (Beginning, Inside, Outside), resulting in different alphabets for triggers (e.g., B-trigger, I-trigger, and O) and for each of the arguments (e.g., O, B-organiser, I-organiser, B-etime, I-etime, etc.).
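As an illustration of the tagging scheme, a small helper can decode such BIO sequences back into labeled spans (the example labels are drawn from the inventory above, but the sentence and the lenient handling of stray I- tags are our own; the lab's official scorer may treat malformed sequences differently):

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (label, start, end_exclusive) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        # Close the open span on O, on a new B-, or on an I- with a different label.
        if label is not None and (
            tag == "O" or tag.startswith("B-") or tag[2:] != label
        ):
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label is None:
            # Lenient decoding: treat a stray I- tag as the start of a span.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans
```

For the (invented) sentence "Hundreds of workers protested in Mumbai on Monday", the tags `["B-participant", "I-participant", "I-participant", "B-trigger", "O", "B-place", "O", "B-etime"]` decode to one participant span, one trigger, one place, and one etime.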

Continue Pre-training to Adapt
We applied DAR to English BERT base-uncased to fill the language variety gap between BERT, trained on the BooksCorpus and Wikipedia, and the ProtestNews data.
We collected two sets of domain-related data, WPC-Gen and WPC-EV, from the TREC Washington Post Corpus version 3 (https://trec.nist.gov/data/wapost/).

Table 5: J-S (Similarity) and OOV (Diversity) between the DAR datasets WPC-Gen and WPC-EV and the test data distributions for the event extraction task.
We apply each data collection separately to BERT base-uncased by further training for 100 epochs using the MLM objective. The outcomes are two pre-trained language models: NEWS-BERT and PROTEST-ER. The differences between the models are assumed to be minimal yet relevant to assess the impact of the data used for DAR. To further support this claim, we report in Table 5 an analysis of the similarities and differences of the DAR data materials against the India and China test data. As the figures show, the DAR datasets are equally different from the protest event extraction ones. Furthermore, we did not modify BERT's original vocabulary by introducing new tokens. More details on the retraining parameters are reported in Appendix A.1.
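The MLM objective used for retraining follows BERT's standard corruption recipe: 15% of the tokens are selected for prediction, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% kept unchanged. A minimal sketch of this corruption step (illustrative only; the actual retraining relies on the HuggingFace tooling rather than this function):

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_id, rng, mask_prob=0.15):
    """Corrupt a token-id sequence for masked language modeling.

    Returns (inputs, labels): labels hold the original id at corrupted
    positions and -100 (the usual ignore index) everywhere else, so the
    loss is computed only over the selected positions.
    """
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id                    # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```

Keeping 10% of the selected tokens unchanged forces the model to build informative representations for every input position, not only for [MASK] placeholders.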

Experiments and Results
Event extraction is framed as a token-level classification task. We adopt a joint strategy where triggers' and arguments' extents and labels are predicted at once (Nguyen et al., 2016). We used India-test to identify the best model (NEWS-BERT vs. PROTEST-ER) and the system's input granularity. With respect to the latter point, we investigate whether processing data at document or sentence level benefits the TLMs as a strategy to deal with limited training materials. We compare each configuration against a generic BERT counterpart. We fine-tune each model by training all the parameters simultaneously. All models are evaluated using the official script from the ProtestNews Lab: triggers and arguments are correctly identified only if both the extent and the label are correct. Only the best model and input format are applied to China.
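The strict matching criterion can be sketched as follows, assuming gold and predicted markables are represented as (label, start, end) spans; this illustrates the scoring logic only and is not the lab's official evaluation script:

```python
def strict_prf(gold_spans, pred_spans):
    """Precision/recall/F1 where a prediction counts only on an exact
    (label, extent) match: a correct extent with the wrong label, or a
    correct label with a partial extent, contributes nothing."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this criterion a participant span predicted one token short counts as both a false positive and a false negative, which is why argument scores are so sensitive to boundary errors.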
India data Results for India are illustrated in Table 3. In general, PROTEST-ER obtains better results than BERT and NEWS-BERT. Sentence qualifies as the best input format for PROTEST-ER, while document works best for NEWS-BERT and BERT.
The language variety of the data distributions used for DAR has a strong impact on the performance of fine-tuned systems, with NEWS-BERT being the worst model. The extra training was expected to make this model more suited to working with news articles than the corresponding generic BERT, yet the opposite is observed. This indicates that the selection of suitable data is an essential step for successfully applying DAR.
Globally, the results show that DAR has a positive effect on Precision, especially when sentences are used as input for fine-tuning the models. Positive effects on Recall can only be observed for PROTEST-ER.
With the exclusion of NEWS-BERT, the systems achieve satisfying results for the trigger component. Argument detection, as expected, is more challenging, with no model reaching an F1-score above 50%. PROTEST-ER always performs better, especially when processing the data at sentence level. In numerical terms, PROTEST-ER provides an average gain of 11.74 points. We observe a relationship between argument type frequency in the training data and the models' performance: the most frequent arguments, i.e., participant (26.43%), organizer (18.31%), and place (14.45%), obtain the best results. However, PROTEST-ER also improves performance on the least frequent argument types, i.e., loc (6.49%) and fname (5.85%), by 12.00 and 5.38 points on average, respectively, when compared to BERT.

China data Results for China are reported in Table 4. We applied only PROTEST-ER, keeping the distinction between document vs. sentence input. Although using sentences as input leads to the best results for India, the results of the document-input models are competitive, leaving open the question of whether processing the input at document level could be an effective strategy for model portability in event extraction. The results clearly indicate that PROTEST-ER is a competitive and fairly robust system. Interestingly, on the China data the best results are obtained when processing data at document level.

Looking at the portability of the event components, it clearly appears that arguments are more difficult than triggers. Indeed, the absolute F1-score of the best models for triggers is in the same range as that for India. When focusing on the arguments, the drops in performance severely affect all argument types, except for fname. We also observe that the biggest drops are registered for those arguments that are most likely to express domain-specific properties. For instance, the absolute F1-score difference between the best models for India and China is 39.79 points for place, 36.29 for organizer, and 27.11 for etime. By contrast, only a drop of 9.84 points is observed for participant, suggesting that the ways of indicating those who take part in a protest event (e.g., protesters or rioters) are closer across domains than expected.

Discussion and Conclusions
Our results indicate that DAR is an effective strategy for unsupervised domain adaptation. However, we show that not every data distribution matching a potential target domain has the same impact. In our case, we measure improvements only when using data that more directly targets the content of the task, i.e., protest events, possibly compensating for the limited training materials. We have also gathered interesting cues that processing data at document level can be an effective strategy for a sequence labeling task with small training data. We think that this approach allows the TLMs to benefit from processing longer sequences and to acquire richer contextual knowledge. However, more experiments on different tasks (e.g., NER) and with different training sizes are needed to test this hypothesis.
A further positive aspect of DAR is that it requires less training material to boost the system's performance, pointing to new directions for few-shot learning. We plotted the learning curves of BERT and PROTEST-ER using increasing portions of the training data. PROTEST-ER achieves an overall F1-score of ∼30% with only 10% of the training data, while BERT needs at least 30% to achieve comparable performance (see Appendix A.3).
Disappointingly, PROTEST-ER falls well behind the best model that participated in ProtestNews. Skitalinskaya et al. (2019) propose a Bi-LSTM-CRF architecture using FLAIR contextualized word embeddings (Akbik et al., 2018). They also adopt a joint strategy for trigger and argument prediction. PROTEST-ER obtains a better Precision only on China, for both the overall evaluation and triggers. Quite surprisingly, on India it is BERT that achieves better results on triggers, although the model appears to be quite unstable, as shown by the standard deviation. At this stage, it is still unclear whether these disappointing performances are due to the retraining (i.e., the need to extend the number of documents used) or to the small training corpus. Future work will focus on two aspects. First, we will further investigate the impact of the size of the training data when using TLMs; this will require experimenting with different datasets and tasks. Secondly, we will explore multilingual extensions of PROTEST-ER.

A.1 BERT-NEWS/PROTEST-ER Further Training
Preprocessing The unlabeled corpora of (protest-related) news articles from the TREC Washington Post Corpus version 3 are minimally preprocessed prior to the language model retraining phase. We use the full text of each news article, including the title. Document Creation Times are removed. We perform sentence splitting using spaCy (Honnibal et al., 2020).
Training details We further train the English BERT base-uncased for 100 epochs. We use a batch size of 64 through gradient accumulation. Other hyperparameters are illustrated in Table 6. Our TLM implementation uses the HuggingFace library (Wolf et al., 2020). The pretraining experiment was performed on a single Nvidia V100 GPU and took 8 days.

A.2 BERT/PROTEST-ER Fine-tuning
Table 7 shows the values of the hyperparameters used for fine-tuning BERT and PROTEST-ER. We used Tensorflow (Abadi et al., 2016) for the implementation and the HuggingFace library (Wolf et al., 2020) for the BERT embeddings and data loading. We used the CRF implementation available from the TensorFlow Addons package. The models are trained for a maximum of 100 epochs, using a constant learning rate of 2e-5; if the validation loss does not improve for 5 consecutive epochs, training is stopped. The best model is selected on the basis of the validation loss. We manually experimented with the learning rates 1e-5, 2e-5, and 3e-5. No other hyperparameter optimization was performed. We used the original train, validation, and test splits of the event extraction task of the 2019 CLEF ProtestNews Lab.
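The patience-based stopping rule described above can be sketched generically as follows (an illustration of the logic, not the authors' training code; actual training would compute each validation loss per epoch rather than receiving them as a list):

```python
def train_with_early_stopping(val_losses, max_epochs=100, patience=5):
    """Return (best_epoch, stop_epoch) given per-epoch validation losses.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs; the best model is the one with the
    lowest validation loss seen so far (model selection on val loss).
    """
    best_loss, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # stop early, keep the best checkpoint
    return best_epoch, min(len(val_losses), max_epochs) - 1
```

With patience 5, a model whose validation loss plateaus after epoch 2 stops at epoch 7 while the epoch-2 checkpoint is kept, even if a later improvement would eventually have occurred.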
We conducted all the experiments using the Google Colaboratory platform. The time required to run all the experiments on the free plan of Colaboratory is approximately 20 hours. Figure 1 graphically illustrates the base architecture.

A.3 BERT/PROTEST-ER Learning Curves
In the following graphs we plot the learning curves of the BERT and PROTEST-ER models on the India and China datasets. In both cases, we observe that PROTEST-ER obtains competitive scores using just 10% of the training data, suggesting that the TLM's representations are already shifted towards the protest domain. To obtain the same results, the generic BERT models need at least 30% of the training data when using documents as input, and 70% when using sentences.