Enhancing Crisis-Related Tweet Classification with Entity-Masked Language Modeling and Multi-Task Learning

Social media has become an important information source for crisis management and provides quick access to ongoing developments and critical information. However, classification models suffer from event-related biases and highly imbalanced label distributions, which still pose a challenging task. To address these challenges, we propose a combination of entity-masked language modeling and hierarchical multi-label classification as a multi-task learning problem. We evaluate our method on tweets from the TREC-IS dataset and show an absolute performance gain w.r.t. F1-score of up to 10% for actionable information types. Moreover, we found that entity-masking reduces the effect of overfitting to in-domain events and enables improvements in cross-event generalization.


Introduction
Messages on social media during disaster events have become an important information source in crisis management (Reuter et al., 2018). In contrast to traditional sources (e.g., official news), social media posts immediately provide details about developments, first-party observations, and affected people in an ongoing emergency situation (Sakaki et al., 2010). Having access to this information is crucial for developing situational awareness and supporting relief providers, government agencies, and other official institutions (Kruspe et al., 2021).
One key challenge is the information refinement of high-volume social media streams, which requires automatic methods for the reliable detection of relevant content (Kaufhold, 2021). Most recent work has focused on binary, multi-class, and multi-label text classification techniques to classify posts into coarse (e.g., Relevant, Irrelevant) or fine-grained (e.g., InfrastructureDamage, MissingPeople) categories composed of flattened or hierarchical structures (Alam et al., 2018b, 2021; Buntain et al., 2021).
Another challenge in Natural Language Processing (NLP) is the nature of data prevalent on social media and microblogging platforms. For example, most works in the crisis-related domain focus on Twitter data (Kruspe et al., 2021), which exhibits properties such as short texts (280-character limit per tweet), little contextual information, hashtags, and noise (e.g., misspellings, emojis) (Wiegmann et al., 2020; Zahera et al., 2021). According to Sarmiento and Poblete (2021), different types of disasters (e.g., flood, wildfire) can be identified by only a few text-based features. However, event-related biases and entities, as shown in Figure 1, prevent models from generalizing to unseen disaster events and therefore degrade detection performance.
To circumvent this problem, approaches such as adversarial training (Medina Maza et al., 2020), domain adaptation (Alam et al., 2018a), and hierarchical label embeddings (Miyazaki et al., 2019) have been proposed, but they suffer from mixed event types, assume unlabeled data, or require semantic label descriptions. In contrast to these works, we aim to enhance the detection of rare actionable information for unseen events by masking out entities, applying adaptive pre-training, and incorporating the hierarchical structure of labels.
Contributions Our main contributions are as follows: (1) We introduce an adaptive pre-training strategy based on entity-masking.
(2) We incorporate the hierarchical structure of labels as multi-task learning (MTL) problem.
(3) We empirically show that our approach improves generalization to new events and increases detection performance for actionable information types.

Related Work
Crisis Tweet Classification Besides conventional detection approaches such as filtering (Kumar et al., 2011) or crowdsourcing (Poblet et al., 2014), machine learning has received much attention in this area. Researchers experimented with several methods such as Naive Bayes, Support Vector Machines, and Decision Trees, either with term-frequency features (Habdank et al., 2017) or static embeddings (Kejriwal and Zhou, 2019). More recently, the combination of Word2Vec (Mikolov et al., 2013) with Convolutional and Recurrent Neural Networks achieved remarkable improvement in this field (Kersten et al., 2019; Snyder et al., 2019). Due to the success of Transformers (Vaswani et al., 2017) and the follow-up language models (Devlin et al., 2019), most works have built upon these and outperformed previous approaches (Alam et al., 2021; Wang et al., 2021).
Adaptive Pre-Training Transfer learning with language models essentially contributes to state-of-the-art results in a variety of NLP tasks (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020). Typically, such language models follow three training steps (Howard and Ruder, 2018; Ben-David et al., 2020): (1) Pre-training on massive corpora; (2) Optional pre-training on task-specific data; (3) Supervised fine-tuning on target tasks. However, the second step is often neglected due to computational constraints, although adaptive pre-training has been shown to be effective (Howard and Ruder, 2018). Hence, Gururangan et al. (2020) introduced domain-adaptive pre-training (DAPT) and task-adaptive pre-training (TAPT), which cover continual pre-training on corpora tailored for a specific task. Moreover, strategies such as adding special tokens for tweets (Nguyen et al., 2020; Wiegmann et al., 2020) or additional masked language modeling (MLM) approaches (Ben-David et al., 2020) have proven beneficial.
Hierarchical Multi-Label Classification Hierarchical multi-label classification (HMC) covers local and global approaches as well as combinations of both worlds (Wehrmann et al., 2018). A popular categorization of local methods is the subdivision into local classifier per parent node (LCPN) (Dumais and Chen, 2000), local classifier per node (LCN) (Banerjee et al., 2019), and local classifier per level (LCL) (Wehrmann et al., 2018). Hybrid approaches integrate the global part as a particular constraint such as hierarchical softmax (Brinkmann and Bizer, 2021) or combine multiple local and global prediction heads (Wehrmann et al., 2018).
Recent work in information type classification introduced label embeddings that utilize the hierarchical structure (Miyazaki et al., 2019). Finally, the classification can also be cast as MTL by combining certain loss functions (Yu et al., 2021; Wang et al., 2021).

TREC-IS
In this work, we mainly focus on the dataset of the shared task TREC-IS, which represents a collection of annotated crisis-related tweets (Buntain et al., 2021). Each tweet belongs to a disaster event and is annotated with high-level information types that are derived from an ontology composed of hierarchical stages. However, information type labels are only shipped as a two-level hierarchy with four upper classes L T and 25 lower classes L B. Thus, both hierarchy levels represent a multi-label classification task. Following the TREC-IS track design, we split the dataset into train and test events, which corresponds to the TREC-IS 2020B task. This split poses a challenging setup due to the requirement of cross-event generalization (Wiegmann et al., 2020).
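The two-level structure implies that every lower-class annotation determines an upper-class label. As a minimal sketch (the class names below are an illustrative subset of the ontology, not the full 25-label set), the L T targets can be derived from the annotated L B labels:

```python
# Illustrative subset of the two-level TREC-IS hierarchy: each lower
# class (L_B) has exactly one of the four upper classes (L_T) as parent.
PARENT = {
    "SearchAndRescue": "Request",
    "GoodsServices": "Request",
    "MovePeople": "CallToAction",
    "MultimediaShare": "Report",
    "News": "Report",
}

def upper_labels(lower_labels):
    """Derive the multi-label L_T targets implied by the L_B annotations."""
    return sorted({PARENT[label] for label in lower_labels})

print(upper_labels(["SearchAndRescue", "News"]))  # → ['Report', 'Request']
```

Because both levels are multi-label tasks, a single tweet may yield several upper-class targets at once.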

Method
As depicted in Figure 2, our approach combines two concepts: entity-masked language modeling (E-MLM) and MTL. In the following, we briefly describe our method as a combination of the two.

Entity-Masked Language Modeling
Based on adaptive pre-training, we extend the masked language modeling of a transformer encoder pre-trained on a large corpus, such as BERT (Devlin et al., 2019). Here, the mitigation of event-related biases is facilitated by replacing entities, which are prone to be event-specific, with special tokens (see Figure 2a). This way, we intend to capture disaster-related language patterns independently of the concrete entities. Following Ben-David et al. (2020), we further introduce a masking probability α tailored to entities in addition to the standard word masking with probability β. That is, with a typically higher probability α we select random entity tokens such as locations, and with a lower probability β we select random standard subword tokens. Finally, these selected tokens are replaced by [MASK], replaced by random tokens, or left unchanged in order to learn the linguistic patterns related to those entities. For the rest of this paper, we rely on the pre-trained BERT BASE as the encoder model and the corresponding default MLM setup for pre-training ([MASK] with 80%, random tokens with 10%, and unchanged tokens with 10%).
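The selection step can be sketched as follows; the entity-token names and the helper function are hypothetical, and the subsequent 80/10/10 replacement is omitted for brevity:

```python
import random

# Hypothetical entity-type tokens; in the paper, entities are replaced
# with such special tokens before pre-training (see Figure 2a).
ENTITY_TOKENS = {"[location]", "[person]", "[organization]"}

def select_for_masking(tokens, alpha=0.5, beta=0.1, rng=random):
    """Indices selected for the E-MLM objective: entity tokens are picked
    with probability alpha, all remaining (subword) tokens with beta."""
    selected = []
    for i, token in enumerate(tokens):
        prob = alpha if token in ENTITY_TOKENS else beta
        if rng.random() < prob:
            selected.append(i)
    return selected

tokens = ["fire", "near", "[location]", "reported", "by", "[person]"]
# With alpha=1.0 and beta=0.0, exactly the entity positions are chosen.
print(select_for_masking(tokens, alpha=1.0, beta=0.0, rng=random.Random(0)))
# → [2, 5]
```

Setting α > β biases the objective towards predicting entity slots from their context, which is the intended effect of E-MLM.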

Multi-Task Learning
The next step represents the fine-tuning of a classification head. We implement four basic hierarchical multi-label classification approaches, as shown in Figure 2b.

Evaluation Metric
We follow the TREC-IS evaluation scheme: macro-averaged F1-score across information types for the two hierarchy levels, in addition to the actionable information types (AIT) (McCreadie et al., 2019). The latter comprise rare information types with high priority: MovePeople, EmergingThreats, NewSubEvent, ServiceAvailable, GoodsServices, and SearchAndRescue.
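Since the macro average weights every information type equally, a rare actionable type contributes as much to the score as a frequent one. A minimal sketch from per-class true positive, false positive, and false negative counts:

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; defined as 0 when precision + recall is 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1 scores."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)

# A frequent, well-detected class and a rare, poorly detected class:
print(round(macro_f1([(90, 10, 10), (1, 1, 9)]), 3))  # → 0.533
```

The example illustrates why AIT performance is hard to improve: a single poorly detected rare class pulls the macro average down substantially.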

Named Entity Recognition
As event-specific entities, we use the special tokens hashtag, url, person, location, organization, event, address, phone number, date, and number. All entities except the tokens hashtag and url are extracted with the Natural Language API of the Google Cloud Platform (entities were extracted on 29 March 2022). We manually annotated 300 tweets and calculated a strict F1-score (Segura-Bedmar et al., 2013) of 0.692, which represents a reasonably good result for tweets.
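A simplified version of this preprocessing might look as follows; here, hashtags and URLs are handled by rules, while the NER spans are hard-coded stand-ins for the output of the external NER service:

```python
import re

def mask_entities(text, ner_spans):
    """Replace hashtags/URLs by rule and NER spans by their type token.

    `ner_spans` stands in for the output of an external NER system; the
    paper uses the Google Cloud Natural Language API for this step.
    """
    text = re.sub(r"https?://\S+", "[url]", text)
    text = re.sub(r"#\w+", "[hashtag]", text)
    for surface, entity_type in ner_spans:
        text = text.replace(surface, f"[{entity_type}]")
    return text

tweet = "Flooding in Springfield #flood http://t.co/abc"
print(mask_entities(tweet, [("Springfield", "location")]))
# → Flooding in [location] [hashtag] [url]
```

Note that NER errors propagate directly into the masked input, which is why we report the strict F1-score of the extraction step above.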

Baseline and Hyper-Parameters
As baseline, we use TF-IDF with Logistic Regression (TF-IDF+LR) and BERT BASE with a single-task classification head. Furthermore, we apply the standard MLM of BERT in contrast to E-MLM in order to validate the effect of masking entities. Lastly, we train the MTL model (MTL prio) from Wang et al. (2021), which combines lower classes as a classification task and priority scores as a regression task. We choose the best hyper-parameters for each model based on a stratified split with a ratio of 90% for train and 10% for development data, respectively. In terms of hyper-parameters, we set α = 0.5 and β = 0.1 for E-MLM; other parameters were set according to related work, including a learning rate of 5e−5, a batch size of 32, and λ = 0.1 for fine-tuning. The detailed hyper-parameter selection process is shown in Appendix B.
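One simple way to realize such a stratified split is to hold out a fixed fraction of each label group; this is a sketch with a hypothetical helper, as the exact splitting procedure beyond the 90/10 ratio is not specified:

```python
import random

def stratified_split(examples, key, ratio=0.9, seed=13):
    """Hold out ~(1 - ratio) of each label group for development.

    `key` extracts the stratification label; for multi-label data one
    could stratify on, e.g., the rarest label of each example.
    """
    rng = random.Random(seed)
    groups = {}
    for example in examples:
        groups.setdefault(key(example), []).append(example)
    train, dev = [], []
    for group in groups.values():
        rng.shuffle(group)
        cut = int(len(group) * ratio)
        train.extend(group[:cut])
        dev.extend(group[cut:])
    return train, dev

data = [{"label": "News"}] * 20 + [{"label": "MovePeople"}] * 10
train, dev = stratified_split(data, key=lambda ex: ex["label"])
print(len(train), len(dev))  # → 27 3
```

Stratification matters here because of the heavy label imbalance: a plain random 90/10 split could leave rare actionable types entirely absent from the development data.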

Results
In the following, we report the performance for the upper classes L T, lower classes L B, and AIT. Overall, the MTL classification outperforms the single-task models for the actionable categories and, with the exception of LCPN, also for the L B classes. We assume that the L T classification objective implicitly clusters the internal representation w.r.t. the high-level information types and therefore mitigates overfitting towards the major classes. As depicted in Figure 3, the HMCN local model improves the detection of rare actionable information types over the single-task model while at the same time decreasing the performance on the category with the most information types. This can be caused by ambiguous label definitions and semantic similarities with other information types (Mehrotra et al., 2022).

Analysis of Events
In Figure 4, we illustrate the model performance for L B across different event types. For multiple events, we report the mean and standard deviation, respectively. We observe an increase in performance for the event types covid, shooting, typhoon, storm, tornado, and flood, and a small decrease for the event types fire, hostage, and explosion. As shown by the variance for multiple events, performance differs considerably across specific events. Surprisingly, the event type covid achieved the worst performance for both models despite the existence of three covid events within the train data. These results indicate that even regional differences within the same global event predominantly affect the generalization performance across events.

Ablation Study
As an ablation study, we removed several proposed components to assess their impact on model performance. The component entities represents the additional special tokens and their replacement within the input text. As shown in Table 4, we started with the HMCN local model and demonstrate that entities, MLM, and MTL each contribute to an increase in F1-score for both L B and AIT. The results indicate that the variant without the hierarchical component only degrades the performance for the low-resource actionable information types. Removing the E-MLM mechanism degrades the model's performance the most in our experiments.

Conclusion and Future Work
In this work, we identified shortcomings in the field of crisis tweet classification for unseen events. For the TREC-IS data, we found contrasting effects in terms of pre-training and observed an absolute improvement of up to 3% w.r.t. F1-score for actionable information types by incorporating the hierarchical structure. Furthermore, we confirmed the effectiveness of our method on the shared task TREC-IS. Future work includes pre-training on a larger corpus, mitigating the trade-off between major- and minor-class performance, and analysing the influence of label semantics.

Ethical and Societal Implications
Open Source Intelligence (OSINT) plays a significant role for various authorities and NGOs in addressing challenges in global health, human rights, and crisis management (Bernard et al., 2018; Evangelista et al., 2021; Kaufhold, 2021). Following the view of OSINT as a tool, our work pursues the goal of supporting relief providers, government agencies, and other disaster-response stakeholders during ongoing and evolving crisis events.
We argue that NLP for disaster response can have a positive impact on comprehensive situational awareness and on decision-making processes such as the coordination of particular services or physical goods. In the context of this work, positive impact means supplementing traditional information sources with social media streams that enable faster access to ongoing developments, first-party observations, and more fine-grained information content. For example, NLP for social media can enrich the information with the public as co-producers, which may reveal critical sub-events such as missing or trapped people (Li et al., 2018). Retrieving this kind of information could positively affect disaster management strategies and relief efforts during natural and human-made disasters.
In contrast, relying on social media as an information source runs the risk of introducing mis- and disinformation. This can cause adverse effects on relief efforts and requires tailored strategies and particular care before the deployment of such models. Furthermore, data privacy issues may arise due to the inherited properties of social media data. Various anonymization processes should be taken into account for identifying and neutralizing sensitive references (Medlock, 2006). In this work, the use of entity tokens as a categorization can be seen as one kind of anonymization procedure. However, model training with such entities could be task-specific and prone to error propagation from named entity recognition systems.

A Overview of Information Types
We list all information types of the TREC-IS dataset in Table 5. The value in the last column indicates the number of Twitter posts to which the corresponding labels were assigned. Table 6 displays example tweets for various events with the corresponding labels from the TREC-IS dataset.

B Hyper-Parameters
The search space for TF-IDF+LR included the n-gram range, the maximum number of features, and the regularization strength. In terms of BERT fine-tuning, we manually experimented with the same parameters as in Wang et al. (2021). Similar to Ben-David et al. (2020), we experimented with the MLM probabilities α ∈ {0.1, 0.3, 0.5, 0.8} and β ∈ {0.1, 0.3, 0.5, 0.8} and found the setup α = 0.5 and β = 0.1 to perform best, which is in line with the empirically good results of Ben-David et al. (2020). For MTL, we tuned λ ∈ {0.1, 0.5, 0.9} and finally set λ = 0.1. We trained all transformer models with the Transformers library (Wolf et al., 2020) and AdamW for up to 50 (pre-training) and 15 (fine-tuning) epochs, evaluated the performance every 1000 steps on the development set, and selected the best performing checkpoint. Unless otherwise mentioned, we used the default setup of BERT BASE from the Transformers library for the remaining hyper-parameters.

Figure 1 :
Figure 1: Example tweets of several disasters over time, annotated with entities. The short posts are mostly biased towards specific events.

Figure 2 :
Figure 2: Illustration of the concepts E-MLM with named entity recognition (NER) and MTL. FC MLM represents the prediction head for MLM. The classification heads are placed on top of the pre-trained encoder. The building blocks Pooler, L T, L B, and L G are fully connected layers and use the CLS token as sentence embedding.
The LCL classification head (Figure 2b) jointly trains a flattened classification layer for each of the two hierarchy levels. In contrast, the LCPN model consists of a classification layer for each parent node. The hierarchical multi-label classification network (HMCN) is adapted from Wehrmann et al. (2018) and introduces a pooling layer on top of the preceding pooling layer. We experiment with a local and a global variant, where the global one additionally consists of a global classification layer. All pooling and classification layers are composed of a single feed-forward layer with tanh and sigmoid as activation functions, respectively. Finally, we minimize the binary cross-entropy L_MTL = λ L_{L_T} + (1 − λ) L_{L_B} as a weighted loss function, where L_{L_T} represents the upper-class loss and L_{L_B} the lower-class loss.
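The weighted objective can be sketched in plain Python; this is illustrative, operating on probabilities for a single example rather than on logits inside a training framework:

```python
import math

def binary_cross_entropy(targets, probs):
    """Mean binary cross-entropy over the labels of one hierarchy level."""
    eps = 1e-12  # numerical guard for log(0)
    total = sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(targets, probs))
    return -total / len(targets)

def mtl_loss(top_targets, top_probs, bottom_targets, bottom_probs, lam=0.1):
    """Weighted objective L_MTL = lam * L_{L_T} + (1 - lam) * L_{L_B}."""
    return (lam * binary_cross_entropy(top_targets, top_probs)
            + (1 - lam) * binary_cross_entropy(bottom_targets, bottom_probs))

# Two upper-class labels and one lower-class label for a single example:
print(round(mtl_loss([1, 0], [0.9, 0.1], [1], [0.8]), 3))  # → 0.211
```

With λ = 0.1, as in our setup, the lower-level loss dominates the objective; λ = 0.5 would weight both hierarchy levels equally.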
In line with Wang et al. (2021), we selected a learning rate of 5e−5 and a batch size of 32. Due to computational constraints, we used the TAPT parameters of Gururangan et al. (2020) for BERT pre-training.

Table 1 :
Overview of the dataset split; the values within the brackets of the upper classes correspond to the number of unique low-level information types.

Table 1
gives an overview of each split; evidently, the information type distribution is highly imbalanced. For example, information types with low criticality such as MultimediaShare (31.7%) and News (25.4%) are prevalent. In contrast, the highly critical information types MovePeople (0.9%) and SearchAndRescue (0.4%) occur only rarely (McCreadie et al., 2019).

Table 2 :
Overall results on the development set.

Table 3 :
Overall results of information type classification; bold and underlined values indicate the best and second-best results, respectively. *We fine-tuned the approach of Wang et al. (2021) with BERT BASE and without ensembling.

Figure 4: Comparison across event types w.r.t. F1-score between the BERT BASE and HMCN local models. We plot the mean and standard deviation for multiple events within an event type.
For our evaluation, we do not focus on L T since the experiments did not show large differences across all BERT models. The MTL models are only reported with BERT E−MLM.
Multi-Task Learning In terms of MTL, the HMCN local model achieved the best results for AIT; overall, the MTL classification outperforms the single-task models.

Table 4 :
Overall results of the ablation study.

Table 5 :
Information types and hierarchical structure of labels.

Table 6 :
Example tweets and labels for different events.