HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response

Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can greatly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable the creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) together with the classes assigned to each entry, annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of experiments with pre-trained language models (PLMs) to establish strong baselines for future research in this domain. The dataset is available at https://blog.thedeep.io/humset/.


Introduction
During humanitarian crises caused by events ranging from natural disasters and wars to epidemics such as COVID-19, a timely and effective humanitarian response depends heavily on fast and accurate analysis of relevant data to yield key information. Early in the response phase, namely in the first 72 hours after a disaster strikes, humanitarian response analysts in international organizations1 review large amounts of data loosely or strongly relevant to the crisis to gain situational awareness.
1 Such as the International Federation of Red Cross (IFRC), the United Nations High Commissioner for Refugees (UNHCR), or the United Nations Office for the Coordination of Humanitarian Affairs (UNOCHA).

A large portion of this data appears in the form of secondary data sources, i.e., reports, news, and other forms of text data, and is integral in revealing which types of relief activities to undertake. Analysis in this phase involves extracting key information and organizing it according to sets of pre-defined domain-specific structures and guidelines, referred to as humanitarian analysis frameworks.
While typically only small workforces are available to analyze such information, an automatic document processing system can significantly help analysts save time in the overall humanitarian response cycle. To facilitate such systems, we introduce and release HumSet, a unique and rich dataset of document analysis in the humanitarian response domain. HumSet is curated by humanitarian analysts and covers various disasters around the globe that occurred from 2018 to 2021 in 46 humanitarian response projects. The dataset consists of approximately 17K annotated documents in three languages (English, French, and Spanish), originally taken from publicly available resources.2 For each document, analysts have identified informative snippets (entries) with respect to common humanitarian frameworks and assigned one or more classes to each entry (details in §2).
HumSet provides a large dataset for the training and evaluation of entry extraction and classification models, enabling the research and development of further NLP systems in the humanitarian response domain. We take a first step in this direction by studying the performance of a set of strong baseline models (details in §3). Our released dataset expands the collection previously provided by Yela-Bello et al. (2021) with a more recent and comprehensive set of projects, as well as additional classification labels. Among other similar datasets in the humanitarian domain, Imran et al. (2016) present human-annotated Twitter corpora collected during 19 different crises between 2013 and 2015, Alam et al. (2021) provide a combination of various existing social-media crisis-related datasets, and Adel and Wang (2020) and later Alharbi and Lee (2021) publish Arabic Twitter classification datasets for crisis events. In contrast to these resources, which mostly originate from social media, HumSet is created by humanitarian experts through an annotation process on official documents and news from the most recognized humanitarian agencies, conferring high reliability, continuous updating, and accurate geolocation information.

HUMSET Dataset
The collection originated from a multi-organizational platform called the Data Entry and Exploration Platform (DEEP),3 developed and maintained by Data Friendly Space (DFS).4 The platform facilitates the classification of primarily qualitative information with respect to analysis frameworks and allows for collaborative classification and annotation of secondary data. The dataset is available at https://blog.thedeep.io/humset/.

Dataset Overview
HumSet consists of data used to inform 46 humanitarian response operations across the globe: 24 responses were in Central/South America, 14 in Africa, and 8 in Asia (detailed countries can be found in Table 6 in the Appendix).
For each project, documents related to a particular humanitarian crisis, referred to as leads, are collected, analyzed, and annotated. The annotated documents in the dataset mostly consist of recently released information, with 79% of the documents released in 2020 and 2021 (Table 5 in the Appendix), and 90% of all documents sourced from websites (see Table 4 in the Appendix for the most commonly used platforms). Documents are selected from different sources, ranging from official reports by humanitarian organizations to international and national media articles. Overall, the documents consist of files in PDF format (70.4%) and HTML pages (29.6%), with an average length of ∼2K words. The number of documents analyzed per project varies, ranging from 2 to 2,266.
The relevant snippets of text, referred to as entries, in each document are annotated by humanitarian experts. The dataset provides an average of ∼10 entries per document, with an average length of ∼65 words per entry. Overall, HumSet comprises 148,621 tagged entries, selected from 16,857 documents, in three languages: English (61.3%), French (20.4%), and Spanish (18.3%). The list of projects, as well as the number of documents and annotated entries per project, is reported in Table 7 in the Appendix. Figure 1 shows the distribution of the number of tagged documents per project, as well as the number of tokens per document and entry.

Humanitarian Analysis Frameworks and Data Annotation Process
The concept of analytical frameworks originated in the social sciences (Ragin and Amoroso, 2011), but can be considered foundational and indispensable in numerous research fields. In the humanitarian domain, the use of such frameworks not only supports crisis response and disaster relief but also enables various groups to share resources (Zhang et al., 2002). When starting a response or project, humanitarian organizations create, or more often reuse, an existing analysis framework, which covers both the generic and the specific needs of the work. Our data originally contained 11 different frameworks. As the frameworks are highly similar to one another, we created a common framework, which we refer to as the humanitarian analysis framework; it covers the framework dimensions of all projects. We build our custom set of tags by mapping the original tags of the other frameworks to ours. More specifically, our analysis framework consists of three categories: Sectors (11 tags), Subpillars 1D (33 tags), and Subpillars 2D (18 tags). Subpillars 1D and 2D have a hierarchical structure, consisting of a two-level tree hierarchy (Pillars to Subpillars).
The list and the number of tags present for each category are reported in Table 1.
For each project, documents relevant to understanding the situation, unmet needs, and underlying factors are captured and uploaded to the DEEP platform. From these sources, entries of text are selected and categorized according to an analysis framework. Humanitarian annotators are trained, per project, to follow analytical standards and structured thinking when reviewing secondary data.
This process eventually results in annotating and organizing the data according to the humanitarian analysis framework. As the HumSet dataset is created in a real-world scenario, the distribution of annotated entries is skewed, with 33 tags being present in less than 2% of the data. Tables 10, 11, and 12 in the Appendix show the detailed numbers and proportions of annotated entries in Sectors, Subpillars 1D, and Subpillars 2D, respectively. Figure 2 in the Appendix reports the distribution of tags in the dataset.

NLP Tasks
Entry Extraction Task. The first step for humanitarian taggers in analyzing a document is finding the entries containing relevant information. A piece of text is considered relevant if it meaningfully matches at least one tag present in the given humanitarian analytical framework. Since documents often contain a large amount of information (Figure 1), automating the process of entry identification is extremely beneficial, and this is the first task of this research. It can be seen as an extractive summarization task, i.e., selecting from the given document a subset of passages that contain relevant information. However, the entries do not necessarily follow common units of text such as sentences or paragraphs and can appear in various lengths. In fact, only 38.8% of entries consist of full sentences; the rest are snippets that are shorter or longer than sentences. This limits the direct applicability of prior approaches to extractive summarization (Liu and Lapata, 2019; Zhou et al., 2018), and makes the task particularly challenging for NLP research.
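Because entries need not coincide with sentence boundaries, a system has to reason at the token level rather than over sentences. The following sketch illustrates this framing; the function name and toy data are our own illustration, not part of the HumSet release:

```python
# Illustrative sketch: project annotated entry character spans onto
# per-token binary relevance labels. Names and toy data are hypothetical.

def token_labels(text, entry_spans):
    """Label each whitespace token 1 if it overlaps any entry span, else 0."""
    tokens, labels, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)  # locate the token in the raw text
        end = start + len(tok)
        pos = end
        tokens.append(tok)
        # overlap test between [start, end) and any annotated span [s, e)
        labels.append(1 if any(start < e and s < end for s, e in entry_spans) else 0)
    return tokens, labels

doc = "Flooding displaced 3,000 people. Roads remain open."
spans = [(0, 32)]  # suppose the first sentence was tagged as a relevant entry
toks, labs = token_labels(doc, spans)  # labs -> [1, 1, 1, 1, 0, 0, 0]
```

Under this view, an entry covering half a sentence simply yields a label sequence that switches mid-sentence, which sentence-level extractive summarizers cannot express.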
Multi-label Entry Classification Task. After selecting the most relevant entries within a document, the next step is to categorize them according to the humanitarian analysis framework (Table 1). An automatic suggestion of which tags to choose from among the large number of possibilities can be decisive in speeding up the annotation process. For each category, more than one tag can be assigned to an entry; hence, we view this task as multi-label classification.

Entry Extraction
We evaluate the performance of the entry extraction task using ROUGE-1, ROUGE-2, and ROUGE-L F1 scores (Lin, 2004). The target text (ground truth) is the concatenation of all relevant entries, and the predicted text is the concatenation of all entries predicted as relevant. We consider a simple heuristic method (LEAD4), as well as Transformer-based (Vaswani et al., 2017) pre-trained language models (PLMs) with a multilingual backbone, as our baselines, explained in the following. LEAD4: LEAD-n is a simple baseline in which the first n sentences are predicted as being relevant entries. Consistent with prior work (Yela-Bello et al., 2021), we choose n = 4. Transformers: to approach the task using Transformer-based PLMs, we formulate it as a token classification problem. The objective is to distinguish between tokens that are part of relevant entries and tokens that are not. For simplicity, we fine-tune the entire model and perform binary classification using a two-layer prediction head on top of the contextualized representation of each token. We conduct our experiments with XtremeDistil-l6-h256 (Mukherjee and Hassan Awadallah, 2020) and XLM-R Base (Conneau et al., 2019) as the underlying PLMs.
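The LEAD-n heuristic can be sketched as follows (a minimal illustration; the regex-based sentence splitter is a simplistic stand-in, not the splitter used in the actual preprocessing):

```python
import re

def lead_n(document, n=4):
    """LEAD-n baseline: predict the first n sentences as the relevant entries."""
    # naive sentence split on terminal punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return sentences[:n]

doc = "One. Two. Three. Four. Five. Six."
print(lead_n(doc))  # -> ['One.', 'Two.', 'Three.', 'Four.']
```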
The evaluation results of these methods are reported in Table 3. Among our baselines, the model with XLM-R Base shows the best overall performance. However, these experiments should be considered starting points; improvements on this task are expected from increased model capacity and architectural variations.

Entry Classification
We test different multi-label sequence classification models applied to our five categories. We use the precision and F1-score metrics to assess the performance of the models on each subcategory. We report macro-averages of the metrics, as the tags are unbalanced (see Table 2) and macro-averaging provides a more nuanced view of performance, especially by giving weight to the sparser classes. Finally, we tune the threshold of the classification decision boundary with respect to the macro-averaged F1-score for each label of each category (Pillai et al., 2013). The threshold is tuned by finding the optimal value on the validation set, which is then used to make classifications on the test set. We conduct experiments using fastText (Joulin et al., 2016), as well as Transformer-based PLMs, as explained below.
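The per-label threshold tuning can be sketched as follows (a hypothetical minimal implementation; the grid and helper names are our own, and in practice the search runs over each label's validation-set output probabilities):

```python
def best_threshold(scores, gold, grid=None):
    """Pick the decision threshold that maximises F1 for one label on
    validation data; the chosen value is then reused on the test set."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

    def f1(th):
        pred = [s >= th for s in scores]
        tp = sum(p and g for p, g in zip(pred, gold))
        fp = sum(p and not g for p, g in zip(pred, gold))
        fn = sum(g and not p for p, g in zip(pred, gold))
        if tp == 0:
            return 0.0
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        return 2 * prec * rec / (prec + rec)

    return max(grid, key=f1)

# toy example: validation scores for one tag and their gold labels
th = best_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # -> 0.25
```

Tuning a separate threshold per label matters for macro-averaged F1, since a single global cut-off tends to suppress the sparse tags entirely.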
fastText: an open-source library for text representation and classification, consisting of a bag-of-n-grams representation and a linear classifier. fastText classification is language-agnostic and does not need language-specific pre-trained word vectors, allowing us to train a multilingual classifier as a simple baseline. To handle multiple labels, we train independent binary classifiers for each label. Transformers: for consistency with the previous task, we fine-tune the same multilingual PLMs, adding a dense layer on top for multi-label classification.
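The one-binary-classifier-per-label setup can be sketched as follows (a hypothetical sketch in which a toy keyword-matching classifier stands in for fastText; none of these names come from the fastText API):

```python
class KeywordClf:
    """Toy stand-in for a per-label binary classifier: predicts positive
    if the input shares a word with any positive training example."""
    def fit(self, texts, y):
        self.vocab = set()
        for text, label in zip(texts, y):
            if label:
                self.vocab.update(text.lower().split())
    def predict(self, text):
        return bool(self.vocab & set(text.lower().split()))

class OneVsRest:
    """Train one independent binary classifier per tag (one-vs-rest)."""
    def __init__(self, make_clf):
        self.make_clf = make_clf
        self.clfs = {}
    def fit(self, texts, tag_sets, tags):
        for tag in tags:
            y = [tag in ts for ts in tag_sets]  # binary target for this tag
            clf = self.make_clf()
            clf.fit(texts, y)
            self.clfs[tag] = clf
    def predict(self, text):
        return {tag for tag, clf in self.clfs.items() if clf.predict(text)}

model = OneVsRest(KeywordClf)
model.fit(["flood water rising", "schools closed for weeks"],
          [{"wash"}, {"education"}], ["wash", "education"])
tags = model.predict("flood damage reported")  # -> {"wash"}
```

The same reduction applies to the Transformer baseline, except that the dense output layer predicts all labels jointly with one sigmoid per tag instead of training separate models.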
Table 2 reports the evaluation results of the mentioned baseline models. For comparison, a random baseline is also reported: a stratified random classification created based on the distribution of the classes in the training set. As in the entry extraction task, XLM-R Base outperforms the other baselines. Although the overall results are promising, we highlight the shortcomings of the models on the categories with many tags (Subpillars 1D and Subpillars 2D), suggesting future research directions for addressing these challenges.

Conclusion
We presented HumSet, a new dataset of annotated humanitarian data containing 148,621 entries with a total of 62 different tags. We defined two NLP tasks that can be applied to it and provided initial experiments on these applications. HumSet is a multilingual, human-annotated humanitarian text dataset not composed of social network data, providing a valuable and highly reliable resource for the development of automation tools for crisis response and humanitarian aid activities.

Limitations
HumSet is composed of an aggregation of 46 different projects, each with a different contribution in terms of data quantity and topics (Table 7). This can introduce an implicit bias due to the different goals and themes of each project, and to how the respective analysis frameworks are understood and interpreted by humanitarian annotators; Röttger et al. (2021) refer to this as persistent subjectivity. This is a complex and challenging limitation. Geva et al. (2019), for example, show how this kind of bias can be monitored by using annotator identifiers as features when training NLP models on data produced by crowdsourcing projects (Sheng and Zhang, 2019). Since HumSet is the product of a real-world application rather than of crowdsourcing, a more structured analysis of these aspects is needed.
Another complexity lies in the raw data sources. Lead text is the result of a text extraction process from PDF and HTML files (cf. Section 2.1). In both cases, converting a visually rich graphical text representation into plain text involves errors and limitations. Several works propose solutions for layout-aware text extraction from digital documents (Ramakrishnan et al., 2012; Zhu and Cole, 2022), but they are often domain-specific, applying only to particular types of documents. Xu et al. (2020) propose a Transformer-based multi-modal architecture for document understanding using text, layout, and image data as features. Improvements in document processing could yield better data quality and subsequently improve performance on the entry extraction task (Section 3.1).
Finally, we should point out that HumSet might contain societal biases and stereotypes and/or over-represent particular demographics or entities. This issue has been observed and studied in several other data resources and scenarios (Bolukbasi et al., 2016; Krieg et al., 2022a; Rekabsaz et al., 2021b), and can lead to reflecting or even exaggerating societal biases in a system's output (Melchiorre et al., 2021; Rekabsaz and Schedl, 2020), which may negatively affect users' perception and interaction behavior (Krieg et al., 2022b). Hence, when using the dataset (particularly for real-world applications), we strongly recommend first defining and monitoring such potential biases (De-Arteaga et al., 2019; Rekabsaz et al., 2021c), and then mitigating them using methods proposed in the literature (Elazar and Goldberg, 2018; Zmigrod et al., 2019; Rekabsaz et al., 2021a; Zerveas et al., 2022; Ganhör et al., 2022).

Acknowledgement
We want to thank the humanitarian community users of the Data Entry and Exploration Platform (DEEP) for their openness and interest in sharing their data for this research, and for trusting that the NLP community can help them improve their work. We especially want to extend our gratitude to the people working in USAID's Bureau for Humanitarian Assistance (BHA) for entrusting Data Friendly Space (DFS) with a grant to create the DEEP and to work on this paper; to the DEEP users and taggers for the work that made this dataset possible; to the project owners in DEEP, and their organizations, who allowed us to use the data to create this dataset; and to the DEEP board members: the Internal Displacement Monitoring Centre (IDMC), the International Federation of the Red Cross, iMMAP, the Office of the High Commissioner for Human Rights, Okular Analytics, the United Nations Office for the Coordination of Humanitarian Affairs (UNOCHA), the United Nations High Commissioner for Refugees (UNHCR), the United Nations Children's Fund (UNICEF), the United Nations Development Coordination Office (UNDCO), and the Danish Refugee Council (DRC). We also thank the whole DFS team around the world, the ToggleCorp team, and our partner institutions, the ISI Foundation and Johannes Kepler University Linz, for their continuous support in making the use of NLP possible in the humanitarian community.

A Additional Statistics
Figure 1: (a) Distribution of documents per project. (b) Log-scale distribution of tokens5 per document. (c) Log-scale distribution of tokens per entry.

Figure 2 :
Figure 2: Proportion of tags in the dataset. This figure shows the unbalanced nature of the dataset. Each bar represents a different tag, and the y-axis shows the proportion of entries that contain each tag. The horizontal line marks the 2% occurrence level, visualizing the relatively high number of tags occurring in less than 2% of entries.

Table 1 :
Overview of humanitarian analysis framework.

Table 4 :
The most frequently sourced websites by number and proportion of documents.

Table 5 :
Publishing year of documents.

Table 6 :
Countries of projects per region.

Table 7 :
Key statistics per project.

Table 10 :
Proportion of sectors in the dataset.

Table 11 :
Proportion of each pillar and subpillar 1D in the dataset.

Table 12 :
Proportion of each pillar and subpillar 2D in the dataset.