CrisisLTLSum: A Benchmark for Local Crisis Event Timeline Extraction and Summarization

Social media has increasingly played a key role in emergency response: first responders can use public posts to better react to ongoing crisis events and deploy the necessary resources where they are most needed. Timeline extraction and abstractive summarization are critical technical tasks for leveraging large numbers of social media posts about events. Unfortunately, there are few datasets for benchmarking technical approaches to those tasks. This paper presents CrisisLTLSum, the largest dataset of local crisis event timelines available to date. CrisisLTLSum contains 1,000 crisis event timelines across four domains: wildfires, local fires, traffic, and storms. We built CrisisLTLSum using a semi-automated cluster-then-refine approach to collect data from the public Twitter stream. Our initial experiments indicate a significant gap between strong baselines and human performance on both tasks. Our dataset, code, and models are publicly available.


Introduction
We present CrisisLTLSum, the first dataset on extraction and summarization of local crisis event timelines from Twitter. An example of an annotated timeline in CrisisLTLSum is shown in Figure 1. A timeline is a chronologically sorted set of posts, where each brings in new information or updates about an ongoing event (such as a fire, storm, or traffic incident). CrisisLTLSum supports two complex downstream tasks: timeline extraction and timeline summarization. As shown in Figure 1, the timeline extraction task is formalized as: given a seed tweet as the initial mention of a crisis event, extract relevant tweets with updates on the same event from the incoming noisy tweet stream. This task is crucial for real-time event tracking.
Figure 1: A sample annotated timeline from CrisisLTLSum. The noisy timeline is the set of tweets sorted chronologically. The first tweet is the seed of the event. A check mark means that the tweet is annotated to be part of the timeline, and a cross indicates that the tweet is excluded. The reason for the exclusion is written under the mark.
The timeline summarization task aims to generate abstractive summaries of evolving events by aggregating important details from temporal and incremental information.
CrisisLTLSum can facilitate research in two directions: 1) NLP for Social Good (crisis domain), and 2) natural language inference and generation, i.e., the timeline extraction and summarization tasks. Here, we discuss the importance of CrisisLTLSum and how it differs from previous work along both of these aspects. Toward the first direction, the extraction of real-time crisis-relevant information from microblogs (Zhang and Eick, 2019; Mamo et al., 2021) plays a vital role in providing time-sensitive information to help first responders understand ongoing situations and plan relief efforts accordingly (Sarter and Woods, 1991). CrisisLTLSum goes beyond the task of categorizing each single crisis-relevant post independently (Imran et al., 2013; Olteanu et al., 2014; Imran et al., 2016; Alam et al., 2018; Wiegmann et al., 2020; Alam et al., 2021a,b) and enables a more challenging task: extracting new updates of an ongoing crisis event from incoming posts and summarizing them with respect to the important event details. This can help provide time-sensitive updates while avoiding missing critical information in the bulk of microblog posts due to the high volume of redundant and noisy information (Alam et al., 2021a). To the best of our knowledge, this is the first annotated dataset for such an extraction task, although the problem has been tackled before in unsupervised settings (Zhang et al., 2018).
Moreover, we focus on the extraction of local crisis events. The term "local" indicates that an event is bound to an exact location, such as a building, a street, or a county, and usually lasts for a short period. Building a corpus of local crisis events is particularly useful for first responders but also challenging because the timelines of these events are often not captured in existing knowledge sources. This means one has to design mechanisms for automatically detecting and tracking events directly from the Twitter stream, which is especially hard for existing clustering methods (Guille and Favre, 2015; Asgari-Chenaghlu et al., 2021) given the low number of available tweets for each local event.
For the second direction, CrisisLTLSum enables NLP research on the complex tasks of timeline extraction and abstractive summarization. These tasks are particularly challenging in the context of social media. First, the process of identifying and extracting relevant updates for a specific event has to contend with a large volume of noise (Alam et al., 2021a) and an informal tone (Rudra et al., 2018) compared to other domains such as news. Additionally, summarizing an ongoing event supports a quick and better understanding of its progress. This requires a good level of abstraction, with important details covered and properly presented (e.g., the temporal order of event evolution). CrisisLTLSum is the first dataset to provide human-written timeline summaries to support research in this direction.
CrisisLTLSum is developed through a two-step semi-automated process to create 1,000 local crisis timelines from the public Twitter stream. To the best of our knowledge, this is the first timeline dataset focusing on "local" crisis events, and the one with the largest number of unique events. The contributions of this paper are as follows: • We propose CrisisLTLSum, the largest dataset of local crisis event timelines. Notably, this is the first benchmark for abstractive timeline summarization in the crisis domain or on Twitter. • We develop strong baselines for both tasks, and our experiments show a considerable gap between these models and human performance, indicating the importance of this dataset for enabling future research on extracting timelines of crisis event updates and summarizing them.

Related Work
Our work in this paper is related to two main directions of crisis domain datasets for NLP and timeline summarization.
Crisis Datasets for NLP: Prior research has investigated generating datasets from online social media (e.g., Twitter) on large-scale crisis events, providing labels for event categories (Wiegmann et al., 2020; Imran et al., 2013), humanitarian types and sub-types (Olteanu et al., 2014; Imran et al., 2016; Alam et al., 2018; Wiegmann et al., 2020; Arachie et al., 2020; Alam et al., 2021a,b), actionable information (McCreadie et al., 2019), or witness levels (Zahra et al., 2020) of each crisis-related post. While existing datasets on crisis event timelines (Binh Tran et al., 2013; Tran et al., 2015; Pasquali et al., 2021) are limited to a small set of large-scale events, CrisisLTLSum covers a thousand timelines compared to only tens of events covered by each of the existing datasets. Additionally, we go beyond simple tweet categorization by enabling the extraction of information that includes updates on the events' progress.
Timeline Summarization: Timeline summarization (TLS) was first proposed by Allan et al. (2001), who extract a single sentence from the news stream of an event topic. In general, the TLS task aims to summarize a target's evolution (e.g., a topic or an entity) in a timeline (Martschat and Markert, 2018; Ghalandari and Ifrim, 2020). Existing TLS approaches are mainly extractive and are often grouped into several categories. For instance, Update Summarization (Dang et al., 2008; Li et al., 2009) aims to update the previous summary given new information at a later time, while Timeline Generation (Yan et al., 2011; Tran et al., 2015; Martschat and Markert, 2018) aims to generate itemized summaries as the timeline, where each item is extracted by finding important temporal points (e.g., spikes, changes, or clusters) and selecting representative sentences. Another category, Temporal Summarization, was first proposed in a TREC shared task (Aslam et al., 2013) with follow-up work (Kedzie et al., 2015); it targets extracting sentences from a large volume of news streams and social media posts as updates for large events. Temporal Summarization is close to the first task (Timeline Extraction) proposed in CrisisLTLSum.
There have been a few recent works on abstractive timeline summarization across different domains, e.g., biographies (Chen et al., 2019), narratives (Barros et al., 2019), and news headlines (Steen and Markert, 2019), where the human-written summaries are directly collected from the web. The goal of abstractive summarization is to generate a set of sentences summarizing the context of interest without taking the exact words or phrases from the original text, but rather by combining them and summarizing the important content. To the best of our knowledge, CrisisLTLSum is the first to provide human-written summaries for crisis event timelines collected from a noisy social media stream. Recent research (Nguyen et al., 2018) has also investigated tweet-based summarization in other domains, which essentially does not reflect the challenges of summarizing an evolving event.

CrisisLTLSum Collection
This section presents our semi-automated approach to collect CrisisLTLSum.We first extract clusters of tweets as noisy timelines and then refine them via human annotation to get clean timelines that only include non-redundant, informative, and relevant tweets.

Noisy Timeline Collection
Figure 2 shows the process for generating a set of noisy timelines starting from the Twitter stream and followed by pre-processing and knowledge enhancement steps, the online clustering method, and post-processing & cleaning steps.
Location, Time, and Keywords Filtering We limit the incoming tweets to specific geographical areas, periods, and domains of interest. The location filtering relies on a list of location candidates created by gathering cities, towns, and well-known neighborhoods in a larger area of interest. A tweet is considered relevant to our area of interest if 1) the text mentions one of the candidates, 2) the geo-tag matches the area of interest, or 3) the user location matches the area of interest. To limit the tweets to a specified crisis domain, we curate domain-specific keywords and only select tweets with phrases matching one of the keywords. This approach is not comprehensive or exhaustive, but it is reasonably representative of each crisis domain; improving it to be more encompassing is an area for future research. The combinations of (area a, domain d, time t) are manually selected so that events of type d at location a are more frequent during time period t. For instance, wildfire events are most likely to happen in California from May through August, while the same type of event is more likely from December to March in Victoria (Australia). More details, with examples of curated keywords, can be found in Appendix A.
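To make these filtering criteria concrete, the following Python sketch implements the three-way area check and the keyword match. The tweet-field names (geo_area, user_location) are hypothetical stand-ins for the Twitter API payload, and the logic is an illustration rather than our production filter:

```python
import re

def is_relevant(tweet, location_candidates, domain_keywords):
    """Keep a tweet if it matches the area of interest (text mention,
    geo-tag, or user location) AND contains a domain-specific keyword."""
    text = tweet["text"].lower()
    in_area = (
        any(loc.lower() in text for loc in location_candidates)
        or tweet.get("geo_area") in location_candidates
        or tweet.get("user_location") in location_candidates
    )
    on_topic = any(
        re.search(r"\b" + re.escape(kw.lower()) + r"\b", text)
        for kw in domain_keywords
    )
    return in_area and on_topic
```

A tweet mentioning "Fresno" and "wildfire" would pass for a (California, wildfire) configuration, while the same wildfire tweet geo-located elsewhere would not.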
Entity Extraction This step aims to extract entity mentions from the tweet text and provide additional information that can be used to help identify related tweets. We use three different modules. First, we use a pre-trained neural model from AllenNLP (Gardner et al., 2018), trained on CoNLL03 (Tjong Kim Sang and De Meulder, 2003), to extract entities of type person, location, and organization. Although this module extracts some important entities in the text, it fails to extract uncommon entities or special mentions such as the name of a wildfire. To address this, similar to prior research (Zheng and Kordjamshidi, 2020), we further exploit the extractions from OpenIE (Stanovsky et al., 2018) and select the noun arguments with fewer than ten characters as entities. Lastly, we add the tweet's hashtags to the entity set. Since location mentions are crucial for extracting local events and existing models perform poorly at detecting them in noisy tweet text, we further developed a BERT-based NER model tuned on Twitter data to detect location mentions.
Location Augmentation We use the OpenStreetMap API (https://www.openstreetmap.org/) to map location mentions to physical addresses. This step provides complementary information about each location while reducing the noise introduced by the entity extraction module, by removing location mentions that are wrongly detected or are not located in the area of interest. This is especially important since our focus is on local events happening at specific locations.
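The augmentation and pruning logic can be sketched as follows. Here a hypothetical geocoded dictionary stands in for the OpenStreetMap API response, and a bounding-box test is an illustrative stand-in for our area-of-interest check:

```python
def augment_locations(mentions, geocoded, bbox):
    """Keep only location mentions that geocode to a point inside
    bbox = (min_lat, min_lon, max_lat, max_lon); mentions with no
    physical address are treated as detection errors and dropped."""
    kept = {}
    for mention in mentions:
        coords = geocoded.get(mention)
        if coords is None:
            continue  # wrongly detected mention: no address found
        lat, lon = coords
        if bbox[0] <= lat <= bbox[2] and bbox[1] <= lon <= bbox[3]:
            kept[mention] = coords
    return kept
```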
Online Clustering This step aims to mimic the real-life scenario where tweets are sequentially fed into a clustering algorithm (Wang et al., 2015). We also choose this method because it is much faster than retrospective clustering methods (where all data is available at the same time) for a large pool of input data. Here, the clustering objective is to group tweets related to the same local event, such as a "fire in building A" or a "wildfire in a specific area". The online clustering method utilizes a custom similarity metric that combines the similarity of the entities, the closeness of the locations in the real world, and the existence of shared hashtags. Algorithm 1 shows the similarity computation between two tweets. The smallest_distance function computes the minimum physical distance between location mentions given their augmented real-world locations (the output of the location augmentation step). As the distance between higher-level location mentions such as state/city/country is always zero, we simply ignore those location types. The find_matching_entities function follows the ideas of Faghihi et al. (2020) on creating a unique matching matrix, which we use to extract the top matching pairs of entities from two tweets. Here, each entity can only be paired once, with the highest matching-score entity from the other tweet. The min_dist, max_dist, s_hashtag, and s_dist values are hyperparameters of the clustering algorithm, which we tuned using only heuristics and a small set of runs. The pre-processed set of tweets is passed to the online clustering method, one tweet at a time. For each new tweet, similarity scores are computed between the new tweet and all cluster heads. The new tweet is added to the highest matching-score cluster whose similarity score is higher than sim_threshold and whose last update is less than time_threshold away from the new tweet. If these criteria are met for none of the clusters, a new cluster is created based on the new tweet. During this process, we remove inactive clusters whose last update was at least expiration_threshold minutes ago and which have fewer than tweet_threshold tweets. A cluster head is always the tweet with the most entity mentions; in case of a tie, the more recent tweet becomes the head of the cluster. The hyperparameters of this method are listed in Appendix A.
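To make the clustering procedure concrete, the following sketch reimplements the similarity scoring and the online loop under simplifying assumptions: entities are matched by exact set intersection rather than the matching matrix of Faghihi et al. (2020), cluster expiration is omitted, and all weights and thresholds are illustrative values, not our tuned hyperparameters:

```python
import math

def smallest_distance(locs_a, locs_b):
    """Minimum great-circle distance (km) between any pair of geocoded
    location mentions from the two tweets."""
    def haversine(p, q):
        lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(h))
    return min(haversine(p, q) for p in locs_a for q in locs_b)

def similarity(t1, t2, s_hashtag=0.5, s_dist=1.0, max_dist=5.0):
    """Combined score: shared hashtags, overlapping entities, and
    physical closeness of the mentioned locations."""
    score = 0.0
    if set(t1["hashtags"]) & set(t2["hashtags"]):
        score += s_hashtag
    shared = set(t1["entities"]) & set(t2["entities"])
    score += len(shared) / max(len(t1["entities"]), len(t2["entities"]), 1)
    if t1["locations"] and t2["locations"]:
        if smallest_distance(t1["locations"], t2["locations"]) <= max_dist:
            score += s_dist
    return score

def online_cluster(stream, sim_threshold=1.0):
    """Assign tweets one at a time to the best-matching cluster head,
    or open a new cluster; the head stays the most entity-rich tweet."""
    clusters = []
    for tweet in stream:
        best, best_score = None, sim_threshold
        for c in clusters:
            s = similarity(tweet, c["head"])
            if s >= best_score:
                best, best_score = c, s
        if best is None:
            clusters.append({"head": tweet, "tweets": [tweet]})
        else:
            best["tweets"].append(tweet)
            if len(tweet["entities"]) > len(best["head"]["entities"]):
                best["head"] = tweet
    return clusters
```

Two tweets sharing a hashtag, an entity, and nearby geocoded locations easily clear the threshold, while an unrelated tweet opens its own cluster.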
Cluster Post-Processing We apply three post-processing steps to improve the quality of the generated clusters. First, we manually merge pairs of clusters with a cluster-head similarity higher than a threshold head_min. This step compensates for some of the errors from missing entities in the pre-processing step, which affect the intermediate similarity scores in the clustering algorithm. Second, we use a simple fuzzy sequence-matching technique to remove identical or similar tweets inside each cluster. Third, we train a BERT-based (Devlin et al., 2019) binary classifier to detect informative content, which is used to prune out noisy tweets that do not include crisis-relevant information. This classifier is trained on the available labeled data on tweet informativeness (Alam et al., 2021a). Since most of the available tweets in Alam et al. (2021a) are specific to the storm and wildfire domains and there are no representative subsets for our other domains of interest (traffic, local fire), we only apply this step to the clusters generated for the storm and wildfire categories. These post-processing steps do not aim to prune out all the noisy information but rather to provide a better starting point for our next steps.
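The fuzzy deduplication of the second step can be sketched with Python's difflib; the exact matcher and threshold we used may differ:

```python
from difflib import SequenceMatcher

def dedup_cluster(tweets, threshold=0.9):
    """Greedily keep a tweet only if it is not near-identical (fuzzy
    ratio >= threshold) to any tweet already kept in the cluster."""
    kept = []
    for text in tweets:
        if all(SequenceMatcher(None, text.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(text)
    return kept
```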

CrisisLTLSum Human Annotation
Taking the noisy timelines generated in the previous step, we leverage human annotation to refine them into clean timelines and to summarize them. We, the authors of this work, manually selected 1,000 clusters that contain enough tweets describing how a crisis event evolves, while specifying the "seed tweet" (i.e., the first observed post that describes the ongoing event) of each timeline. The detailed process is presented in Appendix B. The selected clusters cover events mainly from four crisis domains: wildfire, local fire, storm, and traffic. More data statistics are shared in Section 4.
Procedure We use the Amazon Mechanical Turk (MTurk) platform to label and refine the noisy clusters into clean timelines and to collect the summaries. We split the annotation into multiple batches of Human Intelligence Tasks (HITs), where each batch contains timelines from the same domain. Each HIT contains three noisy timelines, and we collect annotations from three different workers on each. The workers are given the seed tweet and the subsequent tweets sorted by time, and are asked to read the tweets one by one and answer i) whether the tweet should be part of the timeline, and ii) if not, what the reason is.
A tweet is labeled as part of the timeline only if it satisfies all of the following three conditions: • relevant: talks about the same event indicated in the seed tweet; • informative: provides facts about the event rather than only personal points of view about the ongoing event; • non-redundant: adds new information or updates compared to the preceding tweets in the timeline. After reviewing all the tweets, the worker is finally asked to write a concise summary describing how the event progresses over time. Detailed instructions and annotation workflows are presented as Figures 8-14 in Appendix E.
Annotation Workflow & Quality Control Following prior quality control practices (Briakou et al., 2021), we use multiple quality control (QC) steps to ensure the recruitment of high-quality annotators. First, we use a location restriction (QC1) to limit the pool of workers to countries where native English speakers are most likely to be found. Next, we recruit annotators who pass our qualification test (QC2), in which we ask them to annotate 3 timelines. We run several small pilot tasks, each with a replication factor of nine. We check annotators' performance on the timeline extraction task against experts' labels, and experts manually review (QC3) the annotators' summary quality. Only workers passing all the quality control steps contribute to the final task. During the final task, we perform regular quality checks (QC4) and only use workers who consistently perform well.
Compensation We compensate the workers at a rate of $3 per HIT. Each batch of tasks is followed by a one-time bonus that brings the final rate to over $10 per hour.

CrisisLTLSum Statistics & Analysis
In this section, we cover comprehensive statistics and analysis of CrisisLTLSum to further elaborate on the statistical characteristics of our dataset.

Dataset Statistics
Of the 1,000 annotated timelines (10,610 tweets) in CrisisLTLSum, 423 (42%) are about wildfires, 287 (29%) are about traffic, 155 (15%) are about local fires, 109 (11%) are about storms, and 26 (3%) are about other types of events (e.g., building collapses). To produce the ground truth on whether each tweet is part of the timeline or not, we take the majority label among the three workers. 4,303 (41%) tweets are labeled as part of the timeline, whereas 6,307 (59%) are not. Out of all timelines, 110 (11%) only include tweets that are part of the timeline, whereas 68 (7%) do not include any. Table 1 presents the statistics across the crisis event domains.

Dataset Analysis
Timelines by Length Table 2 presents the length (i.e., number of tweets) distribution of the timelines. The majority of the timelines (447, or 45%) have five tweets or fewer. This is observed across all crisis domains except for storm events, where most timelines (43 out of 109) are 6 to 12 tweets long. It is worth noting that our dataset includes long timelines of 26 or more tweets, constituting 9% (94 timelines) across all domains. Figure 3 presents the average percentage of part-of-timeline tweets based on the aggregated length distributions of the timelines. We notice an interesting trend across the domains: the longer the timeline, the lower the average percentage of tweets that are part of the timeline.
Annotation Quality To measure the agreement rate between workers on the timeline extraction task, we consider the annotations of the two out of three workers who agree the most per timeline. The average timeline-level agreement between those two workers is 90.06%. We also perform a deeper analysis by comparing the workers' annotations on 20% of the timelines against the annotations of experts. To do so, we evaluate the majority label provided by the three workers against the label provided by the experts. The average timeline-level agreement rate of this analysis is 91.77%.
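The aggregation and agreement computations above can be sketched as follows, with binary labels (1 meaning the tweet is part of the timeline):

```python
from collections import Counter

def majority_labels(annotations):
    """Per-tweet majority vote over the workers' label lists."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*annotations)]

def timeline_agreement(labels_a, labels_b):
    """Fraction of tweets in one timeline on which two label lists agree."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
```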

Dataset Splits
To aid reproducibility when using our dataset in various research experiments, we divided it into training (TRAIN: 70%, or 706 timelines), development (DEV: 10%, or 86 timelines), and testing (TEST: 20%, or 208 timelines) splits. The splits are created via stratified sampling based on the event crisis domains and the lengths of the timelines (i.e., number of tweets). Detailed statistics are available in Appendix D.
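A minimal sketch of such stratified sampling, assuming for illustration a coarse two-way length bucketing (our actual bucketing may differ):

```python
import random
from collections import defaultdict

def stratified_split(timelines, ratios=(0.7, 0.1, 0.2), seed=0):
    """Group timelines by (crisis domain, length bucket), then split
    each group by the TRAIN/DEV/TEST ratios."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for t in timelines:
        bucket = "short" if len(t["tweets"]) <= 5 else "long"
        groups[(t["domain"], bucket)].append(t)
    train, dev, test = [], [], []
    for group in groups.values():
        rng.shuffle(group)
        n = len(group)
        a = round(n * ratios[0])
        b = a + round(n * ratios[1])
        train += group[:a]
        dev += group[a:b]
        test += group[b:]
    return train, dev, test
```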

Experiments
Here, we first formalize the definitions of the downstream tasks. We then propose a set of naive and more advanced models to serve as baselines for these tasks. Next, we analyze the performance of the baselines on each task and provide an error analysis for the best-performing baseline to indicate the major remaining challenges of this dataset for future research.

Task Definitions
Given a crisis event timeline consisting of n tweets T = [t_0, t_1, ..., t_n], where t_0 is the seed tweet, i.e., the first observed post that describes the ongoing event, we define two tasks: Timeline Extraction and Timeline Summarization. Timeline Extraction: Given an initial seed tweet t_0 and the following chronologically sorted tweets t_i, i = 1, ..., n, the goal of the timeline extraction task is to determine whether each tweet t_i is part of the timeline, i.e., related to the same event and adding new information compared to all prior tweets [t_0, ..., t_{i-1}]. Up to time n, a clean timeline T_extracted = [t_0, t_{e_1}, t_{e_2}, ..., t_{e_m}], m ≤ n, is extracted to describe how the event progresses over time.
Timeline Summarization: Similar to previous work by Chen et al. (2019), this task aims to generate a summary Ŷ = {ŵ_1, ..., ŵ_k} with a sequence of words ŵ_i that concisely describes the crisis event and its evolution, given the output T_extracted of the timeline extraction task. In particular, it optimizes the model parameters to maximize the probability P(Y | T_extracted) given ground-truth summaries Y = {w_1, ..., w_k}.

Timeline Extraction
Naive Baseline We employ a simple and naive baseline that assigns the label of the majority class observed in TRAIN to detect whether a tweet is part of the timeline or not.
Sequence Classification We leverage pre-trained language models to build sentence-level classifiers. We construct a list of tweet sequences S by concatenating, for each tweet t_i, the preceding tweets that are part of the timeline together with t_i; therefore, each timeline of n tweets yields n sequences. For our first sentence-level classification model, we fine-tune BERT (Devlin et al., 2019) on the tweet sequences, where every training example has the form "t_0 [SEP] s_i", with s_i ∈ S. We fine-tune this model using Hugging Face's Transformers (Wolf et al., 2020).
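A sketch of this sequence construction, under the assumption that the context for each candidate tweet consists of the preceding in-timeline tweets (the exact formatting of the model inputs may differ):

```python
def build_sequences(timeline, labels):
    """Build "t0 [SEP] context + candidate" classification examples.
    timeline[0] is the seed tweet; labels[i] == 1 means tweet i is
    part of the timeline."""
    seed, rest = timeline[0], timeline[1:]
    examples, context = [], []
    for tweet, label in zip(rest, labels[1:]):
        s_i = " ".join(context + [tweet])
        examples.append((seed + " [SEP] " + s_i, label))
        if label == 1:  # only in-timeline tweets extend the context
            context.append(tweet)
    return examples
```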
Sequence Labeling Given the sequential nature of timelines, we treat the timeline extraction task as a sequence labeling problem. We create a model by adding a GRU (Cho et al., 2014) on top of BERT. Given a timeline T = [t_0, t_1, t_2, ..., t_n], we feed every tweet t_i to BERT and get its contextual representation corresponding to the [CLS] token. We then concatenate the representations of all the tweets and feed them to the GRU to predict whether each tweet is part of the timeline or not.

Timeline Summarization
Naive Baselines We define three simple models as the first set of timeline summarization baselines: 1) first-tweet: uses the first tweet of the timeline as the output summary; 2) last-tweet: uses the last tweet of the timeline as the output summary; and 3) random-tweet: uses a random tweet from the timeline as the output summary.
Seq2Seq Models We further benchmark two pre-trained sequence-to-sequence (Seq2Seq) models: BART (Lewis et al., 2020) and DistillBART (Shleifer and Rush, 2020). We chose these two models as they achieve strong results on various summarization datasets. For BART, we fine-tune the base pre-trained model directly on our data, while for DistillBART we use a version that has already been fine-tuned on both the XSum (Narayan et al., 2018) and CNN/DM (Hermann et al., 2015) datasets. We use the fine-tuned DistillBART model in two settings: 1) further fine-tuning on CrisisLTLSum; 2) a zero-shot setting. We use Hugging Face's Transformers to fine-tune both of these models.
The input to the Seq2Seq models is the concatenation of all tweets that are part of the timeline. Since each timeline in our dataset is annotated by three workers (§3.2), it has three summaries. To adapt the timeline summarization task to this setting, we pick the two summaries written by the two workers who agree the most on the timeline extraction labels they assign to the tweets in the timeline. This reduces the variance between the summaries regarding their coverage of the crisis event described in the in-timeline tweets. During fine-tuning, we double the examples in TRAIN by repeating each timeline twice, once for each summary. We describe all training settings and hyperparameters in Appendix C.
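The reference-selection and example-doubling logic can be sketched as follows; the data layout (pairs of label lists and summary strings per worker) is a hypothetical illustration:

```python
from itertools import combinations

def pick_reference_summaries(workers):
    """workers: list of (labels, summary) per annotator. Return the
    summaries of the two workers whose extraction labels agree most."""
    best_pair = max(
        combinations(workers, 2),
        key=lambda pair: sum(a == b for a, b in zip(pair[0][0], pair[1][0])),
    )
    return [summary for _, summary in best_pair]

def make_training_examples(timeline_tweets, labels, workers):
    """One source (concatenated in-timeline tweets), two targets:
    each timeline yields two training examples."""
    source = " ".join(t for t, l in zip(timeline_tweets, labels) if l == 1)
    return [(source, summary) for summary in pick_reference_summaries(workers)]
```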

Timeline Extraction
Table 3 presents timeline extraction results on the DEV set. For evaluation, we use the average timeline-level accuracy. The naive majority-class baseline marks every tweet as not part of the timeline and achieves 48.57 average timeline accuracy. This is expected, since 10 timelines in the DEV set only contain tweets that are part of the timeline (§4.1). The BERT-based sequence classification model achieves 74.64 average timeline accuracy, beating the BERT-GRU model, which achieves 65.86. Although the BERT-based sequence classification model has a limitation when extracting long timelines due to its limit of 512 positional embeddings, it performs better than the BERT-GRU sequence labeling model. We attribute this to: 1) the careful preprocessing we performed when constructing the sequences used to train the BERT-based sequence classification model, and 2) the limited data size, which might not enable the BERT-GRU model to fully capture the sequential relationship across the tweets in the timelines.
For timeline extraction results on the TEST set, we compare both the human-level performance and the best model's performance against the experts' annotations. To get the human-level performance on the TEST set, we average the performance of the two workers who agree the most across all timelines in TEST. The best timeline extraction model achieves 73.51 average timeline accuracy, whereas the human-level performance is 88.98. This highlights the difficulty of the timeline extraction task and indicates that more involved models are needed to close the gap between model and human-level performance.

Timeline Summarization
We evaluate all timeline summarization models with ROUGE (Lin and Och, 2004). The summarization output of each model is evaluated in a multi-reference setting against the two summaries written by the two workers who agree the most on the timeline extraction labels. We use SacreROUGE (Deutsch and Roth, 2020) to compute multi-reference ROUGE scores.
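For illustration, the following is a simplified single- and multi-reference ROUGE-1 F1 computation; the scores we report are computed with SacreROUGE, which handles tokenization, normalization, and aggregation differently:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 on whitespace tokens (simplified)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def multi_ref_rouge1(candidate, references):
    """Common multi-reference convention: maximum over references."""
    return max(rouge1_f1(candidate, ref) for ref in references)
```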
Table 4 presents the F1 scores of ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) for the timeline summarization models on the DEV set. The three naive baselines (first tweet, last tweet, and random tweet) achieve comparable performance, with R1 being the highest compared to R2 and RL. This is expected, since the tweets in the timeline and the reference summaries (also tweets) all describe the same event, which leads to considerable unigram overlap (i.e., R1).
For the pre-trained Seq2Seq models, we present results in two settings: 1) using the oracle timeline, and 2) using the timeline extracted by the best timeline extraction model. In the oracle setting, we use the gold labels in the DEV set to provide the summarization models with the tweets that are part of each timeline. The results of this oracle experiment estimate an upper bound for the summarization models on this task. As shown in Table 4, the fine-tuned BART model achieves the best performance in terms of R1, R2, and RL. When we instead apply the best timeline extraction model to identify the tweets that are part of the timeline, we observe consistent conclusions. On the TEST set, the best summarization model (i.e., BART) achieves 47.05, 25.40, and 35.90 in R1, R2, and RL, respectively.

Human Evaluation
We further conducted a human evaluation to better assess the quality of the summaries. We take the whole TEST set and evaluate eight summaries per timeline: 1) the three human-written summaries; 2) three model-generated summaries from the three models we develop for timeline summarization in the oracle setting; and 3) two summaries from random systems: a random tweet that is part of the timeline, and a summary from a randomly selected timeline that belongs to a different crisis domain. The random systems add naive baselines and also serve as an additional check on the annotation quality. Following a process similar to §3.2, we recruit a group of workers from MTurk with the same location restriction (QC1) who pass our pilot study as a qualification test. Each summary was evaluated by five different workers on a scale from 1 to 5 across four axes: Coherence, Accuracy, Coverage, and Overall quality, as was done by Lai et al. (2022). Detailed instructions and annotation workflows are presented as Figures 15-21 in Appendix E. Table 5 presents the results; statistical significance is assessed with the Mann-Whitney U test (Mann and Whitney, 1947). Looking at the results over all the timelines, the human-written summaries are significantly better than the model-generated ones in terms of overall quality, with an average rating of 4.12. The human-written summaries are also better in terms of accuracy; however, this difference was not statistically significant compared to the average accuracy ratings assigned to the model-generated summaries. The model-generated summaries have higher average ratings in terms of coherence and coverage, but this difference is also not statistically significant. Notably, the gap in ratings becomes larger for longer timelines (i.e., timelines with 6 or more tweets): the human-written summaries are significantly better in terms of accuracy, coverage, and overall quality, and on par with the model-generated summaries in terms of coherence. This highlights the shortcomings of the summarization models when it comes to long timelines.

Error Analysis
We conduct an error analysis over the outputs of the best models for timeline extraction and summarization on DEV. For timeline extraction, 66 timelines (77%) have at least one error, with 272 out of 958 tweets (28%) labeled incorrectly. Table 6 presents timeline extraction and summarization results of the best models for different timeline lengths over the DEV set. We observe that model performance decreases with increasing timeline length for both tasks, with significant drops in accuracy and ROUGE scores as timelines get longer.

Moreover, we manually inspected some of the generated summaries and noticed that most summarization errors were due to hallucinations or to copying specific sentences present in the timeline without covering all the important event details the timeline describes. For instance, for the timeline in Figure 1, the best baseline summarization model (i.e., BART) generated the following summary: "A large fire burning in northeast Fresno near Woodward Lake sent plumes of smoke into the air above the city. Officials say the fire started as a commercial fire". Both sentences in the generated summary are copied verbatim from tweets in the timeline. This highlights the need for better models to capture the important details mentioned in the timeline.

Conclusion
We presented CrisisLTLSum, the first dataset of local crisis timelines extracted from Twitter and the first to provide human-written summaries of information extracted from Twitter. We showed that CrisisLTLSum supports two downstream tasks: timeline extraction and timeline summarization. Our experiments with SOTA baselines indicate that both tasks are challenging and encourage future research. Our dataset further provides a resource for developing methods that utilize microblogs to aid first responders in evaluating ongoing crisis events. In the future, we plan to explore models that solve both tasks jointly, extracting new information at each update point and then summarizing it. The dataset can also be expanded with additional annotation to enable abstractive, entity-based understanding of the event flow (Mishra et al., 2018; Faghihi and Kordjamshidi, 2021).

Limitations
This study has some limitations in the dataset generation workflow. First, our noisy timeline collection process is not a comprehensive, exhaustive extraction of all available information about a local crisis event; our focus is to provide a representative dataset rather than a complete set of all local crisis events and their updates. Second, the proposed noisy timeline collection pipeline is highly dependent on the performance of the entity extraction modules, especially location extraction, and on the accuracy of the OpenStreetMap API in resolving each location mention to a physical address. Accordingly, replicating the same process for other languages or locations may be difficult because of these dependencies. Furthermore, the online clustering method used in this paper has a set of hyperparameters that were tuned heuristically from a small set of experiments; more comprehensive, large-scale tuning experiments could potentially improve the quality of the generated timelines. Finally, reproducing our noisy timeline collection process is limited in general by users' access to the public Twitter stream and by changes in the available posts (they may become restricted or deleted).
We collected tweets posted in seven states in the USA and two states of Australia during 2020 and 2021. The selected states in the USA include New York, Illinois, Massachusetts, California, Texas, and Florida. From Australia, we selected the states of New South Wales and Victoria. These states were selected because they are among the areas most subject to events falling into our crisis domains of interest. California, Victoria, and New South Wales were mainly selected due to the abundance of wildfires in hot seasons. Texas and Florida were selected as they are subject to both wildfires and storm events during different months. Massachusetts and Illinois were selected first because those areas are frequently subject to bad weather, and second because, alongside New York, they contain big cities subject to traffic and local fire events.

A.2 Keyword Filtering
This step selects the subset of filtered tweets related to a specific crisis category of interest. Ideally, this task could be performed by a neural model trained on a large set of tweets labeled with categories. However, as such a large amount of labeled data is unavailable, we rely on the common approach of designing keyword lists. We carefully curate a keyword list for each of our categories of interest, employing expert knowledge gathered by reviewing crisis stories in news and social media and big disaster events of the past. Table 7 shows some example keywords used in each of the crisis domains in CrisisLTLSum. Note that these lists are generic for each category and not specific to unique events. For instance, instead of defining keywords specific to an event called "Hurricane Ida", our keyword list includes phrases such as "fallen tree", "building collapse", or "storm". The quality of the keyword list is crucial to the final quality of the generated timelines, and we polished the lists multiple times based on small sets of experiments before using them for the final task. Each keyword can be a single word or multiple words. To avoid making the keyword list overly long and to ensure that different lexical forms of the same word are still matched, we maintain both a lower-cased and a lemmatized version of the keywords and of the tweet's text. If any keyword exists in the text in any of these forms, the tweet is considered related to the category. Multi-word keywords are not treated as n-grams but as an indication that all the words in the keyword should exist in the text, even if not as a contiguous sequence. This approach is not comprehensive or exhaustive but rather representative of each crisis domain; improving it to be more encompassing is an area for future research.
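The matching logic above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paper does not specify its lemmatizer, so a naive suffix-stripping normalizer stands in for it, and the keyword lists are abbreviated examples.

```python
def normalize(text):
    """Lower-case and crudely strip inflectional suffixes.

    A stand-in for the lemmatized form described in the paper; the
    actual lemmatization tool is not specified, so this is an
    assumption for illustration only.
    """
    tokens = []
    for tok in text.lower().split():
        tok = tok.strip(".,!?;:#@\"'()")
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        tokens.append(tok)
    return tokens


def matches_category(tweet, keywords):
    """Return True if any keyword matches the tweet.

    Multi-word keywords require all their words to appear somewhere
    in the tweet, not necessarily as a contiguous sequence.
    """
    tweet_tokens = set(normalize(tweet))
    for keyword in keywords:
        if all(tok in tweet_tokens for tok in normalize(keyword)):
            return True
    return False
```

Note that the multi-word keyword "fallen tree" would match a tweet like "A tree has fallen across the road" even though the two words are not adjacent, reflecting the non-n-gram matching described above.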

A.3 Clustering Hyper-Parameters
We use the following parameters shared among all domains: s_hashtag is set to 0.2 and s_dist to 0.3. The sim_threshold and time_threshold are set to 0.7 and 15 hours, respectively. For the traffic domain, the time_threshold is reduced to 3 hours to avoid merging distinct accidents that occur at different times in the same location. The min_dist and max_dist are set to 0.4 and 4 kilometers for local fire and traffic events, and to the larger range of 0.4 and 10 kilometers for wildfire and storm extraction. The expiration_threshold is set to 15 hours and the tweet_threshold to 4.
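For clarity, the shared and per-domain values above can be organized as one configuration mapping. The assignment of the reduced 3-hour time threshold to the traffic domain follows our reading of the prose (it is motivated by distinct accidents at the same location); the dictionary structure itself is illustrative.

```python
# Clustering hyperparameters from Appendix A.3, organized per domain.
SHARED = {
    "s_hashtag": 0.2,
    "s_dist": 0.3,
    "sim_threshold": 0.7,
    "expiration_threshold_hours": 15,
    "tweet_threshold": 4,
}

DOMAIN_OVERRIDES = {
    "local_fire": {"time_threshold_hours": 15, "min_dist_km": 0.4, "max_dist_km": 4},
    "traffic":    {"time_threshold_hours": 3,  "min_dist_km": 0.4, "max_dist_km": 4},
    "wildfire":   {"time_threshold_hours": 15, "min_dist_km": 0.4, "max_dist_km": 10},
    "storm":      {"time_threshold_hours": 15, "min_dist_km": 0.4, "max_dist_km": 10},
}


def params_for(domain):
    """Merge shared parameters with the domain-specific overrides."""
    return {**SHARED, **DOMAIN_OVERRIDES[domain]}
```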

A.4 Automatic and Manual Cluster Merge
The online clustering approach may fail to properly relate some tweets due to missing entities, the clustering method's hyperparameters, or differences in how various angles of the same story are described. To address this, we merge clusters by comparing all cluster heads with each other and combining those whose similarity score is higher than a threshold s_head. Additionally, we use human feedback to merge clusters whose similarity score is below s_head but higher than a second hyperparameter s_min_head. This process can compensate for a portion of the entities missed due to entity extraction errors or tweets' informal text.
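The two-threshold merge can be sketched as below. The paper does not specify the similarity function or the threshold values, so fuzzy string matching and the numbers 0.8 / 0.5 are placeholders for illustration only.

```python
from difflib import SequenceMatcher


def head_similarity(head_a, head_b):
    """Similarity between two cluster-head tweets.

    Placeholder: the paper's actual similarity function is not
    specified, so fuzzy string matching stands in for it here.
    """
    return SequenceMatcher(None, head_a.lower(), head_b.lower()).ratio()


def merge_decisions(heads, s_head=0.8, s_min_head=0.5):
    """Compare all cluster heads pairwise.

    Pairs scoring >= s_head are merged automatically; pairs between
    s_min_head and s_head are routed to a human for feedback.
    Threshold values are illustrative, not the paper's.
    """
    auto_merge, needs_review = [], []
    for i in range(len(heads)):
        for j in range(i + 1, len(heads)):
            score = head_similarity(heads[i], heads[j])
            if score >= s_head:
                auto_merge.append((i, j))
            elif score >= s_min_head:
                needs_review.append((i, j))
    return auto_merge, needs_review
```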

A.5 Duplicate Removal
Here, the goal is to remove tweets with identical or similar text from the same cluster. Duplicate removal relies on a fuzzy string sequence-matching technique to compute the similarity between a pair of tweets. We go over the tweets in chronological order and remove any that match a previous tweet in the same cluster with a matching score higher than d_match.
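A minimal sketch of this pass, using Python's standard-library fuzzy matcher; the exact d_match value used in the paper is not reported, so 0.9 below is an assumption.

```python
from difflib import SequenceMatcher


def remove_duplicates(tweets, d_match=0.9):
    """Remove near-duplicate tweets from a chronologically sorted cluster.

    A tweet is kept only if it does not fuzzily match any earlier kept
    tweet above d_match (the 0.9 default is illustrative, not the
    paper's value).
    """
    kept = []
    for tweet in tweets:  # tweets assumed sorted by timestamp
        is_duplicate = any(
            SequenceMatcher(None, tweet.lower(), prev.lower()).ratio() > d_match
            for prev in kept
        )
        if not is_duplicate:
            kept.append(tweet)
    return kept
```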

A.6 Noise Removal
Although the keyword filtering step reduces the number of tweets unrelated to the category of interest, this step aims to further remove unrelated clusters generated by the pipeline. To do so, we use a neural model that detects the informativeness of a tweet's text, trained on the available labeled data (Alam et al., 2021a). As the definition of informativeness in our domain differs from that of the data source, we define a mapping between their label set and informativeness as we use it: any fine-grained label related to personal emotions, prayers, or donations is removed from the informative set. Since this process is highly category-dependent and the existing labeled datasets do not cover all of our categories of interest, we can only apply this step to categories for which a representative subset exists in the available resources. We formulate this task as a Boolean tweet classification task, predicting tweet informativeness by applying a linear classification module on top of the aggregation token of a transformer-based language model.
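The label mapping step can be sketched as a simple lookup. The fine-grained label names below are illustrative assumptions in the style of crisis tweet datasets, not the exact label set of the source data used in the paper.

```python
# Fine-grained labels mapped out of the informative set, per the rule
# above: personal emotions, prayers, and donations are non-informative.
# Label names are hypothetical placeholders for illustration.
NON_INFORMATIVE = {
    "sympathy_and_support",       # personal emotions / prayers
    "donation_and_volunteering",  # donation requests and offers
    "not_humanitarian",
}


def is_informative(fine_grained_label):
    """Map a fine-grained humanitarian label to boolean informativeness."""
    return fine_grained_label not in NON_INFORMATIVE
```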

B Noisy Clusters Selection for Human Annotation
This section gives more details on how we select noisy clusters for human annotation. Our goal is to select 1,000 clusters that contain enough tweets describing an evolving crisis event. In particular, we first identify a "seed tweet" as the first observed post mentioning a local crisis event, then roughly check whether the following tweets in the cluster contain updates about the same event. Taking the example in Figure 1, we first see "a vegetation fire happened" in the seed tweet, then "plumes of smoke" and "windy condition impacts fire control". We select clusters across the four crisis domains (wildfire, local fire, storm, and traffic) depending on how frequently each type of event appears in the extracted noisy clusters.

B.1 Sample noisy timelines
Figure 4 shows the noisy timeline of the same annotated example from Figure 1. Figures 5, 6, and 7 show sample noisy timelines from the other domains in CrisisLTLSum.

C Detailed Experimental Setup
C.1 Timeline Extraction
BERT We fine-tune BERT base uncased on a single GPU for 10 epochs with a learning rate of 5e-5, a batch size of 32, a seed of 42, and a maximum sequence length of 512. At the end of fine-tuning, we pick the best checkpoint based on performance on the DEV set.
BERT-GRU For the BERT-GRU sequence labeling model, we use BERT base uncased to get the contextual representation for each tweet.The GRU has one layer with a hidden size of 128.The model was trained for 50 epochs with early stopping after five epochs if the performance did not improve on the DEV set.We use a learning rate of 5e-5, a batch size of 16, and a seed of 42.

C.2 Timeline Summarization
We fine-tune BART on a single GPU for ten epochs with a learning rate of 5e-5, a batch size of 16, a seed of 42, and a maximum target sequence length of 512. For DistilBART, we use the same hyperparameters. During inference, we use beam search with a beam size of 4. At the end of fine-tuning, we pick the best checkpoint based on performance on the DEV set.

D Detailed Data Splits
Detailed statistics on TRAIN, DEV, and TEST are shown in Table 8.

Figure 2 :
Figure 2: The process of noisy timeline collection. The outputs of this step are noisy clusters, which are used to create the dataset.

Figure 3 :
Figure 3: Average % of tweets that are part of the timelines, by aggregated timeline length across different crisis domains.

Figure 8 -
Figures 8-14 show the annotation interface for the Timeline Extraction & Summarization task. Figures 15-21 show the annotation interface for the Summarization Quality Estimation (QE) task. For the QE task, we first display the rating rubrics and examples to the workers, as shown in Figures 15-17. To ensure the workers have a good understanding of the QE dimensions listed in the rubrics, they are asked to pass a screening test (Figure 18) before they can access the quality rating part of the task interface.

Figure 7 :
Figure 7: Sample noisy timeline of a fire event.

Figure 8 :
Figure 8: The first page of our annotation interface for the Timeline Extraction & Summarization task, which contains the task introduction and a brief instruction.

Figure 9 :
Figure 9: Step 1 -2 of the annotation instructions in our annotation interface for the Timeline Extraction & Summarization task

Figure 16 :
Figure 16: Instruction page of the Summarization Quality Estimation Task part 2.

Figure 18 :
Figure 18: Instruction page of the Summarization Quality Estimation Task part 4.

Figure 20 :
Figure 20: Rating page of the Summarization Quality Estimation Task part 1.

Figure 21 :
Figure 21: Rating page of the Summarization Quality Estimation Task part 2.

Table 1 :
Data statistics across different crisis domains in terms of the number of timelines and tweets, broken down by whether tweets are part of the timeline or not.

Table 2 :
Dataset statistics by aggregated timeline length (i.e., number of tweets) across different crisis domains.

Table 3 :
Results of timeline extraction models on DEV.

Table 4 :
Results of different timeline summarization models after timeline extraction and in oracle settings on DEV.

Table 5 :
Results of the human evaluation on human- and model-generated summaries. [All] means all timelines in the TEST set and [6+] means those with length ≥ 6. Markers denote that the human summary is significantly (p < 0.05) better or worse than the best-performing model based on a one-sided Mann-Whitney U test.

Table 7 :
Sample keywords for each crisis domain in CrisisLTLSum.

Wildfire: wildfire, bushfire, campfire, volcano erupt, forest fire, ash fall, vegetation fire
Local fire: house fire, structure fire, building fire, gas leak, fire alarm, tower burned
Storm: tornado, shelter, storm damage, building collapse, storm collapse, roof damage, fallen tree
Traffic: traffic delay, road blocked, car fire, crash, injury, rollover, accident, stalled vehicle

Table 8 :
Aggregated data statistics based on crisis domains and timeline lengths of the TRAIN, DEV, and TEST splits.