Timeline Summarization based on Event Graph Compression via Time-Aware Optimal Transport

Timeline Summarization identifies major events from a news collection and describes them in temporal order, with key dates tagged. Previous methods generally first determine the key dates of events and then generate summaries separately for each date. These methods overlook the events' intra-structures (arguments) and inter-structures (event-event connections). Following a different route, we propose to represent the news articles as an event graph, so that summarization becomes compressing the whole graph to its salient sub-graph. The key hypothesis is that events connected through shared arguments and temporal order depict the skeleton of a timeline, containing events that are semantically related, temporally coherent and structurally salient in the global event graph. A time-aware optimal transport distance is then introduced for learning the compression model in an unsupervised manner. We show that our approach significantly improves on the state of the art on three real-world datasets, including two public standard benchmarks and our newly collected Timeline100 dataset.


Introduction
Timeline summarization (Yan et al., 2011a,b; Tran et al., 2015; Nguyen et al., 2014; Wang et al., 2016; Martschat and Markert, 2018; Steen and Markert, 2019) aims at generating a sequence of major news events with their key dates from a large collection of related news covering multiple perspectives (see Figure 1 for an example). The timeline summarization task poses several challenges to existing Natural Language Processing (NLP) techniques: (1) In contrast to multi-document summarization (MDS), which deals with tens of documents (Fabbri et al., 2019), it summarizes hundreds of long documents, requiring the model to efficiently maintain a joint representation of the entire news collection so that the summary's coverage and coherence are optimized globally.
(2) The summary is expected to select key dates and capture the temporal interdependency across key stories, which, compared to standard MDS, poses additional challenges in reconstructing temporal order. (3) Manual labeling of timeline summaries is costly; thus the labeled data for model training is very limited.
As a result, previous studies (Steen and Markert, 2019) usually take an unsupervised approach. Specifically, these methods first identify key dates from the publication time distribution. Then, for each key date and its associated news articles, a summary is generated based on salient sentences measured by the inter-similarity of these articles. In these methods, the document representations are limited to local text features, ignoring the global context of the news collection. Neural models, especially advanced pre-trained language models such as BERT (Devlin et al., 2019a) and GPT-2 (Budzianowski and Vulić, 2019), are restricted in both representation capacity and memory efficiency when handling the global context of inputs of this size.
We propose an event graph representation along with compression to address the difficulties of global graph contextualization, scalability, and time-awareness. Our solution consists of the following key ideas.

[Figure 1 here: dated timeline entries (2009-06-25 through 2009-07-22) from the reference and generated timelines about the investigation of Dr Conrad Murray, shown alongside the event graph produced by Information Extraction; caption below.]
Figure 1: Timeline summarization based on event graph compression. The example is a partial timeline about the investigation of Dr Conrad Murray over the death of Michael Jackson, describing that Michael Jackson is found unconscious, Dr Murray travels with him to hospital, and is later interviewed by police. Green triangles denote events and grey circles denote entities; italics mark the raw text mentions extracted. Black bold arrows represent the temporal order between events; grey arrows are event-entity argument edges and entity-entity relation edges. Coreferential events and entities are merged across documents. Faded nodes are events removed during summarization. In this example, we show the transport of node pairs ⟨i, j⟩ and ⟨i, k⟩ to the node pair ⟨i′, j′⟩ in the summary graph.

(1) Event graph representation: the event graph connects events through temporal order (e.g., interview BEFORE raid) and shared arguments. The graph structure enables the model to capture global long-distance inter-dependency between events across documents. (2) Unsupervised event graph compression with optimal transport (OT): We propose a new formulation of timeline summarization, selecting event nodes from the input graph to form a smaller summary graph. Under a given summary size constraint, a summary graph with high coverage has a small information loss compared to one with low coverage (Filatova and Hatzivassiloglou, 2004). We constrain the total number of event nodes to be kept in the summary, and optimize the summary graph to be close to the original graph using optimal transport. The training objective is to find the optimal transport plan between the input and summary graphs that has the minimal transport distance. Figure 1 shows an example of transporting node pairs in the input graph to the node pair ⟨die, interview⟩ in the summary graph.
⟨die, interview⟩ receives relatively large mass during the graph transport since it has a small distance to multiple node pairs in the input graph, such as ⟨die, speak⟩. To obtain the minimal distance with only m events kept, a global decision is learned to select events that are salient but also diverse. The summary graphs are generated by a differentiable compression model according to a compression-rate hyperparameter, instead of using annotated timelines. Thus, our objective allows model training in an end-to-end unsupervised way.
(3) Time-aware Gromov-Wasserstein distance: The distance between two graphs should capture the following criteria. i) Semantic relevance: each node first has its initial local context encoded via a pre-trained BERT model and node type embeddings. For example, the STARTPOSITION event is not closely related to the TRANSPORT event in Figure 1, though they have temporal dependencies. ii) Structural centrality: we employ a graph neural network to maintain a global context embedding by encoding the global structure topology, which enables events of high node centrality to gather comprehensive information from their neighbors. For example, although both are MEET events, interviewed (by police) is more structurally salient than speak: it encodes information not only from its neighbor events such as raid, but also from long-distance neighbors such as travel (to hospital) via the aforementioned argument paths. iii) Temporal coherence: we define a time-aware Gromov-Wasserstein distance over the temporal edges, and introduce a temporal regularizer that enlarges the distance between events with a wide time gap, such as the BORN and INJURE events in Figure 1, so that temporal coherence is captured. This enables the model to select temporally salient events that have temporal dependencies with multiple events in the news collection. Also, timeline summarization is sensitive to temporal ordering: the TRANSPORT (traveling in ambulance) before DIE in Figure 1 is more important to the story than the TRANSPORT (releasing body) after DIE. Hence, we distinguish the before and after events in the distance computation.
(4) New benchmark: Considering that current timeline summarization benchmarks are limited to certain topics, we collect a new dataset, Timeline100, with more test samples and wider topic coverage. Experiments on three datasets show that our approach significantly outperforms the baselines.

Overview
Our approach aims at finding the graph that has the minimal distance from the input graph (Filatova and Hatzivassiloglou, 2004), so that when only a limited number of nodes is selected, the summary graph has minimal information loss. Optimal transport solves exactly this problem by finding the best transport plan that minimizes the distance between two graphs. To apply optimal transport to timeline summarization, the key is to design a distance that evaluates the information loss, and thus we propose a time-aware optimal transport distance. Figure 1 gives an overview of our approach. It first extracts an event graph G from the input documents. We then encode the graph and perform graph compression to compress G into its summary graph S. Our time-aware optimal transport is applied to train the graph encoder and compression model, with the goal of keeping events that are semantically related, structurally salient, and temporally coherent.

Event Graph Construction
The event graph is a heterogeneous graph G, where nodes are events {v_i} and entities {e_j}, and edges comprise event-event temporal order edges, event-entity argument edges, and entity-entity relation edges. We apply OneIE (Lin et al., 2020), a state-of-the-art Information Extraction (IE) system, to extract entities, relations and events; we then perform cross-document entity and event coreference resolution (Pan et al., 2015, 2017) over the document cluster of each timeline topic. We apply (Ning et al., 2019) to extract temporal relations for events in the same paragraph or having shared arguments. For example, clashes happens before wound given the sentence fifty wounded are reported in the clashes. To obtain the date of each event, we extract and normalize time expressions using the publication date (Manning et al., 2014), and then apply (Wen et al., 2021) to extract the event temporal attributes from the context. If the temporal attributes cannot be determined from the context, we propagate the temporal attributes from neighbor events based on their shared arguments. After that, we use the document publication date to populate the remaining missing dates. For example, in Figure 1, the date 2009-06-25 of the collapse (DIE) event is extracted from the context last Thursday, and the date of the unconscious (INJURE) event is propagated along their shared argument Michael Jackson.
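The date-propagation step described above can be sketched as follows. This is an illustrative toy version, not the authors' code: the data structures (`events`, `args_of`) and the fallback logic are our assumptions based on the text.

```python
# Sketch: inherit missing event dates from shared-argument neighbors,
# then fall back to the document publication date.
from collections import deque

def propagate_dates(events, args_of, pub_date):
    """events: {event_id: date string or None}; args_of: {event_id: set of entity ids}."""
    known = deque(e for e, d in events.items() if d is not None)
    while known:
        e = known.popleft()
        for other in events:
            # propagate along a shared-argument link to undated events
            if events[other] is None and args_of[e] & args_of[other]:
                events[other] = events[e]
                known.append(other)
    for e, d in events.items():
        if d is None:              # no shared-argument neighbor with a date
            events[e] = pub_date   # use the publication date instead
    return events
```

For instance, an undated INJURE event sharing the argument Michael Jackson with a dated DIE event would inherit the DIE event's date, mirroring the Figure 1 example.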

Time-Aware Optimal Transport (OT)
Optimal Transport. We aim to generate the summary graph S that has the minimal OT distance from the input graph G:

D(G, S) = min_T Σ (T ⊙ C),

where ⊙ represents the Hadamard product. T ∈ R^{n×m}_+ denotes the transport plan, learned to optimize a soft node alignment between the two graphs: each node in G can be transferred to multiple nodes in S with different weights. We use T_{ii′} to denote the amount of mass shifted from node i in the input graph G to node i′ in the summary graph S, as shown in Figure 1. C ∈ R^{n×m} is the cost matrix of event nodes between the two graphs.

Time-Aware OT Distance. Considering that event graphs are heterogeneous graphs, and that timeline summarization is sensitive to temporal dependencies between events, we define the Gromov-Wasserstein distance (Xu et al., 2019) on temporal edges to calculate the distance between pairs of nodes within the two graphs, i.e., ⟨i, j⟩ in G and ⟨i′, j′⟩ in S:

D_gw(G, S) = min_T Σ_{i,j,i′,j′} |C_{ij} − C_{i′j′}| T_{ii′} T_{jj′}.

Figure 1 shows an example of transporting the edge ⟨i, j⟩ in the input graph to ⟨i′, j′⟩ in the summary graph. The cost |C_{ij} − C_{i′j′}| evaluates the intra-graph structural similarity between the two pairs of nodes ⟨i, j⟩ in G and ⟨i′, j′⟩ in S. To capture the direction of temporal ordering, we parameterize different matrices for the before and after nodes when computing C. In this way, although travel in Figure 1 and release are both TRANSPORT events connected to the DIE event, they are distinguished during distance calculation. Here, v_i and v_j are the node representations, and we want them to capture semantic relevance, structural salience and temporal coherence. As a result, we design an event graph encoder in §2.4 from these three aspects.

Temporal Regularizer. The OT distance between events should also capture temporal coherence. For example, in Figure 1, the BORN event and the INJURE event have a large time gap, so there should be a large distance between them, although they are directly connected in the graph.
As a result, we use a regularizer Ω(t_i, t_j), which grows with the time difference |t_i − t_j|, to penalize events that are far apart in time, where β ∈ (0, 1] is a hyper-parameter.
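A minimal numeric sketch of the time-aware Gromov-Wasserstein cost above follows. The quadratic term Σ |C_{ij} − C_{i′j′}| T_{ii′} T_{jj′} comes from the text; the regularizer form `beta ** -gap` is our assumption, chosen only because it is ≥ 1 and increasing in the gap for β ∈ (0, 1], as the description requires.

```python
# Sketch of the time-aware GW cost: structural mismatch over temporal
# edges, scaled by a temporal penalty growing with the time gap.
import numpy as np

def time_aware_gw_cost(C_g, C_s, T, t, beta=0.5):
    # C_g: (n, n) intra-graph costs of G; C_s: (m, m) of S; T: (n, m) plan;
    # t: event dates of G as numbers (days). Loops kept explicit for clarity.
    n, m = T.shape
    cost = 0.0
    for i in range(n):
        for j in range(n):
            if C_g[i, j] == 0:                    # only transport over edges
                continue
            omega = beta ** -abs(t[i] - t[j])     # assumed regularizer form
            for ip in range(m):
                for jp in range(m):
                    cost += abs(C_g[i, j] - C_s[ip, jp]) * T[i, ip] * T[j, jp] * omega
    return cost
```

A plan that maps each input edge onto a summary edge with identical cost incurs zero structural cost, which is the behavior the training objective rewards.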

Event Graph Encoder
In order to calculate the time-aware optimal transport distance, we encode both the input event graph and the summary graph to obtain node representations that capture text semantics and graph structure, and preserve the temporal information.

Semantics Encoding. To capture the local text semantics of an entity e or an event v, we apply pre-trained BERT (Devlin et al., 2019b) to initialize a contextualized embedding w from its text mentions. We use the average representation for nodes with multiple mentions, and concatenate it with the node type embedding φ, which is initialized by BERT using the type name. The frequency of events has been shown to be effective and critical for timeline summarization. As a result, we also append the number of its text mentions |w| to capture the event frequency in the news collection, giving the initial node representation [w; φ; |w|], where [;] denotes the concatenation operation.
Graph Encoding. After that, we employ an edge-wise graph neural network to contextualize all the nodes with their global graph contexts. We first generate edge type representations a and r by encoding the edge type names using pre-trained BERT; the temporal edge representation t is encoded using the name "before". A message passed through an argument edge ⟨v_i, a, e_j⟩ combines the representations of the two nodes and the edge type; the messages of relation and temporal edges are computed similarly, replacing a with r or t. We aggregate the messages using edge-aware attention following (Liao et al., 2019), where σ denotes the sigmoid function and a two-layer MLP with ReLU activation is adopted. The event node representation v_i is then updated using the messages from its local neighbors N(v_i); entity node representations are updated similarly.

Date Distribution Encoding. To encode the date distribution, for each event v_i with date t_i, we concatenate the above node representation v_i with the number of documents published on t_i, the number of events happening on t_i, and the number of event text mentions attached to t_i in the local context. This enables the OT distance to capture corpus-level date salience.
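The semantics-encoding step can be sketched as a simple concatenation. This is an illustrative assumption of the layout [w; φ; |w|] described above; the dimensions and the helper name `init_node` are not from the paper.

```python
# Sketch: initial node vector = mean mention embedding (w), type embedding
# (phi), and mention count (|w|) concatenated, per the description above.
import numpy as np

def init_node(mention_embs, type_emb):
    w = np.mean(mention_embs, axis=0)             # average over text mentions
    freq = np.array([len(mention_embs)], float)   # |w|: event frequency signal
    return np.concatenate([w, type_emb, freq])    # [w ; phi ; |w|]
```

The mention count gives frequent events a direct feature, so the OT cost can favor events mentioned across many documents.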

Differentiable Graph Compression
To get a summary graph with m event nodes, we apply an event graph compression matrix M ∈ R^{n×m} following (Ma and Chen, 2021), such that A_S = M^⊤ A_G M, where A_G ∈ R^{n×n} is the temporal edge adjacency matrix of event nodes in G, and A_S ∈ R^{m×m} is defined similarly for S. (We only compress the event nodes, since the key to timeline summarization is salient event selection, while arguments are used to capture the distance between events.) For the timeline summarization task, the parameterization of M has two requirements: (1) M must be differentiable to enable end-to-end training; (2) the nodes in the summary graph must come from the input graph (due to our extractive summarization goal). We therefore follow (Ma and Chen, 2021) and directly select nodes as summary nodes according to their weights α ∈ R^{n×1}:

α = σ(A V W_α),

where A ∈ R^{n×n} is the normalized graph adjacency matrix defined in graph convolutional networks (Kipf and Welling, 2017), V ∈ R^{n×d} is the node feature matrix, W_α ∈ R^{d×1} is a parameter vector, and σ is the sigmoid function.

We pick the top m values of α and list them in sorted order, denoted by α_s ∈ R^{m×1}. Similarly, A_s ∈ R^{n×m} is the column-sorted and picked version of A. The compression matrix M can then be defined as M = A_s ⊙ (1 α_s^⊤), where 1 denotes a column vector of all ones.
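The scoring-and-selection step can be sketched as follows. The GCN-style scoring α = σ(A V W_α) follows the text; the final parameterization of M as column scaling by α_s is our reading of the (garbled) original and should be treated as an assumption.

```python
# Sketch: score nodes with a one-layer GCN-style weighting, pick the top-m
# columns, and scale them by their scores to form the compression matrix M.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compress(A_norm, V, W_alpha, m):
    alpha = sigmoid(A_norm @ V @ W_alpha).ravel()  # (n,) node weights
    keep = np.argsort(-alpha)[:m]                  # indices of top-m nodes
    alpha_s = alpha[keep]                          # (m,) sorted scores
    A_s = A_norm[:, keep]                          # (n, m) picked columns
    M = A_s * alpha_s[None, :]                     # M = A_s ⊙ (1 alpha_s^T)
    return M, keep
```

Because α is produced by differentiable operations, gradients from the OT distance can flow back into W_α and the node features.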

Training Objective
The optimal T that solves D(G, S) = min_T Σ (T ⊙ C) can be approximated by the differentiable Sinkhorn-Knopp algorithm (Sinkhorn, 1964; Cuturi, 2013) following (Xu et al., 2019; Ma and Chen, 2021), with marginals p ∈ R^{n×1}_+ and q ∈ R^{m×1}_+. Starting with any positive vector b^0, Sinkhorn's algorithm performs the iteration

a^{i+1} = p ⊘ (K b^i),  b^{i+1} = q ⊘ (K^⊤ a^{i+1}),  with K = exp(−C/γ),

for i = 0, 1, 2, … until convergence, where ⊘ denotes element-wise division. A computational T^k = diag(a^k) K diag(b^k) can be obtained by iterating a finite number k of times. The parameterization of the graph compression step and the Sinkhorn-Knopp algorithm are differentiable, so we can optimize our time-aware optimal transport distance between the two graphs in an end-to-end manner.
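The Sinkhorn iteration above can be written compactly; this is the standard entropic-OT formulation, not the authors' exact code, with the regularization strength named `gamma` after the hyperparameter in the training details.

```python
# Standard Sinkhorn-Knopp iteration for entropic optimal transport:
# alternately rescale rows and columns of the Gibbs kernel K = exp(-C/gamma)
# until the plan T matches the marginals p (rows) and q (columns).
import numpy as np

def sinkhorn(C, p, q, gamma=1.0, iters=200):
    K = np.exp(-C / gamma)               # Gibbs kernel
    b = np.ones_like(q)
    for _ in range(iters):
        a = p / (K @ b)                  # element-wise division
        b = q / (K.T @ a)
    return a[:, None] * K * b[None, :]   # T = diag(a) K diag(b)
```

Every step is differentiable, which is what allows the transport distance to serve as an end-to-end training loss.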
The advantage of our approach is that the training process is unsupervised, since the summary graph is generated automatically under the constraint of the hyperparameter m, i.e., the number of event nodes in the summary graph. The model parameters include those for the graph encoder (capturing semantic relevance, structural centrality and time salience), the transport distance matrix (capturing temporal coherence), the compression model (selecting top ranked nodes in a differentiable manner), and the transport plan (making a global decision to obtain minimum distance). They are optimized jointly to minimize the distance between the generated graph and the input graph.

Extractive Summarization
During summarization, the event summary graph is generated by selecting m events according to the event weights α, where m is a hyperparameter decided by the expected compression rate. To maintain diversity along the temporal dimension, following previous work, we set a maximum event constraint that selects no more than k events for each date. Specifically, once the number of events for one date reaches this limit, the remaining events of that date are skipped in the ranking list α, and only events happening on other dates can be selected into the summary graph. For each date, k is decided by the date distribution (i.e., the number of events happening on each date), as well as the compression rate hyperparameter.
Finally, for each event v ∈ V_S in the summary graph, we extract an event summary sentence, i.e., the source sentence with the maximum event coverage. The event summaries are ordered by date to form the timeline. Event summaries on the same date are merged following the events' temporal order with a topological sort (Manber, 1989).
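The per-date selection cap described above can be sketched as a single pass over the ranking. For simplicity this sketch uses a constant cap `k` per date, whereas the paper derives k per date from the date distribution; treat the constant cap as an assumption.

```python
# Sketch: walk the ranked event list and skip events whose date has
# already reached the per-date cap, until m events are chosen.
def select_events(ranked_events, dates, m, k):
    """ranked_events: event ids sorted by weight alpha (descending)."""
    per_date, chosen = {}, []
    for e in ranked_events:
        d = dates[e]
        if per_date.get(d, 0) >= k:   # date is full; keep scanning the list
            continue
        chosen.append(e)
        per_date[d] = per_date.get(d, 0) + 1
        if len(chosen) == m:
            break
    return chosen
```

This keeps the global ranking intact while preventing one busy date from crowding out the rest of the timeline.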

Experimental Settings
Datasets. The evaluation is conducted on three datasets. Timeline17 and Crisis (Tran et al., 2015) are two widely used timeline summarization datasets. Timeline17 contains 17 topics, and each topic has 1-3 ground-truth timelines, resulting in 19 timelines in total. Crisis has 5 topics and each topic has 4-7 ground-truth timelines, with 22 timelines annotated in total. We use all 19 and 22 timelines as references, and calculate the average scores following previous work.
To explore the robustness of our event graph compression in different scenarios, we also collect a new, larger dataset, Timeline100, containing 100 timelines from news websites including VoA and Reuters. The timelines are written by journalists and are manually curated. The dataset covers various topics related to the economy, military, education, etc. The input documents for each timeline are selected using BM25 (Robertson et al., 1995). For each dataset, we construct input event graphs following §2.2. We use the ACE event ontology, with 7 entity types, 6 relation types, 33 event types, and 22 argument roles. For the (unsupervised) training of our event graph compression model, we use event graphs constructed from VoA news between 2011 and 2017 (Li et al., 2020a). The statistics are shown in Table 2.

Evaluation Metrics. We use the conventional metrics for timeline summarization to evaluate key date selection with Date F1 and content generation with ROUGE scores, including: (1) concat F1, which computes ROUGE by concatenating the summaries of all selected dates; (2) agree F1, which computes ROUGE only between summaries that share the same dates; (3) align F1, which first aligns summaries in the output with those in the reference based on content similarity and the distance between their dates, then computes the ROUGE score between aligned summaries; distant alignments are penalized.

Baselines. We compare with: (1) a typical extractive model based on sentence similarity; (2) the state-of-the-art extractive timeline summarization model based on submodular functions; (3) PacSum (Zheng and Lapata, 2019), the state-of-the-art unsupervised graph-based ranking summarization baseline, which uses BERT to encode sentences for centrality ranking in a sentence graph; we use the publication date of each selected sentence as its key date.
(4) SummPip (Zhao et al., 2020), the state-of-the-art unsupervised multi-document summarization baseline, which constructs a sentence graph and performs spectral clustering; a summary is then generated for each sentence cluster by multi-sentence compression, and we use the most frequent publication date of the sentences in each cluster as its key date. (5) "w/o temporal regularizer", an ablation that removes the temporal regularizer from the OT distance.

Training Details. The dimensions of the contextual embedding, type embedding, and edge embedding are 768. β is 0.5 and γ is 1. The ratio m of event nodes kept after compression is determined by the ratio of input graph size to summary graph size in each dataset: we use 0.05 for the Timeline17 dataset, 0.005 for the Crisis dataset, and 0.05 for the Timeline100 dataset. Due to the large size of the input graphs, we first compress the subgraph extracted from each publication date following a hard date cutoff, and then compress the graph of the entire corpus. The graph compression model is trained on one Tesla V100 GPU with 16GB DRAM.
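The Date F1 metric mentioned above can be sketched as set precision/recall over selected dates. This is our assumed definition for illustration; the actual evaluation toolkit may implement additional details.

```python
# Sketch: Date F1 as the harmonic mean of precision and recall over the
# sets of predicted and reference key dates.
def date_f1(pred_dates, ref_dates):
    pred, ref = set(pred_dates), set(ref_dates)
    if not pred or not ref:
        return 0.0
    p = len(pred & ref) / len(pred)   # precision of selected dates
    r = len(pred & ref) / len(ref)    # recall of reference dates
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```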

Quantitative Performance
As shown in Table 3, our method outperforms the baselines on all three datasets. The event graph connects events through entities and temporal relations, which enables capturing the correspondence between events and excluding unrelated events. General multi-document summarization and text-graph-based summarization cannot capture the temporal dimension, so their performance is especially low on date F1, agree F1 and align F1. All concat F1 scores are significantly different from the baselines with a p-value less than 0.05. Removing the temporal regularizer results in a consistent performance drop on date F1, showing that our time-aware OT helps select events that are temporally coherent.
We achieve larger gains over the baselines on the Crisis dataset, which has a larger input graph size and compression rate according to Table 2. This demonstrates the effectiveness of our event graph in encoding a large number of documents and performing effective summarization. Compared to Timeline17, the performance gain on Timeline100, which covers more scenarios, is larger, demonstrating the robustness of our event graph compression method. Figure 1 shows an example of a generated timeline compared with the reference timeline and the best-performing baseline. The baseline selects more dates than our approach, which shows that our approach can better detect the salience of dates. We attribute this to our use of event graphs to capture events that are temporally salient. For example, our approach avoids dates that have no associated salient events, such as 2009-06-26. Also, our temporal attributes are more comprehensive and accurate due to attribute propagation through shared arguments. For example, the dates of unconscious and travel in Figure 1 are propagated from the die event via the shared argument Michael Jackson.

Qualitative Analysis
Compared to the baselines, our approach keeps more events in the summary (highlighted in green in Figure 1), while the baseline may produce a summary without events included, e.g., the summary of 2009-06-29.
Compared to the reference timeline, our model successfully detects the salient events during graph compression. Although the release event has connections to multiple events, it is not semantically relevant to the other events, and thus does not receive a large mass during transport. The speak event is not strongly connected to other nodes and is semantically close to interview, so it is not selected in the global decision of the optimal transport plan. Similarly, the born event is omitted due to its large time gap with other events, and the hire event is excluded since it is not semantically related to other events. More examples are included in the Appendix.

Human Evaluation
We follow previous work (Steen and Markert, 2019) in conducting a scoring-based evaluation. We instruct the human annotators to read 15 randomly sampled reference timelines and rate the summaries generated by our system and the baselines on a 1-5 point scale (1 is the worst and 5 is the best). We provide the reference timelines as the gold standard to annotators instead of the input news collections, because each timeline has hundreds of long documents as input, making it hard to judge coverage and to control the scoring standards across multiple annotators. As the evaluation is scoring-based, we ask a single annotator to score all timelines of each topic to guarantee a consistent scoring standard. The order of the annotated timelines is random, and the annotators have no knowledge of which system produced which timeline. Each timeline annotation takes around thirty minutes.
The timelines are evaluated along the following dimensions: (1) general score: the overall quality of the timeline; (2) coverage score: how well the events are covered by the timeline; (3) coherence score: the coherence of the story; (4) temporal preserving score: the quality of key date selection. Table 4 shows that our approach obtains better results on all four measures, showing that our model finds semantically relevant, structurally salient and temporally coherent events.

Discussions
Generation Length. Previous work on timeline summarization (e.g., Martschat and Markert, 2018) relies on the reference timeline to decide the compression parameters, such as the overall length or the number of days. In our model, the number of nodes to keep is decided by the hyperparameter m. Following previous work, we choose m based on the reference compression rate, i.e., the ratio of event nodes in the reference summary to the input event nodes, as detailed in §3.1. Figure 2 shows the relationship between performance and compression rate.
Compression Rate. The summarization performance is affected by the compression rate of the reference summary. Figure 2 shows that our model achieves larger gains compared to baselines on the timeline with higher reference compression rates, demonstrating that our model is able to effectively select salient events for a large input corpus.
Timeline Topics. Figure 2 shows that compression rates do not correlate with timeline topics, and our performance gains over the baselines are not closely tied to timeline topics, demonstrating the robustness of our method.
Input Graph Size. When generating timelines for the same complex event, BP Oil Spill, the performance gain generally increases with the input graph size, as shown in Table 5. This demonstrates the effectiveness of our model in selecting salient information from large graphs.

Related Work
Multi-Document Summarization. Graph-based MDS methods (Barzilay et al., 1999; Erkan and Radev, 2004; Haghighi and Vanderwende, 2009; Ganesan et al., 2010; Banerjee et al., 2015; Yasunaga et al., 2017; Fabbri et al., 2019; Liu and Lapata, 2019) are closely related to timeline summarization but cannot be directly applied, due to the lack of temporal dimensions.

Timeline Summarization. Due to the lack of training data, timeline summarization focuses on extractive methods with heuristics (Yan et al., 2011a,b; Tran et al., 2015; Nguyen et al., 2014; Wang et al., 2016; Martschat and Markert, 2018), with a few abstractive methods (Steen and Markert, 2019; Ansah et al., 2019) that require some gold summaries to work. Both lines of work fail to capture rich event structures and ignore the temporal orders between events. We are the first to use optimal transport on the summarization task to select semantically relevant, structurally salient and temporally coherent events.

Graph Representation of Documents. In general NLP research, various text graphs have been built by augmenting original text sequences with hidden structural information, such as entity-centric graphs for efficient joint encoding of large corpora (Wu et al., 2021; De Cao et al., 2019; Ding et al., 2019; Asai et al., 2020; Min et al., 2019; Das et al., 2019). Event graphs from a single document have been built for event schema induction (Li et al., 2018, 2020b), event coreference resolution (Phung et al., 2021; Zeng et al., 2021), etc. However, they ignore relations between event arguments, or only use hierarchical or temporal relations to connect events. Also, cross-document entity and event coreference resolution are critical for understanding large corpora, while previous work focuses on a single document. Our approach is unique in building event-centric graphs across documents, with rich argument and temporal information.

Conclusions and Future Work
We propose a novel event graph compression framework for timeline summarization and achieve state-of-the-art results on multiple real-world datasets. Our use of event graphs allows for efficient joint encoding of a large number of documents, and our proposed time-aware optimal transport allows unsupervised training of the entire framework. Future work includes extending our approach to abstractive summarization, and adding subevent relations to hierarchically generate the timeline.

A Example Output