Multi-TimeLine Summarization (MTLS): Improving Timeline Summarization by Generating Multiple Summaries

In this paper, we address a novel task, Multiple TimeLine Summarization (MTLS), which extends the flexibility and versatility of TimeLine Summarization (TLS). Given any collection of time-stamped news articles, MTLS automatically discovers important yet different stories and generates a corresponding timeline for each story. To achieve this, we propose a novel unsupervised summarization framework based on two-stage affinity propagation. We also introduce a quantitative evaluation measure for MTLS based on previous TLS evaluation methods. Experimental results show that our MTLS framework is highly effective and that the MTLS task can give better results than TLS.


Introduction
Nowadays, online news articles are among the most popular Web documents. However, because of the huge number of news articles available online, it is becoming difficult for users to effectively search, understand, and track entire news stories. To address this problem, the research area of TimeLine Summarization (TLS) has been established; it alleviates the redundancy and complexity inherent in news article collections, thereby helping users better understand the news landscape.
After the influential work on temporal summaries by Swan and Allan (2000), TLS has attracted researchers' attention. Most work on TLS (Martschat and Markert, 2018; Steen and Markert, 2019; Gholipour Ghalandari and Ifrim, 2020) has focused on improving summarization performance. However, these methods have the following drawbacks: (a) they work essentially on homogeneous datasets, such as those compiled from the search results of an unambiguous query (e.g., "BP Oil Spill"). The requirements imposed on the input dataset make it hard for TLS systems to generalize; (b) the output is usually a single timeline regardless of the size and complexity of the input dataset.
We propose here the Multiple TimeLine Summarization (MTLS) task, which enhances and further generalizes TLS. MTLS automatically generates a set of timelines that summarize disparate yet important stories, rather than always generating a single timeline, as is the case with TLS. An effective MTLS framework should: (a) detect key events, including both short- and long-term events, (b) link events related to the same story and separate events belonging to other stories, and (c) provide informative summaries of constituent events to be incorporated into the generated timelines.
MTLS can also help to deal with ambiguity, which is common in information retrieval. For example, suppose that a user wants to get an overview of news about a basketball player, Michael Jordan, from a large collection of news articles. When a search engine over such a collection takes "Michael Jordan" as a query, it would likely return a mixture of news about different persons with the same name. How, then, can a typical TLS system return meaningful results if only a single timeline can be generated? Similarly, ambiguous queries such as "Apple", "Amazon", and "Java" require MTLS solutions to produce high-quality results.
To address this task, we further propose a Two-Stage Affinity Propagation Summarization framework (2SAPS). It uses temporal information embedded in sentences to discover important events, and the linking information latent in news articles to construct timelines. 2SAPS has several advantages: firstly, it is entirely unsupervised, which is especially suited to TLS-related tasks, as there are very few gold summaries available for training supervised systems; secondly, both the number of events and the number of generated timelines are self-determined. This allows our framework to depend only on the input document collection, instead of on human effort. Furthermore, the current TLS evaluation measures allow only 1-to-1 comparison (system- to human-generated timeline), which is not suitable for the MTLS task, where multiple timelines must be compared to (typically) multiple ground-truth timelines. Therefore, we also propose a quantitative evaluation measure for MTLS based on an adaptation of the previous TLS evaluation framework.
Given these points, our contributions in this work are summarized as follows:
1. We propose a novel task (MTLS), which automatically generates multiple, informative, and diverse timelines from an input time-stamped document collection.
2. We introduce a superior MTLS model that outperforms all TLS-adapted MTLS baselines.
3. We design an evaluation measure for MTLS systems by extending the original TLS evaluation framework.
Related Work

Timeline Summarization
Since the first work on timeline summarization (Swan and Allan, 2000; Allan et al., 2001), this topic has received much attention over the years (Alonso et al., 2009; Yan et al., 2011a; Zhao et al., 2013; Li and Li, 2013; Suzuki and Kobayashi, 2014; Wang et al., 2016; Takamura et al., 2011; Pasquali et al., 2019, 2021). In the following, we review the major approaches. Chieu and Lee (2004) constructed timelines by directly selecting the top-ranked sentences based on the summed similarities within an n-day window. Yan et al. (2011b) proposed evolutionary timeline summarization (ETS) to return the evolution trajectory along the timeline, consisting of individual but correlated summaries of each date. Shahaf et al. (2012) created information maps (Maps) to help users understand domain-specific knowledge. However, the output consists of a set of storylines that have intersections or overlaps, which is not appropriate for a dataset that may contain quite different topics. Nguyen et al. (2014) proposed a pipeline to generate timelines consisting of date selection, sentence clustering, and sentence ranking.
Recently, Martschat and Markert (2018) adapted to the TLS task a submodular function model that was originally used for multi-document summarization (MDS). Duan et al. (2020) introduced the task of Comparative Timeline Summarization (CTS), which captures important comparative aspects of evolutionary trajectories in two input sets of documents. The output of the CTS system is, however, always two timelines generated in a contrastive way. Gholipour Ghalandari and Ifrim (2020) then examined different TLS strategies and categorized TLS frameworks into the following three types: direct summarization approaches, date-wise approaches, and event detection approaches.
To the best of our knowledge, the idea of multiple timeline summarization has not been formally proposed yet. Table 1 compares the related tasks.

Timeline Evaluation
Some works (Yan et al., 2011b; Chen et al., 2019; Duan et al., 2020) evaluate timelines only by computing ROUGE scores (Lin, 2004). This ignores the temporal aspect of a timeline, which is important in timeline summarization. Martschat and Markert (2017) then proposed a framework, called tilse, to assess timelines from both textual and temporal aspects. Subsequent TLS works (Steen and Markert, 2019; Gholipour Ghalandari and Ifrim, 2020; Born et al., 2020) have followed this framework to evaluate their models. Some studies (Tran et al., 2015; Shahaf et al., 2012; Alonso and Shiells, 2013) also involved user studies, in which users are asked to score system-generated timelines based on varying criteria such as relevance and understandability. In Section 5, we will adapt the tilse framework to the MTLS task.

Problem Definition
We formulate the MTLS task as follows:
Input: A time-stamped news article collection $D = \{d_1, d_2, \ldots, d_{|D|}\}$. The collection can be standalone or compiled from search results returned by a news search engine.
Output: A set of timelines $\mathcal{T} = \{T_1, T_2, \ldots, T_k\}$ generated from $D$, such that each timeline $T_i$ ($i = 1, \ldots, k$) is a sequence of time/date and summary pairs $(t^{T_i}_1, s^{T_i}_1), \ldots, (t^{T_i}_l, s^{T_i}_l)$, where $s^{T_i}_j$ are the summary sentences for the time $t^{T_i}_j$ ($j = 1, \ldots, l$) and $l$ is the length of $T_i$. Each timeline in $\mathcal{T}$ should be consistent and coherent, yet different from the other timelines.
[Table 1: comparison of related tasks; recoverable rows: ETS (Yan et al., 2011b), Maps (Shahaf et al., 2012), MTLS (proposed task).]
We note that while the traditional TLS task is limited in that its input document collection is typically coherent and homogeneous, MTLS is more flexible, as the input news collection can be diverse. For example, the input collection can be generated using a search query q composed of multiple entities or concepts, like q = {"egypt", "h1n1", "iraq"}, or using an ambiguous query, like q = {"michael", "jordan"}, or it can consist of news articles crawled over a certain time span from multiple news sources. Generally, the more heterogeneous $D$ is, the more timelines could be produced. The intuition behind this idea is that users need more structured information to help them understand a relatively complex document collection.

Framework
Next, we present two key components of our framework: event generation module (Sec. 4.1) and timeline generation module (Sec. 4.2).
We first make the following two assumptions:
Assumption 1: News articles sometimes retrospectively mention past events to provide necessary context for the target event, e.g., to underline continuation or causality.
Assumption 2: Sentences mentioning similar dates have a higher probability of referring to the same event than sentences with different dates.

Event Generation Module
In this module, we extract important historical events from a document collection. Gholipour Ghalandari and Ifrim (2020) constructed events by simply grouping articles with close publication dates into clusters, resulting in lower accuracy. Note that Assumption 1 implies that a single news article may contain multiple events. Accordingly, in our work, the concept of an event is more fine-grained. We define an event as a set of sentences that describe the same real-world occurrence, typically using the same identifying information (e.g., actions, entities, locations). This information is captured by sentence-BERT (Reimers and Gurevych, 2019), a pre-trained transformer-based model that places sentences with similar meanings close together in a semantic vector space. We then employ Affinity Propagation (AP) (Frey and Dueck, 2007), following Steen and Markert (2019), to cluster similar sentences. The AP algorithm groups data points by selecting a set of exemplars along with their followers through message passing. It operates over an affinity matrix $S$, where $S(i, j)$ denotes the similarity between data points $x_i$ and $x_j$.
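To make the message-passing idea concrete, the following is a minimal pure-Python sketch of affinity propagation using the standard responsibility and availability updates of Frey and Dueck (2007). It is an illustration only and not necessarily the implementation used in this work: the function name `affinity_propagation`, the damping factor, and the fixed number of iterations are assumptions, and the diagonal of `S` is assumed to hold the preference of each point to become an exemplar.

```python
def affinity_propagation(S, damping=0.5, iterations=200):
    """Cluster n >= 2 points given an n x n similarity matrix S.

    S[i][i] is the 'preference' of point i to be an exemplar.
    Returns, for every point, the index of the exemplar it chooses.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)

    for _ in range(iterations):
        # r(i, k) = s(i, k) - max_{k' != k} (a(i, k') + s(i, k'))
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                new = S[i][k] - best_other
                R[i][k] = damping * R[i][k] + (1 - damping) * new

        # a(i, k) = min(0, r(k, k) + sum_{i' not in {i, k}} max(0, r(i', k)))
        # a(k, k) = sum_{i' != k} max(0, r(i', k))
        for k in range(n):
            positive = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(positive) - positive[k]  # excludes i' = k
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, R[k][k] + total - positive[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new

    # Each point follows the candidate exemplar maximising a(i, k) + r(i, k).
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]


if __name__ == "__main__":
    points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
    S = [[-(a - b) ** 2 for b in points] for a in points]
    for i in range(len(points)):
        S[i][i] = -10.0  # hypothetical preference value
    print(affinity_propagation(S))
```

The number of clusters is not passed in; it emerges from the preference values on the diagonal, which is the property that lets the number of events be self-determined.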
We observe that high semantic similarity does not always guarantee that sentences refer to the same event. In particular, for periodic events, similar happenings might have occurred several times. For example, a news article could include sentences reporting that Brazil won the World Cup (in 2002), while other sentences in the same document could recall that Brazil had won the World Cup in 1994. Clearly, those sentences describe two distinct events, which would be grouped into one event if only semantic similarity were considered.
Therefore, based on Assumption 2, we introduce another key factor, temporal similarity, which increases the confidence that two sentences refer to the same event. Each element $S_1(v_i, v_j)$ of the affinity matrix $S_1$ is defined by Equation (1), where $v_i$ and $v_j$ denote different sentences, and $t_i$ and $t_j$ denote the dates mentioned by $v_i$ and $v_j$, respectively. In addition, $S_{date}$ and $S_{cos}$ denote the temporal and semantic similarities, respectively. While we employ cosine similarity for the semantic similarity, we define the temporal similarity $S_{date}(i, j)$, which quantifies how similar two dates are, by Equation (2), where $\gamma$ is the decay rate of the exponential function. The larger the time gap between two dates, the smaller the value of $S_{date}$. By passing messages of both semantic and temporal information between sentences, clusters consisting of exemplar and non-exemplar sentences are constructed to form the candidate event set $E$. Each cluster represents an event.
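The combination of temporal and semantic similarity can be illustrated with a short sketch. Since the exact forms of Equations (1) and (2) are not reproduced here, the sketch makes two explicit assumptions: the two similarities are mixed by a convex combination with weight `alpha1`, and the temporal similarity decays exponentially with the gap in days at rate `gamma`. The function names and default values are hypothetical.

```python
import math
from datetime import date


def cosine(u, v):
    """Cosine similarity between two dense vectors (e.g., sentence embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def date_similarity(t_i, t_j, gamma=0.1):
    """ASSUMED form: exponential decay over the absolute gap in days."""
    return math.exp(-gamma * abs((t_i - t_j).days))


def sentence_affinity(vec_i, vec_j, t_i, t_j, alpha1=0.3, gamma=0.1):
    """ASSUMED form: convex combination of temporal and semantic similarity."""
    temporal = date_similarity(t_i, t_j, gamma)
    semantic = cosine(vec_i, vec_j)
    return alpha1 * temporal + (1 - alpha1) * semantic


if __name__ == "__main__":
    # Two sentences with identical (toy) embeddings that mention different years.
    emb = [0.2, 0.5, 0.1]
    same_day = sentence_affinity(emb, emb, date(2002, 6, 30), date(2002, 6, 30))
    years_apart = sentence_affinity(emb, emb, date(2002, 6, 30), date(1994, 7, 17))
    print(same_day, years_apart)
```

Under these assumptions, semantically identical sentences that mention dates years apart receive a lower affinity than those mentioning the same date, which is the behaviour needed to separate the two World Cup events in the example above.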
Event Selection. A timeline need not show all events of a story, as users usually care only about the most important ones. We therefore design an event selection step to handle an excessive number of events. The selection relies on two measures, Salience and Consistency, defined by Equations (3) and (4), respectively, where $v_e$ is the exemplar sentence of event $e$, and $|e|$ and $|D|$ denote the number of sentences in $e$ and in the document collection $D$, respectively. Intuitively, important historical events are often mentioned by later news reports. The Salience of an event evaluates such importance and is computed as the relative frequency of sentences about that event among all sentences in the collection. Consistency, on the other hand, ensures high quality of events. We then rank all candidate events by the weighted sum of these two measures. Hereafter, we denote the weight of Event Salience as $\zeta_1$ and that of Event Consistency as $1 - \zeta_1$.
We obtain a new event set $E^*$ by selecting the top-scored events according to a threshold. To avoid tuning its value, we set the threshold to one standard deviation below the mean.
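The selection step can be sketched as follows. Salience follows the description above (the share of the collection's sentences that belong to the event); the Consistency value is taken as a precomputed input because its exact definition is not reproduced here. The use of the population standard deviation, the inclusive comparison, and all names and default values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def event_score(n_event_sentences, n_all_sentences, consistency, zeta1=0.3):
    """Weighted sum of Event Salience and Event Consistency."""
    salience = n_event_sentences / n_all_sentences  # relative frequency |e| / |D|
    return zeta1 * salience + (1 - zeta1) * consistency


def select_events(scores):
    """Keep events scoring at least (mean - one standard deviation).

    `scores` maps an event identifier to its weighted score.
    ASSUMPTIONS: population standard deviation, inclusive threshold.
    """
    values = list(scores.values())
    threshold = mean(values) - pstdev(values)
    return {event for event, score in scores.items() if score >= threshold}


if __name__ == "__main__":
    toy_scores = {"e1": 0.90, "e2": 0.80, "e3": 0.85, "e4": 0.10}
    print(sorted(select_events(toy_scores)))
```

Because the threshold is derived from the score distribution itself, no cut-off has to be tuned per dataset.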

Timeline Generation Module
While TLS systems directly link all the identified events, MTLS requires a deeper understanding of them. As described in Section 1, an effective MTLS framework should link events related to the same story and separate unrelated events into different timelines. To achieve this, this module consists of the following steps: Event Linking, Timeline Selection, and Timeline Summarizing.
Event Linking. According to Assumption 1, current events can refer to related past events. We thus define a reference matrix $R$, in which each element $R(e_i, e_j)$ denotes the degree of reference between two events $e_i$ and $e_j$. As events in our work are represented by sentences and a sentence belongs to a single event, $R(e_i, e_j)$ can be obtained by counting patterns of sentence co-occurrences in documents. Formally, $R(v_i, v_j)$, defined by Equation (5), represents the case where two sentences $v_i$ and $v_j$ refer to each other, where $d$ is an article and $e_k$ and $e_l$ are elements of $E^*$. The degree of reference between $e_i$ and $e_j$ is then defined by Equation (6), where $|e_i|$ and $|e_j|$ are the sizes of $e_i$ and $e_j$, respectively. We then construct a graph of events in which each node is an event $e \in E^*$ and the value of an edge reflects the degree of connection between a pair of events. We reuse the AP algorithm to detect communities of events over the affinity matrix $S_2$ defined by Equation (7), where $S_{cos}(e_i, e_j)$ denotes the cosine similarity between $e_i$ and $e_j$, capturing semantic similarity. Based on the affinity matrix $S_2$, AP finally generates clusters, i.e., the initial timeline set $\mathcal{T}$.
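A rough sketch of the linking affinity is given below. Because Equations (5) to (7) are not reproduced here, every formula in the sketch is an assumption: two sentences are taken to refer to each other when they occur in the same article, the event-level degree of reference is the number of such sentence pairs normalised by $|e_i| \cdot |e_j|$, and $S_2$ is a convex combination with weight `alpha2`. The names `reference_degree` and `linking_affinity` are hypothetical.

```python
def reference_degree(event_i, event_j):
    """ASSUMED degree of reference between two events.

    Each event is a list of (doc_id, sentence_id) pairs. A sentence pair counts
    as a mutual reference when both sentences come from the same article.
    """
    co_occurrences = sum(
        1
        for doc_i, _ in event_i
        for doc_j, _ in event_j
        if doc_i == doc_j
    )
    return co_occurrences / (len(event_i) * len(event_j))


def linking_affinity(reference, semantic, alpha2=0.8):
    """ASSUMED form of S_2: convex combination of reference and cosine similarity."""
    return alpha2 * reference + (1 - alpha2) * semantic


if __name__ == "__main__":
    e1 = [("d1", 0), ("d2", 3)]
    e2 = [("d1", 5), ("d3", 1)]
    ref = reference_degree(e1, e2)
    print(ref, linking_affinity(ref, semantic=0.5))
```

The resulting pairwise values would fill the affinity matrix passed to the second AP stage, whose clusters are the initial timelines.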
Timeline Selection. To ensure the quality of the constructed timelines, we define criteria to select high-quality timelines from $\mathcal{T}$. Similar to the event selection described in Section 4.1, we use two indicators to evaluate the quality of a timeline. We define Timeline Salience as the average Event Salience of all events within the timeline, and Timeline Coherence as the average of the semantic similarity scores between chronologically adjacent events, as given by Equation (8), where $|T|$ is the size of a timeline, i.e., the number of events in the timeline. Intuitively, important timelines, which reflect important stories in the document collection, are more likely to be preferred by users. Timeline Salience captures this importance by aggregating the importance of its components (i.e., events), while Timeline Coherence ensures that the story expressed by the timeline is consistent.
We rank timelines based on a weighted sum of Timeline Salience and Timeline Coherence. The weight of Timeline Salience is denoted as $\zeta_2$; thus, the weight of Timeline Coherence is $1 - \zeta_2$. We then select the top-scored elements from the timeline set $\mathcal{T}$ based on a threshold. As before, we set the threshold to one standard deviation from the mean.
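The two timeline indicators follow directly from their verbal definitions and can be sketched as below. The event vectors are assumed to be given in chronological order (for example, the embedding of each event's exemplar sentence); returning a coherence of 0 for a single-event timeline, as well as all names and default weights, are assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def timeline_salience(event_saliences):
    """Average Event Salience over the events of one timeline."""
    return sum(event_saliences) / len(event_saliences)


def timeline_coherence(event_vectors):
    """Average cosine similarity between chronologically adjacent events."""
    pairs = list(zip(event_vectors, event_vectors[1:]))
    if not pairs:  # ASSUMPTION: a single-event timeline gets coherence 0
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def timeline_score(event_saliences, event_vectors, zeta2=0.3):
    """Weighted sum used to rank timelines."""
    return (zeta2 * timeline_salience(event_saliences)
            + (1 - zeta2) * timeline_coherence(event_vectors))


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(timeline_score([0.2, 0.4, 0.3], vectors))
```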
Timeline Summarizing. After the previous steps, we have obtained multiple timelines $\{T_1, T_2, \ldots\}$, where each timeline $T$ is a list of events $\{e_1, e_2, \ldots\}$. However, it is not feasible to show the entire content of each event $e$, as it usually contains many sentences. We use only the exemplar sentence of each event, since the exemplar is the most typical and representative member of the group.
In addition, it is possible that two events e i and e j occur on the same day. In this case, we concatenate their exemplar sentences.
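The summarizing step amounts to grouping exemplar sentences by date. A minimal sketch, assuming each event is represented as a hypothetical `(date, exemplar_sentence)` pair and that same-day exemplars are joined with a single space:

```python
from collections import defaultdict
from datetime import date


def summarize_timeline(events):
    """Turn (date, exemplar sentence) pairs into dated summaries.

    Exemplars of events that fall on the same day are concatenated in the
    order in which the events are given.
    """
    by_date = defaultdict(list)
    for day, exemplar in events:
        by_date[day].append(exemplar)
    return [(day, " ".join(by_date[day])) for day in sorted(by_date)]


if __name__ == "__main__":
    toy_events = [
        (date(2011, 2, 11), "Exemplar sentence of event B."),
        (date(2011, 1, 25), "Exemplar sentence of event A."),
        (date(2011, 2, 11), "Exemplar sentence of event C."),
    ]
    for day, summary in summarize_timeline(toy_events):
        print(day.isoformat(), summary)
```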
Timeline Tagging. This step is an add-on to MTLS systems. To better understand the stories of constructed timelines, we believe that it should be helpful for users to also obtain a label for each timeline. As described in Section 1, the input document collection may be composed of different topics or of one topic discussed through different aspects. For example, among the timelines generated based on the topic syria, one timeline might summarize the story about Syrian civil war while another might be about Syrian political elections. A label should then help people understand the story of the timeline. We simply select the 3 most frequent words among events (excluding stopwords) for each timeline as its label.
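The tagging step can be sketched with a simple word count. The tokenizer (lowercased alphabetic tokens) and the tiny stopword list below are placeholders chosen for illustration; they are assumptions, not the resources used in this work.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list for the example.
STOPWORDS = {"the", "in", "of", "a", "an", "and", "to", "on", "for", "is"}


def tag_timeline(event_sentences, n_words=3):
    """Return the n most frequent non-stopword tokens as the timeline label."""
    counts = Counter()
    for sentence in event_sentences:
        for token in re.findall(r"[a-z]+", sentence.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return [word for word, _ in counts.most_common(n_words)]


if __name__ == "__main__":
    sentences = [
        "Syria war escalates in the north",
        "The war in Syria continues",
        "Rebels in Syria report war losses",
        "Rebels advance",
    ]
    print(tag_timeline(sentences))
```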

TLS evaluation relies on ROUGE score and its variants as follows:
Concatenation-based ROUGE (concat). It considers only the textual overlap between concatenated system summaries and ground-truth, while ignoring all date information of timeline (Yan et al., 2011b;Nguyen et al., 2014;Wang et al., 2016).
Date-agreement ROUGE (agreement). It measures both textual and temporal overlap by computing the ROUGE score only when a date in the system-generated timeline matches a date in the ground-truth timeline. Otherwise, its value is 0.
Alignment-based ROUGE. It linearly penalizes the ROUGE score by the distance between dates and/or summary contents. Martschat and Markert (2017) proposed three types of this metric: align, align+, and align+m:1 (alignment by date, alignment by date and contents, and alignment by date and contents where the mapping function is non-injective, respectively).
Date selection (d-select). It evaluates how well the model works in selecting correct dates in the ground-truth (Martschat and Markert, 2018).

MTLS evaluation
The evaluation methods for TLS cannot directly assess the performance of MTLS systems, as there are multiple output timelines and multiple ground-truth timelines. Concretely, given an input collection $D$, the corresponding ground-truth timeline set $\mathcal{G} = \{G_1, G_2, \ldots, G_{k_1}\}$ ($k_1 \geq 1$), and the system-generated timeline set $\mathcal{T} = \{T_1, T_2, \ldots, T_{k_2}\}$ ($k_2 \geq 1$), the evaluation metrics need to automatically "match" a ground-truth timeline when evaluating $T_i$. Therefore, we make the system find the closest ground-truth timeline $G^*$ to a timeline $T$ as follows:
$$G^* = \arg\max_{G \in \mathcal{G}} f_m(T, G) \quad (9)$$
where $f_m$ is the TLS evaluation function that computes the score between $T$ and $G$ based on metric $m$, which can be concat, agreement, align, align+, align+m:1, or d-select. The overall performance of an MTLS model is then computed by averaging the resulting scores over all members of $\mathcal{T}$.
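The matching-and-averaging procedure can be sketched as below. The function `mtls_score` accepts any pairwise TLS metric `f_m`; the Jaccard overlap of date sets used in the example is only a hypothetical stand-in for the tilse metrics and is not one of the measures reported in our experiments.

```python
def mtls_score(system_timelines, gold_timelines, f_m):
    """Average, over system timelines, of the score against the best-matching gold timeline."""
    best_scores = [
        max(f_m(timeline, gold) for gold in gold_timelines)
        for timeline in system_timelines
    ]
    return sum(best_scores) / len(best_scores)


def date_jaccard(timeline, gold):
    """Toy stand-in metric: Jaccard overlap between two sets of dates."""
    union = timeline | gold
    return len(timeline & gold) / len(union) if union else 0.0


if __name__ == "__main__":
    system = [{"2011-01-25", "2011-02-11"}, {"2010-04-20"}]
    gold = [{"2011-01-25"}, {"2010-04-20", "2010-07-15"}]
    print(mtls_score(system, gold, date_jaccard))
```

Note that several system timelines may be matched to the same ground-truth timeline under this scheme, since each system timeline is matched independently.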

Experimental Setup
The goal of our experiments is to answer the following research questions (RQs):
RQ1: Do MTLS models produce more meaningful output than TLS models?
RQ2: How does the 2SAPS framework perform on the MTLS task compared with other MTLS baselines?
RQ3: How effective are the components of the modules in 2SAPS? How do parameter changes in the model affect the results?

Datasets
We note that there is no available dataset for the MTLS task; thus, we construct MTLS datasets by extending existing TLS datasets. Tran et al. released the Timeline17 (Binh Tran et al., 2013) and Crisis (Tran et al., 2015) datasets for TLS over news articles. The construction proceeds as follows: (1) set the number of topics L used to generate a new dataset; (2) from the TLS datasets, randomly choose L topics, then merge their document collections into a new dataset D and group their associated ground-truth timelines into G (if a topic has multiple ground-truth timelines, we pick the one whose length is closest to the average length of the timelines for that topic); (3) repeat steps (1) and (2). Here, the value of L reflects the complexity of the dataset: the more topics the dataset contains, the more complex it is. We repeated steps (1)-(3) on Timeline17 and finally created 25 datasets, as shown in Table 3. We note that Crisis contains only 4 topics, resulting in few possible combinations, so we decided to skip it. Timeline17 contains 9 document collections, covering the following topics: "BP Oil Spill" (bpoil), "Influenza H1N1" (h1n1), "Michael Jackson death" (mj), "Libyan War" (libya), "Egyptian Protest" (egypt), "Financial Crisis" (finan), "Haiti Earthquake" (haiti), "Iraq War" (iraq), "Syrian Crisis" (syria).

Baselines
As there are no ready-made models for the MTLS task, we design the baselines as "divide-and-summarize" approaches. The underlying idea is to first segment the input dataset into sub-datasets (subsequently called segments) using partition/division algorithms, and then adopt TLS techniques to generate a timeline for each segment. We now describe the choices for each step.
Dataset Division Approaches:
• Random. We randomly choose the number of segments between 1 and 10. Then, we assign each news article to a random segment.
• LDA (Latent Dirichlet Allocation) (Blei et al., 2003). Given a dataset, we first use LDA to detect the main topics in the dataset. Then, we assign each news article to its dominant topic.

TLS Approaches:
• CHIEU2004 (Chieu and Lee, 2004): A frequently used unsupervised TLS baseline that selects the top-ranked sentences based on summed similarities within an n-day window.
• MARTSCHAT2018 (Martschat and Markert, 2018): One of the state-of-the-art TLS models and also the first work to establish formal experimental settings for the TLS task. We use the implementation provided by the authors.
• GHALANDARI2020 (Gholipour Ghalandari and Ifrim, 2020): It constructs a timeline by first predicting the important dates via a simple regression model and then selecting important sentences for each date.
We combine the above 3 dataset division approaches and 3 TLS approaches, yielding 9 baselines.
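The "divide-and-summarize" scheme and the resulting baseline grid can be sketched as follows. Only the Random division is implemented; the `summarize` callable is a hypothetical placeholder for any TLS system, and dropping empty segments is an assumption of this sketch.

```python
import random
from itertools import product

DIVISION_APPROACHES = ["Random", "LDA", "k-means"]
TLS_APPROACHES = ["CHIEU2004", "MARTSCHAT2018", "GHALANDARI2020"]

# Every division approach is paired with every TLS approach.
BASELINES = [f"{d}-{t}" for d, t in product(DIVISION_APPROACHES, TLS_APPROACHES)]


def random_division(articles, rng):
    """Assign each article to one of a random number (1 to 10) of segments."""
    n_segments = rng.randint(1, 10)
    segments = [[] for _ in range(n_segments)]
    for article in articles:
        segments[rng.randrange(n_segments)].append(article)
    return [seg for seg in segments if seg]  # ASSUMPTION: drop empty segments


def divide_and_summarize(articles, divide, summarize):
    """Generate one timeline per segment produced by the division step."""
    return [summarize(segment) for segment in divide(articles)]


if __name__ == "__main__":
    rng = random.Random(0)
    docs = [f"article-{i}" for i in range(20)]
    # Placeholder 'TLS system' that just reports the segment size.
    timelines = divide_and_summarize(
        docs, lambda a: random_division(a, rng), lambda seg: len(seg)
    )
    print(BASELINES)
    print(timelines)
```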

Experimental Settings
Given the characteristics of the MTLS task and our datasets, the experimental settings differ from the TLS settings applied in Martschat and Markert (2018). In particular, the settings are:
• When generating timelines, none of the compared models knows the actual value of L (i.e., L is not given as input). The stratification in Table 3 is shown only to explain the datasets' construction method to the reader.
• For the dataset-division algorithms, LDA and k-means, we use different techniques to find optimal number of segments. For LDA, we evaluate topic coherence measure (C v score) (Röder et al., 2015) for topic numbers ranging from 1 to 10, and then choose the optimal number. For k-means, we use silhouette value (Rousseeuw, 1987) to determine the optimal number of segments.
• None of the compared methods takes information about the ground truth as input. That is, the number of dates, the average number of summary sentences per date, the total number of summary sentences, and the ground-truth start and end dates are all unknown.
• We set the length of timelines to 20 and summary length to 2 sentences per date.

MTLS vs. TLS
We first address RQ1 to show the necessity of MTLS and to demonstrate that TLS performs poorly when an input dataset contains a mixture of documents on different stories. To this end, we compare the results of MTLS baselines with a standard TLS approach. Table 4 shows the performance comparison between TLS and MTLS baselines based on MARTSCHAT2018. For a fair comparison in this first experiment, we select from the MTLS outputs only the one timeline that is most similar to the timeline generated by TLS. We observe that when L = 1 and L = 2, MTLS underperforms TLS by 15.1% and 4.8% in terms of align+m:1 ROUGE-1, respectively. However, it outperforms TLS by 150%, 117.1%, and 94.7% when L equals 3, 4, and 5, respectively. This indicates that as the complexity of the input document collection increases (higher L values), TLS systems do not produce good results compared with MTLS ones. In real-world scenarios, it is rather rare that the input dataset is clean enough to contain only a single topic. Thus, these results suggest that the MTLS approach should in practice be more useful than TLS. The results for the other two TLS algorithms introduced in Section 6.2 show a similar trend. Example outputs of TLS and MTLS systems are also available as supplementary materials.

Performance of 2SAPS
We now investigate the performance of our framework to answer RQ2. Table 5 shows the overall performance of MTLS systems. We observe that 2SAPS achieves the best performance in terms of all ROUGE metrics. In particular, when compared with CHIEU2004, MARTSCHAT2018, and GHALANDARI2020 in terms of concat ROUGE-1 score, it outperforms them by 52.9%, 12.2%, and 16.4%, respectively. We also observe that GHALANDARI2020 achieves the best performance among the baselines except for concat ROUGE-1. Furthermore, it is worth noting that k-means works best in dividing datasets. On average, k-means outperforms Random and LDA by 15% and 7.2%, respectively, in terms of concat ROUGE-1. Finally, compared with the best-performing baseline, k-means-GHALANDARI2020, 2SAPS improves by 9.9%, 15.1%, 0%, 10%, 4.7%, 3.6%, and 19.1% in terms of concat (ROUGE-1, ROUGE-2), align+m:1 (ROUGE-1, ROUGE-2), agreement (ROUGE-1, ROUGE-2), and d-select, respectively.

Ablation Study
We turn to the first part of RQ3. We conduct ablation tests on the Event Selection (ES) and Timeline Selection (TS) components. Table 6 shows the resulting changes in performance. We observe that without ES, the d-select and align+m:1 ROUGE-2 scores decrease by 14.6% and 42.2%, respectively, compared with 2SAPS. A plausible reason is that without ES, many unimportant dates and events are included in a timeline, so that correct dates are selected less reliably. On the other hand, without the TS component, the generated timeline set tends to contain noisy timelines, causing ROUGE-1 to drop by 18.8%.

Parameter Impact
We now analyze the impact of the key parameters $\alpha_1$, $\alpha_2$, $\zeta_1$, and $\zeta_2$. $\alpha_1$ and $\alpha_2$ directly influence the quality of the generated events and timelines, while $\zeta_1$ and $\zeta_2$ indirectly affect the model's performance by controlling the selection steps. Figure 1 shows the performance of 2SAPS under concat ROUGE-1, align+m:1 ROUGE-1, and agreement ROUGE-1.
In particular, we observe that a smaller value of $\alpha_1$ (from 0.1 to 0.4) gives better results than a larger value (Figure 1a). When $\alpha_1$ reaches 1, the AP algorithm does not converge, and the values of all measures become 0. A plausible reason is that when sentence dates are very close, the elements of the affinity matrix differ only slightly, resulting in non-convergence. Figure 1b shows the impact of the reference relation in linking events. The values of all metrics increase as $\alpha_2$ increases. This makes sense, as the reference relation plays an important role in linking events into timelines, so a higher value is necessary. However, when $\alpha_2$ exceeds 0.9, the performance drops, because when news articles provide few contextual events (e.g., background events or related events), the reference relation between events becomes unreliable. $\zeta_1$ controls the impact of Event Salience described in Section 4.1. The other corresponding factor is Event Consistency, which is weighted by $1 - \zeta_1$. Figure 1c shows that models with larger values of $\zeta_1$ underperform those with relatively small values of $\zeta_1$ (from 0.2 to 0.4), indicating that the consistency of an event matters more than its salience in selecting high-quality events. Finally, in Figure 1d, we observe that as $\zeta_2$ increases, the performance on all metrics decreases, suggesting that the coherence of a timeline is more effective than its salience in selecting good timelines.

Limitations
Our 2SAPS model works essentially at the level of sentences and constructs a graph in which each sentence is a node and each edge is the relation between two sentences. It therefore has a complexity of $O(n^2)$, where $n$ is the number of sentences. Future work could address this by simplifying the graph structure and providing approximate solutions so that large datasets can also be processed. Another solution is to select only the important sentences from news articles using a combination of classification, summarization, or filtering.

Conclusions
We introduced the MTLS task to generalize the timeline summarization problem. MTLS improves the performance of timeline summarization by generating multiple summaries. We conducted experiments to first show that, given a heterogeneous time-stamped news article collection, TLS usually does not produce satisfactory results. We further proposed 2SAPS, a two-stage clustering-based framework, to effectively solve the MTLS task. Furthermore, we extended TLS datasets to MTLS datasets and introduced a novel evaluation measure for MTLS. Experimental results show that 2SAPS outperforms MTLS baselines that follow the "divide-and-summarize" strategy. Our work significantly improves the generalization ability of timeline summarization and can provide users with easier access to news collections. As an unsupervised approach that does not require costly training data, it can potentially be applied to other datasets and languages.
In future work, we plan to test our approach on additional MTLS datasets. We will also investigate scenarios in which MTLS can enhance information retrieval systems operating over news article collections. For users searching over large temporal collections, structuring the returned results into a series of timelines could prove beneficial, instead of returning a usual list of interwoven documents that relate to different stories or periods.