Interactively Learning to Summarise Timelines by Reinforcement Learning

Timeline summarisation (TLS) aims to create a time-ordered list of summaries that concisely describes a series of events with their corresponding dates. It differs from general summarisation tasks because the method must capture temporal information in addition to the main ideas of the input documents. This paper proposes a TLS system that interactively learns from user feedback via reinforcement learning and generates timelines satisfying the user's interests. We define a compound reward function that updates automatically according to the feedback received through interaction with the user. The system uses this reward function to fine-tune an abstractive summarisation model via reinforcement learning, encouraging topical coherence, factual consistency and linguistic fluency in the generated summaries. The proposed system avoids the need for reference summaries in training and can adapt to preference feedback from individual users. Experiments show that our system outperforms the baseline on the benchmark TLS dataset and generates accurate and precise timelines that better satisfy real users.


Introduction
Notable news events can span long periods of time. To present an event's development legibly, timelines are often used to list sub-events in chronological order. For instance, before COVID-19, H1N1 swine flu caused a worldwide pandemic; Figure 1 shows the timeline of its early stage. Now, influenced by COVID-19, when people try to recall the precautions applied during the last worldwide epidemic, the sheer number of relevant news articles makes them hard to sort through, even for experts. On the other hand, even where existing timelines such as Figure 1 are available, they often only briefly narrate the development of the sub-event series. For example, it is easy to find that some cases travelled to the UK by flights, but it is still hard for UK readers to obtain details about early precautions relevant to their direct concerns, which lowers the timeline's value to those readers. The problem of huge document collections can be addressed by summarisation techniques, which scan the corpus and identify informative text within it. However, timelines are a special type of summary that is chronologically structured: sub-events subordinate to the main event are summarised individually, each with a corresponding time tag. General summarisation methods, therefore, are usually not directly applicable to TLS tasks. Martschat and Markert (2018) introduced a temporal module into prior summarisation methods to generate timeline summaries. Other prior works (Steen and Markert, 2019; Yu et al., 2021) first detected sub-events and then produced summaries with assigned dates to constitute a timeline. We take inspiration from them and apply clustering algorithms in our system to detect sub-events.
Most prior TLS methods (Gholipour Ghalandari and Ifrim, 2020; Martschat and Markert, 2018) focused on generating extractive timeline summaries, whose content consists of raw sentences copied from the input document collection. Although the syntax of each copied sentence is guaranteed to be correct, the selected sentences can also carry irrelevant and redundant information, and transitions between sentences can be abrupt because connecting sentences are often deleted. Abstractive summarisation approaches are well suited to solving this issue, but prior works have not applied them to timelines because the benchmark TLS dataset is too small for training an abstractive summariser, which is usually based on a deep neural network. This motivates us to design a method that uses abstractive summarisation to generate human-like timeline summaries.
In recent years, reinforcement learning has drawn much attention in academia. Given proper reward signals, it can fine-tune a model without reference data, thus avoiding the need for large training corpora. Furthermore, with an appropriate interaction scheme and reward function, interactive learning can be integrated with reinforcement learning, enabling the model to be trained towards a goal set by the user.
In this paper, we propose an abstractive TLS system that can interact with human users and learn from their feedback via reinforcement learning. Our contributions can be summarised as follows: 1. We introduce interactive learning and reinforcement learning into a new TLS method, which enables the system to generate tailored timeline summaries without the need for a large training corpus. At the core of our proposal is a modular reward function that incorporates keyword and preference feedback.
2. We evaluate neural abstractive summarisation for TLS for the first time, finding that it improves automated and human evaluation metrics but has issues with factual consistency and repeated content.
3. Our experimental results show that interacting with summaries by specifying keywords and pairwise preferences can improve the quality of summaries. Keywords can guide the model to generate specific details, while pairwise preferences help the model learn complex requirements from the user.
Our proposed system outperforms the strong baseline TLS method, CLUST (Gholipour Ghalandari and Ifrim, 2020), on the benchmark dataset Timeline17. All experimental code will be released on acceptance.

Automatic Summarisation
Automatic summarisation tasks can be categorised into two types, extractive and abstractive summarisation, according to how the output is generated. Research on the former started earlier. Mihalcea and Tarau (2004) used graph-based ranking algorithms to select a set of important sentences as the summary of the input text, an approach that remains a widely used baseline today. In the same period, researchers also explored graph-based algorithms for generating abstractive summaries (Ganesan et al., 2010). With the rapid development of machine learning, computers have made impressive progress on text representation, which enables algorithms to evaluate sentences from various perspectives beyond surface similarity and thus derive better metrics for selecting key sentences. For example, Liu et al. (2015) applied sparse coding to compute the coverage, sparsity, and diversity of each sentence, and thereby covered the main ideas of an input document with a minimum number of sentences. Recent works in automatic summarisation often involve deep learning. Deep neural networks, such as BERT (Devlin et al., 2019), have been widely applied to both summarisation tasks: they provide strong word embeddings (Zheng and Lapata, 2019) and can classify whether an input sentence should appear in the summary (Liu, 2019). Transformer networks have also empowered language models to learn from huge corpora and generate human-like abstractive summaries (Raffel et al., 2019).

Reinforcement Learning in Natural Language Generation

The application of reinforcement learning to Natural Language Generation tasks has seen some success recently. These approaches can encourage certain necessary properties of the generated output by designing corresponding reward signals. Song et al. (2020) and Mesgar et al. (2020) used reinforcement learning to improve the factual consistency of dialogue systems. Prior work on automatic summarisation (Gao et al., 2018) applied interactive learning to learn reward functions from users. In this setting, the human user is not required to directly provide the reward for hundreds of episodes of learning: reinforcement learning agents learn the policy offline using a reward function learned from a small number of user-supplied labels.

Background
The method proposed in this paper focuses on adapting an abstractive summariser to TLS tasks. The workflow for detecting sub-events of the timeline follows the baseline, CLUST (Gholipour Ghalandari and Ifrim, 2020), one of the state-of-the-art event detection TLS methods. To lay a solid foundation for our system, this section recaps its necessary details, the improvements we made on top of it, and important concepts of reinforcement learning in Natural Language Generation.

CLUST workflow
CLUST divides the TLS task into two stages, described as follows: 1. The first stage identifies mentioned dates and detects sub-events from the document collections.
(a) Apply Heideltime (Strötgen and Gertz, 2010), an advanced temporal tagging tool, to annotate all the date expressions in the raw documents. In CLUST, the document vectors used for clustering are computed by TF-IDF, which dense word embeddings outperform on many tasks, such as computing semantic textual similarity. We therefore use Sentence-BERT (Reimers and Gurevych, 2019) to embed the sentences of each document into 768-dimensional vectors and take the average sentence vector as the document vector.
As for the clustering algorithm, we replace Markov Clustering with Affinity Propagation (Frey and Dueck, 2007), as it does not require the user to specify the number of clusters and is widely used for clustering complex, unstructured datasets. To obtain better clustering results, we retain the idea of temporal constraints from CLUST and adapt it to Affinity Propagation: news articles with close publication dates are more likely to describe the same sub-event, whereas articles with distant publication dates, even if they are adjacent in semantic space, should be treated as describing similar but distinct sub-events. For example, a series of aftershocks often follows a main earthquake; although many aftershocks may have similar magnitudes, they should be recorded separately in the timeline by the dates on which they occurred.
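The temporal constraint can be realised as a penalty on the pairwise similarity matrix before clustering. The sketch below is only illustrative: it assumes the document vectors (averaged Sentence-BERT sentence embeddings) are already computed, and the penalty weight `lam` is a hypothetical parameter, not a value from this paper. The resulting matrix could then be passed to, e.g., scikit-learn's `AffinityPropagation(affinity='precomputed')`.

```python
import numpy as np

def temporal_similarity(doc_vecs, pub_days, lam=0.1):
    """Pairwise cosine similarity between document vectors, penalised
    by the gap (in days) between their publication dates."""
    v = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    cos = v @ v.T                                  # cosine similarity
    gaps = np.abs(pub_days[:, None] - pub_days[None, :])
    return cos - lam * gaps                        # distant dates -> lower similarity

# Toy example: articles 0 and 1 are near-identical semantically
# but published 30 days apart; article 2 is close in time to article 0.
vecs = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
days = np.array([0, 30, 1])
S = temporal_similarity(vecs, days)
```

After the penalty, the two semantically similar articles published a month apart end up less similar than a temporally close pair, so the clustering is discouraged from merging them into one sub-event.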

Reinforcement Learning in Natural Language Generation
Reinforcement learning searches for an optimal solution to a Markov Decision Process. It is commonly used to formulate and model sequential decision-making problems, and summarisation can be modelled as one. In abstractive summarisation tasks, the summariser is regarded as the agent: it performs actions by sequentially selecting tokens (words and punctuation) from the vocabulary and appending them to the text generated so far. In this paper, we choose PEGASUS as our summariser. Prior works (Peters et al., 2019; Mesgar et al., 2021) have shown that fine-tuning only the language head of a model achieves competitive performance compared to fine-tuning the whole model, and is more computationally efficient. Therefore, we only tune the last linear layer of the model via reinforcement learning, as highlighted in red in Figure 2. A reinforcement learning problem is defined by a tuple (S, A, P, R, T), whose elements are defined in this scenario as follows: • S: The set of states. A state s ∈ S is a fixed-length vector encoded and padded by the tokeniser, representing the summary generated so far.
• A: The set of actions, which is usually the vocabulary. The action a ∈ A is the token_id in the vocabulary.
• P : S × A × S → R. The transition probability P(s′ | s, a) represents the probability of transitioning to a new state s′ from state s by performing action a.
• R : S × A → R. The reward function R(s, a) represents the immediate reward received for performing action a in state s.
• T : The set of terminal states. T ⊆ S contains all the states s that contain an end token.
Another important concept in reinforcement learning is the policy π : S × A → R, which gives the probability of performing action a in state s. Given these definitions, the agent (summariser) aims to learn an optimal policy π* that maximises the expected cumulative reward: π* = argmax_π E_π [ Σ_t R(s_t, a_t) ].

Methods
Our main methodological contribution is to integrate an iterative interaction process into the event summarisation step of the TLS system. Figure 3 shows the workflow of the proposed method, which is adapted from CLUST.

Notations
We denote the summary generated so far as s. Tokens in s are written s_i, i.e., s = (s_1, s_2, . . . , s_l), where l is the maximum length of the output. The user's feedback is denoted M, N and P, respectively the wanted keywords, the unwanted keywords and the preferences. D denotes the input documents.

Interactive Learning
The interaction process starts after the event detection stage is complete. It allows the user to preview the draft timeline and provide feedback, as shown in Figure 4. Currently, the system accepts three feedback types: wanted keywords, unwanted keywords, and preferences between pairs of draft timelines. The user can directly input the two types of keywords and use a 0/1 label to indicate which of two drafts is preferred. The keyword feedback captures specific details the user wants added to or removed from the current timeline, while the preference label captures more complex requirements, such as rhetoric and writing style, which are difficult to formulate explicitly.
The received feedback is used to update the reward function, which is then used in reinforcement learning to fine-tune the summariser. The system then generates a new timeline to start another round of interaction with the user. The system does not require many rounds of interactive learning to generate tailored timelines, because the rewards can be learned from small amounts of feedback and then used for many episodes of offline reinforcement learning.
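The interaction loop described above can be sketched as follows. Everything here is a stub standing in for the real components (the reward update, the offline RL episodes, and the summariser); only the control flow (preview → feedback → reward update → fine-tuning → regeneration) is meant to be illustrative.

```python
def interactive_tls(clusters, rounds=3, episodes=100, get_feedback=None):
    """Sketch of the interaction loop: each round collects feedback,
    updates the reward state, runs offline RL, and regenerates the draft."""
    reward_state = {"wanted": set(), "unwanted": set(), "prefs": []}
    timeline = [f"summary of {c}" for c in clusters]       # stub draft
    history = []
    for _ in range(rounds):
        # Stub feedback: a real user would supply keywords and labels.
        fb = get_feedback() if get_feedback else {
            "wanted": {"vaccine"}, "unwanted": set(), "prefs": []}
        reward_state["wanted"] |= fb["wanted"]
        reward_state["unwanted"] |= fb["unwanted"]
        reward_state["prefs"] += fb["prefs"]
        for _ in range(episodes):
            pass  # offline RL episodes using the updated reward function
        timeline = [s + " (refined)" for s in timeline]    # stub regeneration
        history.append(list(timeline))
    return timeline, history

final, hist = interactive_tls(["c1", "c2"], rounds=2, episodes=0)
```

The key design point mirrored here is that the expensive inner loop (RL episodes) runs offline against the learned reward, so the user is only consulted once per round.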

Reward Functions
The feedback input by the human user is not directly usable by the agent. Therefore, a compound reward function is defined to convert the feedback into numerical signals. Other necessary properties of a good timeline summary are also formulated as signals to safeguard the quality of the summary.
R1 aims to ensure the factual consistency of the summary; in other words, the generated summary should be semantically highly similar to the source documents. To capture semantic information, we obtain sentence embeddings with paraphrase-DistilRoBERTa-v1 (Reimers and Gurevych, 2019) and use the average sentence vectors to represent the summary and the input documents in the semantic space. Prior work has shown that cosine similarity is a reliable metric of factual consistency for abstractive summarisation (Zhang et al., 2019). We therefore define R1 as the cosine similarity between the two average vectors: R1 = cos(v_s, v_D), where v_s and v_D are the average sentence vectors of the summary and the input documents.

R2 measures the coherence between the summary and the user's feedback, and updates automatically when new feedback is received from the user. It is a weighted sum of three functions. The first two are rewards for topical coherence between the keyword feedback and the output; they are computed in the same way as R1. For the third, we use GPPL as our preference learning model to map pairwise labels to scores reflecting user preference. The weights of the three functions sum to 1 and are chosen by tuning on a validation set.
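Under this definition, R1 reduces to the cosine similarity between the mean sentence vector of the summary and that of the source documents. A minimal sketch, assuming the per-sentence embeddings have already been produced (e.g., by paraphrase-DistilRoBERTa-v1):

```python
import numpy as np

def factual_consistency_reward(summary_sent_vecs, doc_sent_vecs):
    """R1: cosine similarity between the average sentence embedding of
    the generated summary and that of the input documents."""
    s = np.mean(summary_sent_vecs, axis=0)
    d = np.mean(doc_sent_vecs, axis=0)
    return float(s @ d / (np.linalg.norm(s) * np.linalg.norm(d)))

# A summary whose embeddings match the documents' scores the maximum, 1.0.
same = np.array([[1.0, 0.0], [0.0, 1.0]])
r_identical = factual_consistency_reward(same, same)
```

The two keyword-coherence terms of R2 can be computed the same way, with the embedded keyword lists in place of the document embeddings.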
R3 quantifies the linguistic fluency of the summary: R3 = 1 − N(s)/α, where N(·) is the negative log-likelihood loss calculated by a language model and α is the maximum of N(·), which normalises the sub-reward function.
R4 penalises the generation of repeated unigrams. It is well known that reinforcement learning agents sometimes generate repeated tokens if they find that certain tokens receive high rewards. We therefore define R4 as follows: R4 = 1 − #repeated_tokens / #tokens (4). R1 and R2 encourage the system to generate summaries with correct and required content, while R3 and R4 contribute to the linguistic quality of the text. The final reward function is a weighted sum that balances these different qualities.
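The repetition penalty of Equation 4 and the final weighted combination can be sketched directly. The weights `lambdas` below are uniform placeholders; the actual weights are tuned, not fixed values from the paper.

```python
def repetition_reward(tokens):
    """R4 = 1 - (#repeated tokens / #tokens): every occurrence of a
    unigram beyond its first counts as repeated."""
    repeated = len(tokens) - len(set(tokens))
    return 1.0 - repeated / len(tokens)

def total_reward(r1, r2, r3, r4, lambdas=(0.25, 0.25, 0.25, 0.25)):
    """Final reward: a weighted sum balancing content rewards (R1, R2)
    against linguistic-quality rewards (R3, R4)."""
    return sum(l * r for l, r in zip(lambdas, (r1, r2, r3, r4)))

# "the cases rose the cases": 5 tokens, 2 repeats -> R4 = 1 - 2/5 = 0.6
r4 = repetition_reward(["the", "cases", "rose", "the", "cases"])
```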

Training Algorithm
The expected reward can be written as L(θ) = E_{π_θ}[ Σ_t R(s_t, a_t) ], where θ denotes the parameters of the policy function π. Mnih et al. (2016) gave the gradient of L with respect to θ as ∇_θ L = E_{π_θ}[ ∇_θ log π_θ(a|s) R(s, a) ]. Mnih et al. (2016) also pointed out that a modified policy gradient method, Actor-Critic, can be applied to fine-tune the policy function π_θ. It approximates not only the policy function but also the state-value function. The advantage function δ is introduced to compute how much better the value of action a is, in the current state s, than the average: δ = R(s, a) + γ v̂(s′; w) − v̂(s; w), where γ is the discount rate, v̂ is the critic function estimating the true value function, and w is the critic's parameter vector. Optimising the parameters θ by multiplying the advantage δ with the gradient increases the likelihood of choosing actions with a positive influence on the final reward. Actor-Critic can thereby reduce the variance of sampling from the policy distribution.
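The advantage computation at the heart of the Actor-Critic update is a one-step temporal-difference error. A minimal sketch, with the critic v̂ passed in as a plain function (the actual system uses a three-layer network):

```python
def advantage(reward, state, next_state, v, gamma=0.99, terminal=False):
    """delta = R(s, a) + gamma * v(s') - v(s); the bootstrap term is
    dropped when s' is a terminal state."""
    bootstrap = 0.0 if terminal else gamma * v(next_state)
    return reward + bootstrap - v(state)

# Toy critic that values every state at 0.5:
v = lambda s: 0.5
d = advantage(1.0, "s", "s_next", v, gamma=0.9)  # 1.0 + 0.9*0.5 - 0.5 = 0.95
```

In the actor update, the log-probability gradient of the chosen token is scaled by δ; the critic itself is trained on the squared error between its estimate and the observed return.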
In this paper, we use PEGASUS as the actor and a three-layer fully connected neural network as the critic. The critic's parameters w are updated using the squared error between its predicted rewards and the values given by the reward function.

Timeline17 (Tran et al., 2013) is the benchmark dataset in the timeline summarisation field. It contains 19 manually created timelines with their associated news articles. Each timeline and its news documents come from a major news agency, such as the BBC or Fox News. The characteristics of Timeline17 are shown in Table 1.

Implementation
To enable a fair comparison with the baseline, we keep the original experiment settings where possible. Our main settings are as follows: • The dataset is split into several tasks according to the number of reference timelines. Each time, the system summarises one timeline from the related document collection and compares it to the reference timeline. The final results are averaged over all tasks.
• Two tasks are selected as the development set to tune the hyperparameters manually.
• Article headlines are not used.
• The length of each generated timeline is limited to be at most that of the corresponding reference timeline.
The previous sections mentioned that PEGASUS is chosen as the abstractive summariser (the reinforcement learning agent). PEGASUS is released in several fine-tuned variants; we choose PEGASUS-Multi_News in our implementation because it is tuned on news datasets from a domain similar to ours.

Timeline Evaluation Metrics
ROUGE (Lin, 2004) is one of the most commonly used evaluation metrics in summarisation tasks. However, directly applying ROUGE on timeline summaries neglects the influence of temporal information. Martschat and Markert (2017) proposed an adapted ROUGE based on the date alignment mechanism, which is the most commonly used evaluation metric in TLS.
Alignment ROUGE finds a mapping from the output timeline's dates to the reference dates that minimises the distance between mapped pairs. ROUGE scores are then computed on the paired summaries, and their average is used as the final score.
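One simple way to realise such a date mapping, sketched below, greedily assigns each predicted date to its nearest reference date; the actual metric of Martschat and Markert (2017) defines the alignment more carefully, so this is only an illustration of the idea.

```python
from datetime import date

def align_dates(pred_dates, ref_dates):
    """Greedily map each predicted date to the closest reference
    date (measured in days); ROUGE is then computed per mapped pair."""
    return {p: min(ref_dates, key=lambda r: abs((p - r).days))
            for p in pred_dates}

mapping = align_dates(
    [date(2009, 4, 27), date(2009, 6, 12)],   # system-selected dates
    [date(2009, 4, 25), date(2009, 6, 11)],   # reference dates
)
```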

Simulated Interaction
Since the system requires at least half an hour to finish the training process for a topic, we design a simulated user to provide feedback while developing the system.
Keywords We use TF-IDF (Jones, 1972) to represent the importance of each word. The 10 words with the highest TF-IDF values are saved as the wanted-keywords list, and the 10 words with the lowest values are saved as the unwanted list. These keywords are fed to the system in each interaction round.
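The simulated keyword feedback can be sketched with a small self-contained TF-IDF, scoring the words of one document against the collection (a library implementation such as scikit-learn's `TfidfVectorizer` would normally be used instead):

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_idx, k=10):
    """Return the k words of docs[doc_idx] with the highest TF-IDF,
    using tf = count/len(doc) and idf = log(N/df)."""
    tokenised = [d.lower().split() for d in docs]
    df = Counter(w for doc in tokenised for w in set(doc))
    counts = Counter(tokenised[doc_idx])
    n_tokens = len(tokenised[doc_idx])
    scores = {w: (c / n_tokens) * math.log(len(docs) / df[w])
              for w, c in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = ["flu cases rise in mexico",
        "flu vaccine trials begin",
        "markets rise on trade news"]
top = tfidf_keywords(docs, 1, k=2)  # words distinctive of the second doc
```

Words appearing in only one document (like "vaccine") outrank collection-wide words (like "flu"), matching the intuition that the wanted keywords should carry topic-specific detail.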
Preferences We pre-generate several versions of the timeline for each topic and rank them by their Alignment ROUGE score, from which we extract multiple pairwise preference labels. The system can then learn a score function and use its score for the generated summary as the reward.
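Extracting pairwise preference labels from a ranked list of draft timelines is straightforward: every pair of distinct ranks yields one labelled comparison. A sketch, using 1 to mark the first element of each pair as preferred (the 0/1 convention from the interaction scheme):

```python
from itertools import combinations

def preferences_from_ranking(ranked_ids):
    """ranked_ids is ordered best-first; emit (better, worse, label)
    triples for every pair of drafts implied by the ranking."""
    return [(a, b, 1) for a, b in combinations(ranked_ids, 2)]

# Three drafts ranked best-first yield three pairwise labels.
pairs = preferences_from_ranking(["t2", "t0", "t1"])
```

These triples are the kind of pairwise labels a preference model such as GPPL consumes to learn a scoring function.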

Clustering Analysis
In the event detection stage, we replace the clustering algorithm with Affinity Propagation (Frey and Dueck, 2007). Table 2 shows that the number of clusters generated by Affinity Propagation is closer to the average length of the given timelines. Considering the ratio of the average timeline length to the average number of clusters, Markov Clustering discards more than 70% of its clusters, which implies that Affinity Propagation produces clustering results with better granularity. Table 3 shows that Affinity Propagation has higher precision, indicating that the dates it selects are more likely to be correct.

Table 2: Number of clusters produced by each algorithm (MC: 148.06, 38.68; AP: 50.90, 29.17).

Table 3: The precision, recall and F1 score of date selection of the two clustering algorithms. The ground truth dates are from the reference timelines in the dataset.

ROUGE Score
When evaluating the content of timeline summaries, we expect our timelines to correctly cover the main words that appear in the reference timelines. The ROUGE F1 score is therefore the most suitable metric, because it reflects the precision and recall of the overlap in a balanced manner. Table 4 shows the results of the proposed TLS system and CLUST. r = 0 means that the summariser summarises the clusters directly, without any fine-tuning; otherwise, each interaction round comprises several episodes of reinforcement learning. The zero-interaction setting outperforms the baseline on Alignment ROUGE1-F1 and achieves a competitive Alignment ROUGE2-F1 score, indicating that the replacements we made in the event detection stage are effective. With a suitable number of interaction rounds, the system outperforms the baseline, with a large increase in the Alignment ROUGE2-F1 score.

Table 4: Results of the simulated reinforcement learning based interactive TLS system. The agent samples multiple episodes and learns from them over all clusters in each interaction round.

User Study
To further understand the performance of our system, we examined the content it generated and compared it with the baseline's output; the comparison is given in Appendix A due to the page limit. Besides the numerical evaluation and the content analysis, we also invited real human users to rank the timelines. All participants were asked to read the timelines generated by the system and rank them from four perspectives: Relevance How relevant the content is to the topic.
Redundancy How much information is repeated in the timeline.
Text Legibility How easy it is to read the timeline.
Informativeness How much information the timeline contains.
Since evaluating text quality is complex and, in some cases, subjective, an overall ranking is also collected, in which participants can apply their own standards to evaluate the timelines. Table 5 shows the results from 10 participants.
The user study shows that the timeline generated by the r = 2 system receives the highest ranking, indicating that our algorithm improves the system's performance given a suitable number of interaction rounds. r = 1 and r = 3 have advantages in different aspects. The relevance results reveal that one interaction round is not sufficient to steer the summariser towards content that satisfies the user's interests.
However, after learning from too many episodes, the system starts to game the reward function, even though we implemented the repeated-token penalty via Equation 4. It generates repeated phrases with slight variations and fabricates facts to extract high rewards from R1 and R2, accepting tolerably low R3 and R4 as a trade-off, as shown in Figure 5. These repetitions crowd out less important information nuggets, so r = 3 has relatively poor text legibility; its redundancy and informativeness rankings also fall to the bottom. By comparison, r = 0 receives the lowest ranking on the other three aspects, suggesting that interactive learning is effective in improving relevance, legibility and overall quality. Besides ranking the timelines, participants were asked to state whether they are native English speakers. An interesting finding is that native speakers focus more on informativeness, while non-native speakers attach more importance to conciseness, so that they can easily find the important information nuggets. This may cause the two groups to rate the timelines differently, which is a noteworthy aspect for reward function design in future work.

Conclusion
We propose a new reinforcement learning based interactive timeline summarisation system that generates tailored timeline summaries according to the user's interests without referring to gold-standard data. Our system learns from both preference labels and keyword feedback, enabling it to learn both topical and holistic preferences. We design a compound reward function for the reinforcement learning phase to maximise output quality: it automatically translates the user's feedback into summary evaluation functions, enabling the system to leverage small amounts of user feedback to guide fine-tuning. We simulated users' feedback and compared the system's performance against the baseline; our proposed system outperforms the baseline in the simulation experiments. Further analysis suggests that the reinforcement learning based interaction process shifts the summaries' content towards the user's interests, and the human evaluation results show that real users prefer the summaries after fine-tuning.

Limitations
Although our proposed system outperforms the strong baseline and obtains statistically significant user study results, its limitations are noteworthy because they indicate potential directions for future research.
Small user study group Because the user study requires participants to read large amounts of documents, we only obtained results from 10 volunteers, a very small group. Therefore, although we identified differences between native and non-native speakers, this finding needs further investigation.
Languages We have so far only explored English summarisation. Other languages, such as Chinese and German, have different grammatical structures; whether our paradigm is applicable to them still needs verification.
Simulated user feedback Our simulated experiments demonstrate that the reinforcement learning approach can learn from this kind of feedback, but future work will test the system with the keywords that users specify in a real deployment.
Limited scale of the dataset Because generating a timeline dataset is expensive, we could only conduct our experiments on a small dataset. Although our findings are statistically significant, it would be more convincing to verify our work on a larger dataset.
Reward function design In Figure 5, we show that our system still learned to obtain rewards by gaming the reward function. This indicates that a more sophisticated reward function is needed to prevent the summariser from learning such unacceptable behaviours. Other aspects, such as whether the user is a native English speaker, should also be considered when designing the reward function.