Event Graph based Sentence Fusion

Sentence fusion is a conditional generation task that merges several related sentences into a coherent one, which can serve as a summary sentence. The importance of sentence fusion has long been recognized in natural language generation, especially in text summarization, yet it remains challenging for state-of-the-art neural abstractive summarization models to generate well-integrated summary sentences. In this paper, we explore effective sentence fusion methods in the context of text summarization. We propose to build an event graph from the input sentences to capture and organize related events in a structured way, and we use the constructed event graph to guide sentence fusion. In addition to attending over the content of sentences and graph nodes, we further develop a graph flow attention mechanism to control the fusion process via the graph structure. When evaluated on sentence fusion data built from two summarization datasets, CNN/DailyMail and Multi-News, our model achieves state-of-the-art performance in terms of ROUGE and other metrics such as fusion rate and faithfulness.


Introduction
Sentence fusion aims to combine several related sentences into a single coherent text. It is important in many NLP tasks such as text summarization, question answering and retrieval-based dialogue. In text summarization, it is common practice for a proficient editor to fuse information from several related sentences; however, it remains challenging for state-of-the-art neural abstractive summarization models to achieve effective sentence fusion. As pointed out by Lebanoff et al. (2019a), human-written summaries on the CNN/DailyMail dataset contain 32% fusion sentences, while only 6% of the summary sentences generated by the Pointer-Generator model (See et al., 2017) fuse information spread over multiple sentences. Besides, without proper guidance, many sentences generated by fusion contain factual errors. Therefore, it is worthwhile to explore effective sentence fusion methods in the context of text summarization.
In fact, the importance of sentence fusion has long been recognized by researchers in the text summarization community. As shown in Figure 1, researchers have been concerned with two types of sentence fusion task in the past: similar sentence fusion and disparate sentence fusion. For similar sentence fusion, a word graph or a dependency tree is often explored to find a coherent fusion path (Marsi and Krahmer, 2005; Filippova and Strube, 2008; Thadani and McKeown, 2013). For disparate sentence fusion, coreference relations are typically considered the key to tie the sentences together (Lebanoff et al., 2020a,b). Although both types of sentence fusion benefit text summarization, especially multi-document summarization, solutions that deal with the two types together are rarely proposed. In this paper, we apply structured event information to guide both types of sentence fusion in a unified framework.
We address the challenge of sentence fusion by building an event graph to capture the semantic relationships among the input sentences. The event graph is a directed graph whose nodes represent predicates and event arguments and whose edges connect these event components. Compared to a word graph or a dependency tree, the event graph provides more informative event-level (or entity-level) information. Meanwhile, it maintains the semantic integrity of each node, which allows us to add additional edges to represent crucial relationships in disparate sentence fusion such as coreference. Such a structured representation is capable of preserving inherent event information while formulating cross-sentence information such as entity interactions and the proximity of relevant concepts.
With the goal of guiding sentence fusion, we develop a decoder that utilizes information from both the sentence sequence and the event graph, equipped with different attention mechanisms. We employ sequence attention and graph attention to determine which information is important to select when generating the appropriate token at each decoding step. Note that sentence fusion requires not only selecting the right salient information but also organizing the selected information logically and in order; otherwise, models tend to randomly combine key event components or simply copy the most important text span. To this end, we develop a graph flow attention to explore potential fusion paths via the graph structure and control the fusion process. Moreover, avoiding factual errors in the fused sentence is also a critical issue in sentence fusion. Inspired by Scialom et al. (2020), we incorporate faithful beam search at the inference stage to reduce possible factual errors. This allows the model to remove unfaithful candidate output sequences during generation by refining the generation probability with a faithfulness score.
Since there is no available dataset to evaluate the effectiveness of sentence fusion models in the context of text summarization, following previous work (Lebanoff et al., 2020b), we automatically generate sentence fusion data from summarization datasets, namely CNN/DailyMail (Hermann et al., 2015) and Multi-News (Fabbri et al., 2019). The experiments show that our proposed model indeed improves ROUGE scores as well as other metrics such as faithfulness and fusion rate. The contributions of our work can be summarized as follows: (1) We propose a model that addresses both similar sentence fusion and disparate sentence fusion, which are critical for abstractive summarization.
(2) We build an event graph to guide sentence fusion, which allows our model to utilize the structural event information and various cross-sentence relations.
(3) We innovatively apply a graph flow attention to control the fusion process via the graph structure.

Sentence Fusion in Text Summarization
Sentence fusion has been considered an essential step for generating abstractive summaries. Its importance has long been recognized in traditional text summarization research (Barzilay et al., 1999). Early attempts mainly focus on fusing a set of similar sentences (Marsi and Krahmer, 2005; Filippova and Strube, 2008; Elsner and Santhanam, 2011; Thadani and McKeown, 2013). They often build a dependency graph or a word graph from multiple similar sentences, and then adopt linear programming to generate the fused sentence from the graph. Recently, Lebanoff et al. (2019a) conducted a comprehensive analysis of sentence fusion in neural abstractive summarization and found that it remains a challenge for current state-of-the-art models. To address this problem, Lebanoff et al. (2020a,b) propose to utilize points of correspondence between sentences to fuse disparate sentences, and develop a transformer enhanced with links between co-referred entities. Similar to the above-mentioned works, our research also focuses on sentence fusion in the context of text summarization.
Moving beyond sentence fusion alone, Mehdad et al. (2013) and Lebanoff et al. (2019b) discuss potential application scenarios for enhancing text summarization with sentence fusion. Their models follow a similar framework that first extracts a few related sentences from the source document and then fuses them to obtain a summary sentence. Our model can be considered a better replacement for the fusion model in such a framework.

[Figure excerpt — example source sentence: "CBS News correspondent Julianna Goldman reports from Washington that President Obama didn't talk military planning Friday night when he met with Democratic donors ..."]

Event-aware Generation Model
In conditional generation tasks such as text summarization and question answering, the source documents are usually composed of a series of events, so understanding how to leverage event information in generation models becomes crucial. Moryossef et al. (2019) learn to generate a fluent sentence from an input subject-verb-object triple that describes an event. Huang et al. (2020) transfer event triples extracted with OpenIE to an event graph to acquire a semantic interpretation over the input and assist text summarization. Zheng and Kordjamshidi (2020) adopt an event graph to understand the path of multi-hop reasoning in question answering. To control the generation process and avoid factual errors, Cao et al. (2017) propose an additional event relation encoder to produce representations of event triples. Considering the importance of the relations between events in sentence fusion and inspired by the above-mentioned works, we adopt the event graph to guide sentence fusion.

Method
Our sentence fusion model follows the typical encoder-decoder architecture, as shown in Figure 2. It is composed of a joint encoder that produces representations of both the source sentences and the event graph, and a decoder that incorporates information from the source sentences and the event graph to generate a fused sentence.

Event Graph Construction
The event graph is built to capture the semantic relationships in the source sentences. We utilize AllenNLP-OpenIE (Stanovsky et al., 2018) to extract a set of events, where each event is composed of a predicate and an arbitrary number of arguments. When two events overlap, only the longer one is retained. These predicates and arguments are represented as nodes in the event graph; when two nodes share the same content, we merge them into one. The graph is directed, with two types of edges (a minimal construction sketch follows the edge definitions below).
(1) Directional edges connect a predicate and its corresponding arguments in an event and the direction follows the order of subject to predicate and predicate to other arguments.
(2) Bi-directional edges connect two nodes if they share the same entity or there is a coreference relation between them.
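For illustration only, the construction described above could be sketched as follows. The extract_events wrapper around the OpenIE extractor and the same_entity / corefer predicates are hypothetical helpers, not part of the original system, and the filtering of overlapping events is omitted for brevity:

```python
# Illustrative sketch of event graph construction; not the authors' implementation.
import networkx as nx

def build_event_graph(sentences, extract_events, same_entity, corefer):
    graph = nx.DiGraph()
    for sent in sentences:
        # extract_events(sent) is assumed to return (predicate, [arg0, arg1, ...]) tuples.
        for predicate, args in extract_events(sent):
            # Nodes with identical text are merged automatically because
            # networkx identifies nodes by their key.
            graph.add_node(predicate)
            if args:
                subject, rest = args[0], args[1:]
                graph.add_node(subject)
                graph.add_edge(subject, predicate)        # subject -> predicate
                for arg in rest:
                    graph.add_node(arg)
                    graph.add_edge(predicate, arg)        # predicate -> other argument
    # Bi-directional edges for shared entities or coreference relations.
    nodes = list(graph.nodes)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if same_entity(u, v) or corefer(u, v):
                graph.add_edge(u, v)
                graph.add_edge(v, u)
    return graph
```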

Encoder
We apply a BERT-based encoder to jointly generate contextualized representations of the tokens in concatenated input sentences and the nodes in the event graph. Each node is represented by a special [cls] token and the output representation of this token is considered as the representation of the node. The input of our encoder is the concatenation of sentence tokens and a set of graph node tokens.
Since each node corresponds to only a few words in the input sentences, a node token is attended to only by the sentence tokens that belong to that node in the attention layers of BERT. To distinguish the two kinds of tokens, we assign two different segment embeddings to sentence tokens and node tokens. Since there is no sequential relationship between nodes, we initialize the positional embeddings of node tokens with a special pad embedding.
We use an additional mask matrix M, similar to the one presented in (Yuan et al., 2020), to control the attention of the BERT-based encoder. M_ij = 0 means token i is allowed to attend to token j, while M_ij = −∞ prohibits i from attending to j. In our model, three situations are allowed: (1) a sentence token attends to all other sentence tokens; (2) a sentence token attends to its corresponding graph node token; (3) a node token attends to adjacent nodes on the event graph. After defining the mask matrix M, we calculate attention with Equation (1) below, where Q, K and V refer to the query, key and value matrices, respectively, and d_k is a scaling factor:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V \qquad (1)$$

In our preliminary study, we also considered using a graph neural network as the encoder for the event graph, but we found that the current approach achieves better results.
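For illustration, the mask M described above could be assembled as in the following sketch. The token2node alignment and node_adjacency list are assumed pre-computed structures; this is a sketch under those assumptions, not necessarily the authors' implementation:

```python
import torch

NEG_INF = float("-inf")

def build_attention_mask(num_sent_tokens, num_node_tokens, token2node, node_adjacency):
    """M[i, j] = 0 allows token i to attend to token j; M[i, j] = -inf forbids it.
    token2node[i] gives the node index of sentence token i (or None);
    node_adjacency[k] lists the nodes adjacent to node k in the event graph."""
    n = num_sent_tokens + num_node_tokens
    M = torch.full((n, n), NEG_INF)

    # (1) A sentence token attends to all other sentence tokens.
    M[:num_sent_tokens, :num_sent_tokens] = 0.0

    # (2) A sentence token attends to its corresponding graph node token.
    for i, node in enumerate(token2node):
        if node is not None:
            M[i, num_sent_tokens + node] = 0.0

    # (3) A node token attends to itself and to adjacent nodes on the event graph.
    for k, neighbours in enumerate(node_adjacency):
        row = num_sent_tokens + k
        M[row, row] = 0.0
        for nb in neighbours:
            M[row, num_sent_tokens + nb] = 0.0
    return M
```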

Decoder
Overview of the Decoder. The decoder aims to generate the fused sentence by utilizing both the (sentence) sequence information and the (event) graph information. We employ a one-layer LSTM as the decoder, with hidden state s_t at step t. The decoder generates tokens recurrently based on three types of attention: the sequence attention, the graph attention and the graph flow attention.
Sequence Attention. At each decoding step t, we calculate the context vector c s t over a sequence of input sentences using the attention mechanism proposed in (Bahdanau et al., 2014). We also employ a coverage mechanism to avoid redundancy.
$$e^t_k = v^{\top}\tanh\!\left(W_h h_k + W_s s_t + W_c \mathrm{Cov}_k + b\right), \quad a^t = \mathrm{softmax}(e^t), \quad c^s_t = \sum_k a^t_k h_k \qquad (3)$$

where h_k represents the token representation obtained from the encoder and Cov refers to the coverage vector generated at the previous step.
Graph Attention. The graph attention applies a mechanism analogous to the sequence attention, but over the node embeddings v_i and the current hidden state s_t, to compute attention scores. The graph vector c^g_t is computed as the attention-weighted sum of the node embeddings.
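As a rough PyTorch sketch (the exact parameterization is our assumption), the additive attention with coverage on the sequence side could look as follows; the graph attention applies the same form over node embeddings, without the coverage term:

```python
import torch
import torch.nn as nn

class CoverageAttention(nn.Module):
    """Additive (Bahdanau-style) attention with a coverage term; illustrative only."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)   # encoder states
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=True)    # decoder state
        self.w_c = nn.Linear(1, attn_dim, bias=False)         # coverage feature
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h, s_t, coverage):
        # h: (batch, src_len, enc_dim); s_t: (batch, dec_dim); coverage: (batch, src_len)
        scores = self.v(torch.tanh(
            self.W_h(h) + self.W_s(s_t).unsqueeze(1) + self.w_c(coverage.unsqueeze(-1))
        )).squeeze(-1)                                         # (batch, src_len)
        attn = torch.softmax(scores, dim=-1)
        context = torch.bmm(attn.unsqueeze(1), h).squeeze(1)   # context vector c^s_t
        return context, attn, coverage + attn                  # updated coverage
```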
Graph Flow Attention. When the graph structure is ignored during decoding, the graph attention tends to reflect the importance of individual nodes rather than the connections between nodes. We therefore propose a novel graph flow attention to explore potential fusion paths by capturing the content coherence embedded in the graph structure. The graph flow attention is designed to inherit the attention tendency over nodes from the previous decoding step and to focus on neighboring nodes at the current step. The attention tendency over nodes is expected to be strongly correlated with the output of the decoder; in this way, the model can maintain coherence between the generated tokens and the nodes focused on by the graph flow attention. Since the graph attention is not fully synchronized with the decoding process, it may focus on one node at one step and then jump to another node far away at the next step. Therefore, we compute the distribution of attention tendency over nodes at the previous step, a^p_{t-1}, from the sequence attention at the previous decoding step. Let Map ∈ R^{i×j} be the mapping matrix between tokens and nodes, where Map_ij = 1 denotes that the i-th token in the source sequence belongs to the j-th node of the event graph. Then a^p_{t-1} is calculated as

$$a^p_{t-1} = \mathrm{norm}\!\left(Map^{\top} a_{t-1}\right)$$

where a_{t-1} is the sequence attention distribution at step t−1 and norm(·) re-normalizes the result into a distribution over nodes.
Given the adjacency matrix A of the event graph, where the i-th row is normalized by the in-degree of node i, the graph flow attention transmits a^p_{t-1} in the following three ways, as shown in Figure 3: (1) Remain in the previous node: f_{t,0} = a^p_{t-1}. Since one node usually contains multiple tokens, the model may focus on the same node for several steps.
(2) Move one step: f_{t,1} = A a^p_{t-1}. For example, the attention moves from one node to its neighbor.
(3) Move two steps: f_{t,2} = A^2 a^p_{t-1}. The attention is allowed to skip a middle connecting node.
The graph flow attention is then the weighted sum of the three flows, controlled by a dynamic gate Gate_t ∈ R^{1×3}, and the graph flow vector c^f_t is computed as

$$f_t = \sum_{k=0}^{2} Gate_{t,k}\, f_{t,k}, \qquad c^f_t = \sum_{i} f_{t,i}\, v_i$$

where v_i is the embedding of node i.
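A condensed PyTorch sketch of this computation under the definitions above (the network producing the gate Gate_t is treated as given, and tensor layouts are our assumptions):

```python
import torch

def graph_flow(a_prev, A, gate, v):
    """a_prev: (batch, num_nodes) attention tendency a^p_{t-1};
    A: (num_nodes, num_nodes) adjacency matrix, rows normalized by in-degree;
    gate: (batch, 3) dynamic gate Gate_t; v: (batch, num_nodes, dim) node embeddings."""
    f0 = a_prev                                   # remain in the previous node
    f1 = a_prev @ A.T                             # move one step along an edge
    f2 = a_prev @ (A @ A).T                       # move two steps (skip a middle node)
    flows = torch.stack([f0, f1, f2], dim=1)      # (batch, 3, num_nodes)
    f_t = (gate.unsqueeze(-1) * flows).sum(dim=1)           # weighted sum of the three flows
    c_f = torch.bmm(f_t.unsqueeze(1), v).squeeze(1)         # graph flow vector c^f_t
    return f_t, c_f
```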
Token Prediction. After obtaining the three context vectors from the input sequence and the graph, we regard them as representations of the information summarized from different points of view. They are concatenated with the decoder hidden state s_t to produce the vocabulary distribution D_vocab:

$$D_{vocab} = \mathrm{softmax}\!\left(W_v\,[s_t;\, c^s_t;\, c^g_t;\, c^f_t] + b_v\right)$$
We add a copy mechanism to directly copy words from the source text based on the sequence attention. The copy probability is

$$p_{copy} = \sigma\!\left(w_c^{\top} c^s_t + w_s^{\top} s_t + w_y^{\top} y_{t-1} + b_{copy}\right) \qquad (11)$$

where y_{t-1} denotes the embedding of the token predicted at step t−1.
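For illustration, the copy probability can be combined with D_vocab in a standard pointer-generator-style mixture; the exact mixing used in the model is our assumption here:

```python
import torch

def output_distribution(d_vocab, p_copy, seq_attn, src_token_ids):
    """d_vocab: (batch, vocab_size) generation distribution D_vocab;
    p_copy: (batch, 1) copy probability; seq_attn: (batch, src_len) sequence attention;
    src_token_ids: (batch, src_len) vocabulary ids of the source tokens (LongTensor)."""
    gen_dist = (1.0 - p_copy) * d_vocab
    copy_dist = torch.zeros_like(d_vocab)
    # Scatter the attention mass of each source token onto its vocabulary id.
    copy_dist.scatter_add_(1, src_token_ids, p_copy * seq_attn)
    return gen_dist + copy_dist
```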

Training
Generation Loss. With the generation loss, the training goal is to maximize the estimated probability of the reference sequence. Following most current work, we adopt the maximum likelihood training objective, which minimizes the following loss:

$$L_{seq} = -\sum_{(x,\, y,\, g) \in D} \log p\!\left(y \mid x, g; \theta\right)$$

where θ represents the model parameters and D stands for the training data consisting of source sentences x, reference sequences y, and event graphs g.
KL Loss. Our preliminary study reveals that simply concatenating the graph vector and the graph flow vector in the decoding process fails to achieve good performance. We attribute this to the difficulty of obtaining effective information from two disparate vectors. Therefore, we introduce another training objective that computes the KL divergence between the graph attention and the graph flow attention, so that the two attentions take advantage of each other. The KL loss is shown below, where T is the total number of decoding steps and a^g_t denotes the graph attention distribution at step t:

$$L_{kl} = \frac{1}{T}\sum_{t=1}^{T} \mathrm{KL}\!\left(a^g_t \,\|\, f_t\right)$$
Node Salience Labeling. We further enhance the node representations via a third objective that models the salience of nodes. Its goal is to identify whether the non-stop words in a node are mentioned in the reference fused sentence. We add a classification layer over each node v_i on top of the joint encoder to predict a probability m_i in [0, 1]. During training, the gold label n_i is set to 1 if the node contains at least one non-stop word from the reference, and 0 otherwise. The loss function is

$$L_{node} = -\frac{1}{N_v}\sum_{i=1}^{N_v}\left[\, n_i \log m_i + (1 - n_i)\log(1 - m_i)\,\right]$$

where N_v is the number of nodes in the graph. To summarize, the full training objective consists of three terms: L = L_seq + L_kl + L_node.
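Putting the three objectives together, a rough per-batch sketch might look like the following; the KL direction, the reductions, and the use of pre-sigmoid salience logits are our assumptions:

```python
import torch
import torch.nn.functional as F

def training_loss(log_probs, targets, graph_attn, flow_attn, node_logits, node_labels):
    """log_probs: (T, batch, vocab) decoder log-probabilities; targets: (T, batch) reference ids;
    graph_attn, flow_attn: (T, batch, num_nodes) attention distributions;
    node_logits: (batch, num_nodes) pre-sigmoid salience scores; node_labels: (batch, num_nodes) gold 0/1."""
    # Generation loss: negative log-likelihood of the reference sequence.
    l_seq = F.nll_loss(log_probs.reshape(-1, log_probs.size(-1)), targets.reshape(-1))
    # KL loss between the graph attention and the graph flow attention.
    l_kl = F.kl_div(flow_attn.clamp_min(1e-12).log(), graph_attn, reduction="batchmean")
    # Node salience labeling: binary cross-entropy over the nodes.
    l_node = F.binary_cross_entropy_with_logits(node_logits, node_labels.float())
    return l_seq + l_kl + l_node
```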

Faithful Beam Search
Inspired by (Scialom et al., 2020), we propose faithful beam search to reduce possible factual errors at the inference stage. Given a factual consistency checking model F and a sentence fusion model G, the goal is to re-rank every generated token based on both the generation probability calculated by G and the faithfulness score derived from F. In our work, we adopt the FactCC model developed by (Kryscinski et al., 2020), a BERT-based faithfulness checking model, to evaluate faithfulness. The input to FactCC consists of a hypothesis sentence and several source sentences, and the output is a probability that indicates whether the hypothesis sentence is faithful to the source sentences. Since we need to verify the faithfulness of an incomplete fused sentence during decoding, we make a corresponding change when training FactCC with sentence fusion data: we truncate all fused sentences in positive samples to random lengths, and for negative samples we remove the tokens after the position of the error in the fused sentence. At the inference stage, the objective is to maximize the cumulative probability of the output tokens. At each decoding step, the top-b sequences with the highest probability are carried into the next step, where b stands for the beam size. We add an additional faithfulness score to refine the generation probability during beam search, such that

$$S(y_t) = S(y_{t-1}) + \alpha \log F(x, y) + \log G(x, y_{1:t-1}) \qquad (15)$$

where y refers to the generated sequence, x represents the source sentences and α is a weighting factor. F and G stand for the consistency checking model and the sentence fusion model, respectively. In our experiments, α is set to 0.05.
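A simplified sketch of the re-scoring step inside beam search, following Equation (15); factcc_score and token_log_prob are placeholders standing in for F and G, and the incremental-scoring details are our assumptions:

```python
import math

def rescore_candidates(candidates, source, factcc_score, token_log_prob, alpha=0.05, beam_size=5):
    """candidates: list of (prefix_tokens, prev_score, next_token) triples.
    Returns the top-b extended sequences ranked by
    S(y_t) = S(y_{t-1}) + alpha * log F(x, y) + log G(x, y_{1:t-1})."""
    scored = []
    for prefix, prev_score, token in candidates:
        hypothesis = prefix + [token]
        score = (prev_score
                 + alpha * math.log(max(factcc_score(source, hypothesis), 1e-12))
                 + token_log_prob(source, prefix, token))
        scored.append((hypothesis, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:beam_size]
```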

Experimental Set-Up
Datasets: We follow the practice of (Lebanoff et al., 2019b) and sample sentence fusion data from summarization datasets. We choose the well-known single-document summarization dataset CNN/DailyMail and the multi-document summarization dataset Multi-News for evaluation. From the CNN/DailyMail dataset, the fusion data is obtained directly according to the set of heuristics suggested in (Lebanoff et al., 2020a), which we call CNN/DailyMail Fusion. From the Multi-News dataset, we use a strategy similar to the one proposed in (Lebanoff et al., 2020a) to generate the fusion data, which we call Multi-News Fusion. Note that both sentence fusion datasets have a compression rate of 60-70%; they thus differ from the dataset proposed by (Geva et al., 2019), where the compression rate is lower than 5%. This explains why we create sentence fusion data from summarization datasets rather than using the existing one.
Evaluation Metrics: Sentence fusion can be approximately regarded as multi-sentence summarization. Following common practice, we adopt ROUGE F1 as the basic evaluation metric. We also apply FactCC (Kryscinski et al., 2020) to measure faithfulness; it is trained on the CNN/DailyMail Fusion and Multi-News Fusion datasets following the method presented in the original paper. It achieves 90% accuracy on the test sets of the two sentence fusion datasets, which we believe is reasonably good for our evaluation. Note that it is distinct from the checker used in our faithful beam search, as the fused sentences are not modified during its training. Besides, we also report the results of two additional metrics: (1) fusion rate (Fus), the percentage of fused sentences that contain at least two unique non-stop words from multiple source sentences; and (2) length (Len), the average length of the fused sentences.
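As a sketch, the fusion rate could be computed as follows under one reading of this definition (tokenization and the stop-word list are left abstract and are our assumptions):

```python
def fusion_rate(fused_sentences, source_sentence_sets, tokenize, stop_words):
    """A fused sentence counts as 'fused' if at least two of its source sentences
    each contribute at least one unique non-stop word to it."""
    fused_count = 0
    for fused, sources in zip(fused_sentences, source_sentence_sets):
        fused_tokens = {w for w in tokenize(fused) if w not in stop_words}
        contributing_sources = 0
        for src in sources:
            src_tokens = {w for w in tokenize(src) if w not in stop_words}
            if fused_tokens & src_tokens:
                contributing_sources += 1
        if contributing_sources >= 2:
            fused_count += 1
    return fused_count / max(len(fused_sentences), 1)
```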
Implementation Details: We build the encoder on the BERT-base-uncased version of BERT and employ an LSTM with 768-dimensional hidden states as the decoder. We truncate the input sentences to 150 tokens and limit the decoder to a maximum of 60 steps. The batch size is set to 32 and we train the model for 20 epochs. After training, we select the top-3 checkpoints on the validation set and report the one with the best result on the test set. For inference, the beam size is set to 5 for CNN/DailyMail Fusion and 2 for Multi-News Fusion.

Automatic Evaluation
To examine the effectiveness of our model, we compare it with two widely adopted seq2seq baselines: Pointer-Generator (See et al., 2017) and BERT+LSTM, which is our basic encoder-decoder architecture before integrating the graph information. We also implement the state-of-the-art sentence fusion model for comparison: Transformer-Linking (Lebanoff et al., 2020a) is a BERT-based model proposed for disparate sentence fusion that utilizes coreference relationships between entities to enhance sentence fusion. Since our data can be approximately regarded as multi-sentence summarization, we also adopt a BERT-based document summarization model, BERTSUMABS (Liu and Lapata, 2019), for comparison. Most of these models are trained on the two sentence fusion datasets by ourselves, except that the output of Transformer-Linking is obtained directly from its authors. As shown in Table 2, our proposed model obtains the highest ROUGE scores on the Multi-News Fusion dataset and competitive ROUGE scores on the CNN/DailyMail Fusion dataset. Meanwhile, our model achieves the best performance in fusion rate and faithfulness on both datasets. These results suggest the effectiveness of our model in fusing sentences and its ability to reduce factual errors. We also notice that the transformer decoder has a clear advantage over the LSTM decoder in fusion rate; one possible reason is that the transformer decoder can generate more abstractive sentences, which makes fusion easier. Considering that our model adopts an LSTM-based decoder, we believe the event graph effectively assists the fusion process by providing cross-event connections and reducing the shifting distance between event components.

Ablation Study
To examine the model in more detail, we design an experiment to understand how different components contribute to it. We remove the KL loss, the graph attention and the graph flow attention independently from the full model and report the results in Table 3. On the one hand, we find that the graph flow attention boosts the fusion rate; we believe the flow attention indeed benefits the fusion process by utilizing the graph structure to find possible fusion paths. On the other hand, the graph attention leads to relatively high ROUGE scores but a lower fusion rate, which suggests that although the graph attention does not directly contribute to sentence fusion, it assists in selecting important information from the source sentences. More importantly, when the KL loss is removed, the model performance drops more than with the other two reductions, indicating that the KL loss is essential for our model to take advantage of both attentions.

Human Evaluation
Automatic evaluation results are often not enough to fully reflect the quality of the generated fused sentences. We therefore conduct a human evaluation to analyze unfaithfulness errors and fusion quality. We randomly extract 50 samples from the Multi-News Fusion test set and invite three fluent English speakers as human judges. Given a sentence fusion instance, the judges are asked to answer yes or no to the following three questions: (1) Fluency: whether the generated sentence is grammatically correct and readable; (2) Fusion: whether the generated sentence is produced through sentence fusion; (3) Faithfulness: whether the generated sentence is faithful to the source sentences. Table 5 shows the percentage of yes answers for the three questions. We adopt Fleiss' kappa (Fleiss, 1971) for the inter-annotator agreement test and obtain a score of 0.53. The results show a similar trend to the automatic evaluation: our model achieves the best results in both fusion rate and faithfulness. The performance of BERTSUMABS further indicates that sentence fusion leads to a decline in fluency and more faithfulness errors when there is no proper guidance.
[Table 4: sentence fusion example]
Source: (1) Police identified the rite aid shooter as Snochia Moseley, 26, who lived in the marsh neighborhood of Baltimore. (2) The shooter was found with a self-inflicted gunshot wound and died at an area hospital. (3) The woman died at a nearby hospital after shooting herself in the head.
BERT+LSTM: Police say the shooter as Snochia Moseley, 26, was found with a self-inflicted gunshot wound and died at an area hospital.
BERTSUMABS: The woman, who died at a hospital, was found with a self-inflicted gunshot wound and died at an area hospital.
Our: Snochia Moseley was found with a self-inflicted gunshot wound and died at a nearby hospital after shooting herself in the head.
Reference: Police say the 26-year-old woman, who has not been identified, died of a self-inflicted gunshot wound to the head.
We illustrate a sentence fusion example that contains both similar and disparate sentence fusion in Table 4 (reproduced above). As shown, BERT+LSTM tends to fuse sentences by directly copying text spans from the source text. BERTSUMABS attempts to utilize the coreference relation between "the shooter" and "the woman" to fuse the last two source sentences, but generates redundancy when merging similar content. In contrast, our model successfully fuses the information from all three source sentences, showing that it can effectively handle both types of sentence fusion at the same time.

Application in Text Summarization
We further design an experiment to investigate the effectiveness of the sentence fusion model in text summarization, using the framework from (Lebanoff et al., 2019b). It extracts either a single sentence (no fusion needed) or a pair of sentences (fusion needed), and then rewrites them to produce a summary sentence. Each sentence pair consists of a primary sentence and a secondary sentence that provides complementary information. We use the oracle extractive results as input for the generation experiment. Table 6 shows the summarization results with three different strategies: (1) Oracle: concatenating oracle single sentences and the primary sentences of oracle pairs as the summary; (2) Oracle_all: concatenating oracle single sentences and both sentences of oracle pairs as the summary; (3) Fusion: concatenating oracle single sentences and fused sentences as the summary, where the fused sentences are generated by our model using the oracle pairs as input. All summaries are truncated to 100 words. The results show that the sentence fusion model has the potential to improve the performance of summarization models by fusing information from multiple sentences.

Conclusion
In this paper, we investigate the sentence fusion problem in the context of text summarization by exploring the event graph. Our model captures both node representations and the structural information embodied in the event graph to guide fusion. We further propose a faithful beam search to reduce possible faithfulness errors. The experimental results suggest that the event graph is crucial for effective sentence fusion, and that both node representations and graph structure play important roles. In the future, we would like to further explore the direct incorporation of event information and the sentence fusion model into text summarization.