Coreference-Aware Dialogue Summarization

Summarizing conversations via neural approaches has been gaining research traction lately, yet it is still challenging to obtain practical solutions. Examples of such challenges include unstructured information exchange in dialogues, informal interactions between speakers, and dynamic role changes of speakers as the dialogue evolves. Many such challenges result in complex coreference links. Therefore, in this work, we investigate different approaches to explicitly incorporate coreference information in neural abstractive dialogue summarization models to tackle the aforementioned challenges. Experimental results show that the proposed approaches achieve state-of-the-art performance, implying that utilizing coreference information is beneficial for dialogue summarization. Evaluation results on factual correctness suggest such coreference-aware models are better at tracing the information flow among interlocutors and at associating accurate status/actions with the corresponding interlocutors and person mentions.


Introduction
Text summarization condenses the source content into a shorter version while retaining essential and informative content. Most prior work focuses on summarizing well-organized single-speaker content such as news articles (Hermann et al., 2015) and encyclopedia documents (Liu* et al., 2018). Recently, text summarization models have benefited from sophisticated neural architectures and pre-trained contextualized language backbones: on the popular benchmark corpus CNN/Daily Mail (Hermann et al., 2015), Liu and Lapata (2019) explored fine-tuning BERT (Devlin et al., 2019) to achieve state-of-the-art performance for extractive news summarization, and BART (Lewis et al., 2020) has also improved generation quality on abstractive summarization.

Figure 1: The original conversation (in grey) is abbreviated; the summary generated by a baseline model is in blue; the summary generated by a coreference-aware model is in orange. While these two summaries obtain similar ROUGE scores, the summary from the baseline model is not factually correct; errors are highlighted in italic and magenta.
While there has been substantial progress on document summarization, dialogue summarization has received less attention. Unlike documents, conversations are interactions among multiple speakers; they are less structured and are interspersed with more informal linguistic usage (Sacks et al., 1978). Based on the characteristics of human-to-human conversations (Jurafsky and Martin, 2008), the challenges of summarizing dialogues stem from: (1) Multiple speakers: the interactive information exchange among interlocutors means that essential information is referred to back and forth across speakers and dialogue turns; (2) Speaker role shifting: multi-turn dialogues often involve frequent role shifting from one type of interlocutor to another (e.g., questioner becomes responder and vice versa); (3) Ubiquitous referring expressions: aside from speakers referring to themselves and each other, speakers also mention third-party persons, concepts, and objects. Moreover, referring can take forms such as anaphora or cataphora where pronouns are used, making coreference chains more elusive to track. Figure 1 shows one dialogue example: two speakers exchange information across interactive turns, where the pronoun "them" is used multiple times, referring to the word "sites". Without sufficient understanding of the coreference information, the base summarizer fails to link mentions with their antecedents and produces an incorrect description (highlighted in magenta and italic) in the generation. Given these linguistic characteristics, dialogues possess multiple inherent sources of complex coreference, motivating us to explicitly consider coreference information for dialogue summarization in order to model the context more appropriately, to track the interactive information flow throughout a conversation more dynamically, and to enable the potential of multi-hop dialogue reasoning.
Previous work on dialogue summarization focuses on modeling conversation topics or dialogue acts (Goo and Chen, 2018; Chen and Yang, 2020). Few, if any, leverage features from coreference information explicitly. On the other hand, large-scale pre-trained language models are shown to only implicitly model lower-level linguistic knowledge such as part-of-speech and syntactic structure (Tenney et al., 2019; Jawahar et al., 2019). Without directly training on tasks that provide specific and explicit linguistic annotation such as coreference resolution or semantics-related reasoning, model performance remains subpar for language generation tasks (Dasigi et al., 2019). Therefore, in this paper, we propose to improve abstractive dialogue summarization by explicitly incorporating coreference information. Since entities are linked to each other in coreference chains, we postulate that adding a graph neural layer could readily characterize the underlying structure, thus enhancing the contextualized representation. We further explore two parameter-efficient approaches: one with an additional coreference-guided attention layer, and the other resourcefully enhancing BART's limited coreference resolution capabilities by conducting a probing analysis to augment our coreference injection design.
Experiments on SAMSum (Gliwa et al., 2019) show that the proposed methods achieve state-of-the-art performance. Furthermore, human evaluation and error analysis suggest our models generate more factually consistent summaries. As shown in Figure 1, a model guided with coreference information accurately associates events with their corresponding subjects, and generates more trustworthy summaries compared with the baseline.

Related Work
In abstractive text summarization, recent studies mainly focus on neural approaches. Rush et al. (2015) proposed an attention-based neural summarizer with sequence-to-sequence generation. Pointer-generator networks (See et al., 2017) were designed to directly copy words from the source content, which resolved out-of-vocabulary issues. Liu and Lapata (2019) leveraged the pre-trained language model BERT (Devlin et al., 2019) for both extractive and abstractive summarization. Lewis et al. (2020) proposed BART, taking advantage of the bi-directional encoder in BERT and the auto-regressive decoder of GPT (Radford et al., 2018) to obtain impressive results on language generation. While many prior studies focus on summarizing well-organized text such as news articles (Hermann et al., 2015), dialogue summarization has been gaining traction. Shang et al. (2018) proposed an unsupervised multi-sentence compression method for meeting summarization. Goo and Chen (2018) introduced a sentence-gated mechanism to grasp the relations between dialogue acts. Other work proposed to utilize topic segmentation and turn-level information for conversational tasks. Zhao et al. (2019) proposed a neural model with a hierarchical encoder and a reinforced decoder to generate meeting summaries. Chen and Yang (2020) used diverse conversational structures like topic segments and conversational stages to design a multi-view summarizer, and achieved the current state-of-the-art performance on the SAMSum corpus (Gliwa et al., 2019).
Improving factual correctness has received keen attention in neural abstractive summarization lately. Cao et al. (2018) leveraged dependency parsing and open information extraction to enhance the reliability of generated summaries. Zhu et al. (2021) proposed a factual corrector model based on knowledge graphs, significantly improving factual correctness in text summarization.

Figure 2: Examples of three common issues in adopting a document coreference resolution model for dialogues without additional domain adaptation training. Spans in blocks are items in coreference clusters with their cluster ID number. We highlight some spans for better readability.

Dialogue Coreference Resolution
Since the common summarization datasets do not contain coreference annotations, automatic coreference resolution is needed to process the samples. Neural approaches (Joshi et al., 2020) have shown impressive performance on document coreference resolution. However, they are still sub-optimal for conversational scenarios (Chen et al., 2017), and there are no large-scale annotated dialogue corpora for transfer learning. When applying a document coreference resolution model (Lee et al., 2018; Joshi et al., 2020) to dialogue samples without domain adaptation,^1 we observed some common issues, as shown in Figure 2: (1) each dialogue utterance starts with a speaker, but sometimes this speaker is not recognized as a coreference-related entity and is thus not added to any coreference cluster; (2) in dialogues, coreference chains often span multiple turns, but sometimes they are split into multiple clusters; (3) when a dialogue contains multiple coreference chains across multiple turns, speaker entities can be wrongly clustered.
Based on these observations, to improve the overall quality of dialogue coreference resolution, we conducted data post-processing on the automatic output: (1) first, we applied a model ensemble strategy to obtain more accurate cluster predictions; (2) then, we re-assigned coreference cluster labels to the words with speaker roles that were not included in any chain; (3) moreover, we compared the clusters and merged those that presented the same coreference chain. Human evaluation on the processed data showed that this post-processing reduced incorrect coreference assignments by approximately 19%.^2

^1 The off-the-shelf coreference resolution model we used is allennlp-public-models/coref-spanbert-large-2021.03.10, which is trained on the OntoNotes 5.0 dataset.
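To make these steps concrete, here is a minimal Python sketch of steps (2) and (3); the model ensemble in step (1) is model-specific and omitted. Clusters are represented as lists of (start, end) token spans, and all names are illustrative rather than taken from our implementation.

```python
def postprocess_clusters(clusters, speaker_spans):
    """Illustrative post-processing of automatic coreference output.

    clusters: list of clusters, each a list of (start, end) token spans.
    speaker_spans: {speaker_name: [(start, end), ...]} spans where each
        speaker's name heads an utterance.
    """
    # Step 2: re-assign speaker-name spans that were left out of every cluster.
    covered = {span for cluster in clusters for span in cluster}
    for name, spans in speaker_spans.items():
        mentioned = [c for c in clusters if any(s in c for s in spans)]
        if mentioned:
            target = mentioned[0]          # attach to an existing cluster
        else:
            target = []                    # or start a new one for this speaker
            clusters.append(target)
        for s in spans:
            if s not in covered:
                target.append(s)
                covered.add(s)

    # Step 3: merge clusters that share a mention, i.e. present the same chain.
    merged = []
    for cluster in clusters:
        cluster = set(cluster)
        for m in [m for m in merged if m & cluster]:
            cluster |= m
            merged.remove(m)
        merged.append(cluster)
    return [sorted(c) for c in merged]
```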

Coreference-Aware Summarization
In this section, we adopt a neural model for abstractive dialogue summarization, and investigate various methods to enhance it with the coreference information obtained in Section 3.
For each dialogue, there is a set of coreference clusters {C_1, C_2, ..., C_u}, and each cluster C_i contains the mentions of one entity. In the multi-turn dialogue sample shown in Figure 3, there are three coreference clusters (colored in yellow, red, and blue, respectively), and each cluster consists of a number of words/spans in the same coreference chain. During conversational interaction, the referring of pronouns is important for understanding the semantic context (Sacks et al., 1978); thus we postulate that explicitly incorporating coreference information can be useful for abstractive dialogue summarization. In this work, we focus on enhancing the encoder with auxiliary coreference features.

GNN-Based Coreference Fusion
As entities in coreference chains link to each other, a graphical representation can readily characterize the underlying structure and facilitate computational modeling of the inter-connected relations. In previous work, Graph Convolutional Networks (GCN) (Kipf and Welling, 2017) have shown a strong capability of modeling graphical features in various tasks (Yasunaga et al., 2017); thus we use them for coreference feature fusion.

Coreference Graph Construction
To build the chain of a coreference cluster, we add links between each entity and its mentions. Unlike previous work where all entities in one cluster point to the first occurrence, here we connect adjacent pairs to retain more local information. More specifically, given a cluster {E_1, E_2, ..., E_m}, we add a link from each mention E_j to its predecessor E_{j-1}.
Then each coreference chain is transformed into a graph and fed to a graph neural network (GNN), operating over a text input of n tokens (obtained with subword tokenization).
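A minimal sketch of this adjacent-pair construction, assuming each cluster is given as a list of (start, end) mention spans:

```python
def coref_edges(clusters):
    """Build graph edges by linking each mention to its immediate predecessor
    in the same coreference cluster, rather than linking all mentions to the
    first occurrence (retains more local information)."""
    edges = []
    for cluster in clusters:
        mentions = sorted(cluster)            # mentions in textual order
        for prev, curr in zip(mentions, mentions[1:]):
            edges.append((prev, curr))        # adjacent-pair link
    return edges
```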

GNN Encoder
Given a graph G with nodes (words/spans with coreference information in the conversation) and edges (links between mentions), we employ stacked graph modeling layers to update the hidden representations H of all nodes. Here, we take a single coreference graph encoding (CGE) layer as an example: the input of the first CGE layer is the output H of the Transformer encoder. We denote the input of the k-th CGE layer as H^k = {h_1^k, ..., h_n^k}, and the representations of the (k+1)-th layer H^{k+1} are updated as follows:

h_i^{k+1} = LayerNorm( ReLU( Σ_{j ∈ N_i} (W^k h_j^k + b^k) ) )

where W^k and b^k denote the trainable parameter matrix and bias, LayerNorm(*) is the layer normalization component, and N_i denotes the neighborhood of the i-th node. After feature propagation through all stacked CGE layers, we obtain the final representations by adding the coreference-aware hidden states H_G = {h_1^G, ..., h_n^G} to the contextualized hidden states H (a weight λ is used, initialized as 0.7); then the auto-regressive decoder is applied to generate summaries.
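As an illustration, a CGE-style update can be sketched with a mean-aggregation graph convolution in NumPy. This is a sketch under assumptions, not our exact implementation: the self-loop is added here so that tokens outside any chain keep a well-defined state, and the aggregation details may differ from the trained model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-row layer normalization."""
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def cge_layer(H, edges, W, b):
    """One coreference graph encoding (CGE) layer: each node aggregates its
    neighbours' transformed states, followed by ReLU and LayerNorm.

    H: (n, d) node states; edges: list of (i, j) node-index pairs.
    """
    n, _ = H.shape
    A = np.eye(n)                      # self-loop keeps isolated tokens stable
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0        # undirected coreference links
    deg = A.sum(1, keepdims=True)
    H_next = (A / deg) @ H @ W + b     # mean aggregation over the neighbourhood
    return layer_norm(np.maximum(H_next, 0.0))

def fuse(H, H_G, lam=0.7):
    """Add coreference-aware states to the contextual states with weight lam."""
    return H + lam * H_G
```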

Coreference-Guided Attention
Aside from the GNN-based method, which introduces a certain number of additional parameters, we further explore a parameter-free method. With the self-attention mechanism (Vaswani et al., 2017), contextualized representations are obtained through an attentive weighted sum. Entities in a coreference cluster all share referring information at the semantic level. Therefore, we propose to fuse the coreference information via one additional attention layer over the contextualized representation.
Given a sample with coreference clusters, a coreference-guided attention layer is constructed to update the encoded representations H. An overview of adding the coreference-guided attention layer is shown in Figure 5. Since items in the same coreference cluster attend to each other, values in the attention weight matrix A^c are normalized by the number of referring mentions in each cluster, and the representation h_i of token t_i is updated as follows:

a_i = Σ_j A_{ij}^c h_j
h_i^A = λ a_i + (1 - λ) h_i   if t_i ∈ C_*;   h_i^A = h_i   otherwise

where a_i is the attentive representation of t_i: if t_i belongs to a coreference cluster C_*, its representation is updated; otherwise, it remains unchanged. λ is an adjustable parameter initialized as 0.7. In our experimental settings, we observed that when λ was trainable, it converged to 0.69 when our coreference-guided attention model achieved the best performance on the validation set.

Figure 6: Similarity distribution of head probing with the pre-defined coreference matrix. The X-axis shows the heads in the 6-th layer of the Transformer encoder. Values on the Y-axis denote the ratio of samples on which a head has the highest similarity with the coreference attention matrix.

Following the coreference-guided attention layer, we obtain the final representations with coreference information, H^A = {h_1^A, ..., h_n^A}, which are then fed to the decoder for output generation.
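A minimal NumPy sketch of the coreference-guided attention update, assuming clusters are given as lists of token indices:

```python
import numpy as np

def coref_attention(H, clusters, lam=0.7):
    """Parameter-free coreference-guided attention: tokens in the same cluster
    attend uniformly to one another; tokens outside any cluster are unchanged.

    H: (n, d) encoder states; clusters: lists of token indices.
    """
    n = H.shape[0]
    A_c = np.zeros((n, n))
    for cluster in clusters:
        w = 1.0 / len(cluster)             # normalize by number of mentions
        for i in cluster:
            for j in cluster:
                A_c[i, j] = w
    a = A_c @ H                            # attentive representation per token
    in_cluster = A_c.sum(1) > 0
    H_out = H.copy()
    H_out[in_cluster] = lam * a[in_cluster] + (1 - lam) * H[in_cluster]
    return H_out
```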

Coreference-Informed Transformer
While pre-trained models bring significant improvement, they still encode insufficient prior knowledge for tasks requiring high-level semantic understanding such as coreference resolution. In this section, we explore another parameter-free method that directly enhances the language backbone. Since the encoder of our neural architecture uses the self-attention mechanism, we propose to inject features by manipulating attention weights. In our case, the encoder of BART (Lewis et al., 2020) comprises 6 multi-head self-attention layers, and each layer has 12 heads. To incorporate coreference information, we selected heads and modified them with weights that represent coreference mentions (see Figure 7).

Attention Head Probing and Selection
To retain as much prior knowledge from the language backbone as possible, we first conduct a probing task to strategically select attention heads. Since different layers and heads convey linguistic features of different granularity (Hewitt and Manning, 2019), our target is to find the head that represents the most coreference information. We probe the attention heads by measuring the cosine similarity between each head's attention weight matrix A^o and the pre-defined coreference attention matrix A^c described in Section 4.2:

s_i = CosSim(A_i^o, A^c)

where A_i^o is the attention weight matrix of the original i-th head, i ∈ {1, ..., N_h}, and N_h is the number of heads in each layer. Using all samples in the validation set, we conducted probing on all heads in the 5-th and 6-th layers of the 'BART-Base' encoder. We observed that: (1) in the 5-th layer, the 7-th head obtained the highest similarity score on 95.2% of evaluation samples; (2) in the 6-th layer, the 5-th head obtained the highest similarity score on 68.9% of evaluation samples. The statistics of heads in the 6-th encoding layer are shown in Figure 6.
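The probing step reduces to a cosine similarity between flattened attention matrices; a sketch, where head_attn is assumed to hold one n×n attention matrix per head:

```python
import numpy as np

def cos_sim(A, B):
    """Cosine similarity between two matrices, flattened to vectors."""
    a, b = A.ravel(), B.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def probe_heads(head_attn, A_c):
    """Score each head's attention matrix against the pre-defined coreference
    matrix A_c and return the most coreference-like head with all scores."""
    sims = [cos_sim(A_o, A_c) for A_o in head_attn]
    return int(np.argmax(sims)), sims
```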

Coreference-Informed Multi-Head Self-Attention
In order to explicitly utilize the coreference information, we replaced the two predominant attention heads with coreference-informed attention weights. The multi-head self-attention layers (Vaswani et al., 2017) are formulated as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V
x^l = FFN(Concat(head_1, ..., head_{N_h}) W^O)
FFN(x) = ReLU(x W_1 + b_1) W_2 + b_2

where Q, K and V are the sets of queries, keys and values, respectively; W_i and b_i are the trainable parameter matrices and biases; d_k is the dimension of the keys; x_i^l is the representation of the i-th token after the l-th multi-head self-attention layer; and FFN is the point-wise feed-forward layer. Based on the probing analysis in Section 4.3.1, we selected the 7-th head of the 5-th encoding layer and the 5-th head of the 6-th encoding layer for coreference injection, and observed that models with probing-based selection outperformed those with random head selection.
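A simplified sketch of the head replacement. For brevity it ignores the per-head query/key/value projections of a real Transformer (slicing the model dimension instead) and re-normalizes the rows of A^c so the replaced head's weights sum to one; these are assumptions for illustration, not the production implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def coref_informed_attention(Q, K, V, A_c, replace_head, n_heads):
    """Multi-head self-attention where one head's attention weights are
    replaced by the (row-normalized) coreference matrix A_c."""
    n, d = Q.shape
    d_h = d // n_heads
    outs = []
    for h in range(n_heads):
        s = slice(h * d_h, (h + 1) * d_h)
        if h == replace_head:
            # Inject coreference weights; rows without mentions stay zero.
            W = A_c / np.maximum(A_c.sum(-1, keepdims=True), 1e-12)
        else:
            W = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_h))
        outs.append(W @ V[:, s])
    return np.concatenate(outs, -1)
```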

Dataset
We evaluated the proposed methods on SAMSum (Gliwa et al., 2019), a dialogue summarization dataset consisting of 16,369 conversations with human-written summaries. Dataset statistics are listed in Table 1.

Model Settings
The vanilla sequence-to-sequence Transformer (Vaswani et al., 2017) was applied as the base architecture, and we used the pre-trained 'BART-Base' (Lewis et al., 2020) as the language backbone. We compared our models with Multi-View BART (Chen and Yang, 2020), which provides the state-of-the-art result.

Training Configuration
The proposed models were implemented in PyTorch (Paszke et al., 2019) with Hugging Face Transformers (Wolf et al., 2020). The Deep Graph Library (DGL) was used for implementing Coref-GNN. The trainable parameters were optimized with Adam (Kingma and Ba, 2014). The learning rate of the GCN component was 1e-3, and that of BART was set to 2e-5. We trained each model for 20 epochs and selected the best checkpoints on the validation set by ROUGE-2 score. All experiments were run on a single Tesla V100 GPU with 16GB memory.

Results

Automatic Evaluation
We quantitatively evaluated the proposed methods with the standard metric ROUGE (Lin and Och, 2004), and report ROUGE-1, ROUGE-2 and ROUGE-L.^3 As shown in Table 2, our base model BART-Base outperformed Fast-Abs-RL-Enhanced and DynamicConv-News significantly, showing the effectiveness of fine-tuning pre-trained language backbones for abstractive dialogue summarization.
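For reference, a bare-bones unigram ROUGE-1 can be computed as below; official implementations additionally apply stemming and other preprocessing, so scores will differ slightly from this sketch.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Illustrative unigram ROUGE-1: precision, recall and F-measure over
    overlapping unigram counts (no stemming or stopword handling)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())        # clipped unigram overlap
    p = overlap / max(sum(c.values()), 1)
    rec = overlap / max(sum(r.values()), 1)
    f = 2 * p * rec / (p + rec) if p + rec else 0.0
    return p, rec, f
```

A shorter candidate that keeps only correct content words can thus raise precision while leaving recall comparable, which matches the precision-dominated gains reported below.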
Adopting BART-Large brought about a 5% relative improvement, while doubling the parameter size and training time of BART-Base. As shown in Table 2, compared with the base model BART-Base, the performance is improved significantly by our proposed methods. In particular, Coref-Attention performed best, with 4.95%, 6.69% and 2.87% relative F-measure improvements, and Coref-GNN achieved the highest precision scores, with relative improvements of 10.43% on ROUGE-1, 5.81% on ROUGE-2 and 5.17% on ROUGE-L. Coref-Transformer also showed consistent improvement.

Table 3: Comparison with MV-BART (Chen and Yang, 2020). Reported scores are averaged on 100 samples.
Moreover, compared with the BART-Base model (Lewis et al., 2020), the proposed coref-models performed better on ROUGE-1, especially on the precision metrics. More specifically, precision scores are improved by 9.78%, 6.85%, and 8.61% relatively by Coref-GNN, Coref-Attention and Coref-Transformer, respectively. For ROUGE-2 and ROUGE-L, our models also obtain comparable performance. Recently, Feng et al. (2021) conducted a benchmark comparison of state-of-the-art dialogue summarizers. As shown in Table 3, our method (trained with BART-Large) is comparable to MV-BART-Large (Chen and Yang, 2020) and LM-Annotator (D_All) (Feng et al., 2021).
As shown in Table 2, we also observed that the most significant improvement is on the precision scores, while the recall scores remain comparable with strong baselines. Moreover, as shown in Table 4, the average length of summaries generated by the base model is 22.72 tokens, and that of the coref-models is slightly shorter. We speculate that the proposed models tend to generate more concise summaries while preserving the important information, which is also supported by the analysis in Section 7.1.

Human Evaluation
As the example in Figure 1 shows, ROUGE scores are insensitive to semantic errors such as incorrect references, so we conducted a human evaluation to complement the objective metrics. Following Gliwa et al. (2019) and Chen and Yang (2020), each summary is scored on the scale of [-2, 0, 2], where -2 means the summary is unacceptable (it contains wrong references, extracts irrelevant information, or does not make logical sense), 0 means the summary is acceptable but lacks coverage of important information, and 2 refers to a good summary that is concise and informative. We randomly selected 100 test samples, and scored the summaries generated by the base model, Coref-GNN, Coref-Attention and Coref-Transformer. Four linguistic experts conducted the human evaluation, and their average scores are reported in Table 5. Compared with the base model, our coref-models obtain higher scores in human ratings, which is consistent with the quantitative ROUGE results.

Quantitative Analysis
To further evaluate the generation quality and effectiveness of coreference fusion for dialogue summarization, we annotated four types of common errors in the automatic summaries: Missing Information: The content is incomplete in the generated summary compared with the human-written reference. Redundant Information: There is redundant content in the generated summary compared with the human-written reference. Wrong References: The actions are associated with the wrong interlocutors or mentions (e.g., In the example of Figure 1, the summary generated by base model confused "Payton" and "Max" in the actions of "look for good places to buy clothes" and "love reading books").
Incorrect Reasoning: The model incorrectly reasons a conclusion from the context of multiple dialogue turns. Moreover, wrong references and incorrect reasoning lead to factual inconsistency with the source content. We randomly sampled 100 conversations in the test set and manually annotated the summaries generated by the base and our proposed models with the four error types. As shown in Table 6, 34% of summaries generated by the base model cannot cover all the information included in the gold references, and models with coreference fusion improve the information coverage marginally. Coreference-aware models substantially reduced redundant information: an 84% relative reduction by Coref-Attention, 69% by Coref-GNN, and 53% by Coref-Transformer. The Coref-Attention model also performed best on wrong reference errors with a 45% relative reduction, while Coref-GNN and Coref-Transformer both achieved a 36% relative reduction. Encoding coreference information with an additional attention layer substantially improves the reasoning capability, reducing incorrect reasoning by 55% relatively; Coref-Transformer and Coref-GNN also relatively reduced this error by 40% and 20%, respectively, compared with the base model. This shows our models can generate more concise summaries with less redundant content, and that incorporating coreference information helps reduce wrong references and supports better multi-turn reasoning.

Sample Analysis
Here we conduct a sample analysis as in Lewis et al. (2020). Table 7 shows 3 examples along with their corresponding summaries from the BART-Base and Coref-Attention models. Conversations i and ii contain multiple interlocutors and referring expressions. The base model made some referring mistakes: (1) in conversation i, "your brother's wedding" should refer to "Ivan's brother's wedding"; (2) in conversation ii, since "Fillip" and "Tommy" are exactly the same person, the pronouns "you" and "I" in "Would you have an Android cable I could borrow..." should refer to "Tommy" and "Derek McCarthy", respectively. In contrast, the Coref-Attention model was able to make correct statements. However, if the coreference resolution quality is poor, the coreference-aware models will be affected. For example, in conversation iii, when the pronouns "you" and "my girl" in "Julie: That's so cute of you, my girl" are wrongly included in the coreference cluster of "Julie", the model also makes referring mistakes in the summary.

Conclusion
In this paper, we investigated the effectiveness of utilizing coreference information for summarizing multi-party conversations. We proposed three approaches to explicitly incorporate coreference information into neural abstractive dialogue summarization: (1) GNN-based coreference fusion; (2) coreference-guided attention; and (3) coreference-informed Transformer. These methods can be adopted on various neural architectures. Quantitative results and human analysis suggest that coreference information helps track referring chains in conversations. Our proposed models compare favorably with baselines without coreference guidance and generate summaries with higher factual consistency. Our work provides empirical evidence that coreference is useful in dialogue summarization and opens up new possibilities of exploiting coreference for other dialogue-related tasks.