Controllable Neural Dialogue Summarization with Personal Named Entity Planning

In this paper, we propose a controllable neural generation framework that can flexibly guide dialogue summarization with personal named entity planning. The conditional sequences are modulated to decide what types of information or what perspective to focus on when forming summaries, which tackles the under-constrained problem in summarization tasks. This framework supports two types of use cases: (1) Comprehensive Perspective, which is a general-purpose case with no user preference specified, considering summary points from all conversational interlocutors and all mentioned persons; (2) Focus Perspective, positioning the summary based on a user-specified personal named entity, which could be one of the interlocutors or one of the persons mentioned in the conversation. During training, we exploit occurrence planning of personal named entities and coreference information to improve temporal coherence and to minimize hallucination in neural generation. Experimental results, measured with both objective metrics and human evaluations, show that our proposed framework generates fluent and factually consistent summaries under various planning controls.


Introduction
Automatic summarization is the task of compressing a lengthy piece of text into a more concise version while preserving the information of the source content. Extractive approaches select and concatenate salient words, phrases, and sentences from the source to form the summary (Lin and Bilmes, 2011; Kedzie et al., 2018). On the other hand, abstractive approaches generate the summary either from scratch or by paraphrasing important parts of the original text (Jing and McKeown, 2000; Gehrmann et al., 2018). For abstractive summarization to be practically usable, it would require more in-depth comprehension, better generalization, reasoning, and incorporation of real-world knowledge (Hovy et al., 1999; See et al., 2017). While extractive models could suffice for document summarization, abstractive approaches are essential for dialogue summarization to be more easily accessible to users.
Figure 1: Dialogue summary examples generated by personal named entity planning: some examples focus on perspectives from distinct personal named entities (e.g., John, Tony); comprehensive planning includes all personal named entities in the dialogue. Note that the content of the ground-truth summary depends on which personal named entity's perspective the focus is during summary formation.
Most benchmarked summarization datasets focus on the news domain, such as NYT (Sandhaus, 2008) and CNN/Daily Mail (Hermann et al., 2015), as material for large-scale corpus construction is readily available online. Neural approaches have achieved favorable improvements in both extractive and abstractive paradigms (Paulus et al., 2017; Liu and Lapata, 2019). Neural dialogue summarization is an emerging research area (e.g., Goo and Chen, 2018). While the available data collections are much smaller than those for documents (Carletta et al., 2005; Gliwa et al., 2019), neural models have shown potential to generate fluent sentences via fine-tuning on large-scale contextualized language models (Chen and Yang, 2020; Feng et al., 2021). Unfortunately, most summary generation tasks are constructed in an under-constrained fashion (Kryscinski et al., 2019): in their corpus construction process, only one reference summary is annotated. Models trained via supervised learning on such datasets provide general-purpose summaries, but are suboptimal for certain applications and use cases (Fan et al., 2018; Goodwin et al., 2020). For instance, as shown in Figure 1, a human can write summaries from John's or Tony's perspective. However, a neural model with a general summarizing purpose may overlook information that is important to a specific person's perspective. On the other hand, if someone wants to collect as much information as possible from the source content, the summary should be written in a comprehensive manner, taking into consideration all personal named entities. Such needs are not met by models that provide only one possible output.
Furthermore, unlike passages, human-to-human conversations are a dynamic and interactive flow of information exchange (Sacks et al., 1978), and are often informal, verbose, and repetitive. Important information is scattered across speakers and dialogue turns, and is often embodied in incomplete sentences. Therefore, generating a fluent summary by utterance extraction is impractical, and models capable of generating abstractive summaries are required. However, neural abstractive models often suffer from hallucinations that affect their reliability (Zhao et al., 2020), involving improper gendered pronouns and misassigned speaker associations (Chen and Yang, 2020). For example, as shown in Figure 2, the model makes an incorrect description that "she texted Larry last time at the park" (in red). While this sentence achieves a high score in word-overlapping metrics, the semantic meaning it conveys is incorrect: in the context of the generated summary, she refers to Amanda, yet in reality it is Larry that called (not texted) Betty. Such factual inconsistency, the inability to adhere to facts from the source, is a prevalent and unsolved problem in neural text generation.
In this work, we introduce a controllable dialogue summarization framework. As the aim of dialogue summaries often focuses on "who did what" and the narrative flow usually starts with a subject (often persons), we propose to modulate the generation process with personal named entity planning. More specifically, as shown in Figure 1, a set of personal named entities (in color) are extracted from the source dialogue, and used in a generation model as a conditional signal. We postulate that such conditional anchoring enables the model to support flexible generation. It could be especially useful to address certain demands such as targeting specific client needs for customizing marketing strategies or drilling down customer dissatisfaction at call centers to educate customer agents. In addition, to improve the quality of conditional generation outputs, we integrate coreference resolution information into the contextual representation by a graph-based neural component to further reduce incorrect reasoning.
Figure 2: One dialogue summarization example: each coreference chain is highlighted with the same color. The generated sentence in red is factually incorrect.
We conduct extensive experiments on the representative dialogue summarization corpus SAMSum (Gliwa et al., 2019), which consists of multi-turn dialogues and human-written summaries. Empirical results show that our model can achieve state-of-the-art performance, and is able to generate fluent and accurate summaries with different personal named entity plans. Moreover, factual correctness assessment also shows that the output from our model obtains quality improvement on both automatic measures and human evaluation.

Related Work
Text summarization has received extensive research attention, and is mainly studied in abstractive and extractive paradigms (Gehrmann et al., 2018). For extractive summarization, non-neural approaches study various linguistic and statistical features via lexical (Kupiec et al., 1995) and graph-based modeling (Erkan and Radev, 2004). Much progress has been made by recent neural approaches (Nallapati et al., 2017; Kedzie et al., 2018). Compared with extractive methods, abstractive approaches are expected to generate more concise and fluent summaries. While it is a challenging task, with large-scale datasets (Hermann et al., 2015) and sophisticated neural architectures, the performance of abstractive models has improved substantially in the news domain: sequence-to-sequence models were first introduced by Rush et al. (2015) for abstractive summarization; the pointer-generator network (See et al., 2017) elegantly handled out-of-vocabulary issues by copying words directly from the source content; Gehrmann et al. (2018) combined the two paradigms by integrating sentence rewriting into content selection; large-scale pre-trained language models also bring further improvement on summarization performance (Liu and Lapata, 2019; Lewis et al., 2020).
Recently, neural summarization for conversations has become an emerging research area. Corpora are constructed from meetings (Carletta et al., 2005) or daily chats (Gliwa et al., 2019). Based on the characteristics of the dialogues, many studies pay attention to utilizing conversational analysis for dialogue summarization, such as leveraging dialogue acts (Goo and Chen, 2018), multi-modal features (Li et al., 2019), topic information, and fine-grained view segmentation with hierarchical modeling (Chen and Yang, 2020).
Controllable text generation introduces auxiliary signals to obtain diverse or task-specific outputs. It has been studied in various domains such as style transfer (Shen et al., 2017) and paraphrasing (Iyyer et al., 2018). The conditional input can be in the form of pre-defined categorical labels (Hu et al., 2017), latent representations, semantic or syntactic exemplars (Gupta et al., 2020), and keyword planning (Hua and Wang, 2020). Recently, He et al. (2020) and Dou et al. (2021) proposed two generic frameworks for length-controllable and question/entity-guided document summarization. In contrast, we propose personal named entity planning based on the characteristics of dialogue summarization.

Controllable Generation with Personal Named Entity Planning
In this section, we introduce the proposed conditional generation framework, elaborate on how we construct personal named entity planning, and delineate the steps for training and generation.

Task Definition
Controllable dialogue summarization with personal named entity planning is defined as follows:
Input: The input consists of two entries: (1) the source content D, which is a multi-turn dialogue; (2) a customized conditional sequence C, which is the proposed personal named entity planning.
Output: The output is a natural language sequence Y, which represents the summarized information from the source content D with the pre-defined personal named entity plan C. Given one instance of D, Y can be manifested as various summaries conditioned on different choices of C. The output summaries are expected to be fluent and factually correct, covering the indicated entities in the specified conditional signal C.

Personal Named Entity Planning
Personal named entities are used to form a planning sequence. A customized plan represents what the summary includes, covering specific personal named entities that appear in the dialogue. These named entities are not limited to the speaker roles, but include all persons mentioned in the conversation (e.g., "Betty" and "Larry" in Figure 2).

Training with Occurrence Planning
Ground-truth samples for conditional training are built on gold summaries. First, given one dialogue sample and its reference summary, two entity sets are obtained by extracting all personal named entities from the source content and the gold summary respectively. Then, we take the intersection of the two sets, which represents the content coverage of the summary. For instance, given the example in Figure 2, the intersection is {Larry, Amanda, Hannah, Betty}. Next, in order to align the plan with gold summaries written in a certain perspective and narrative flow, we define Occurrence Planning, which reflects the order in which personal named entities occur in the gold summary. To this end, the entity set is re-ordered to {Hannah, Betty, Amanda, Larry}, and converted to a conditional sequence for training the controllable generation framework.
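The construction of an occurrence plan described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the personal named entities of the dialogue have already been extracted (e.g., by an NER tagger), and it locates them in the summary with plain whole-word string matching. The function name `occurrence_plan` and the example summary text are hypothetical.

```python
import re


def occurrence_plan(dialogue_entities, gold_summary):
    """Return the entities shared by dialogue and summary, ordered by
    their first mention in the gold summary (Occurrence Planning)."""
    first_mention = {}
    for name in dialogue_entities:
        match = re.search(r"\b" + re.escape(name) + r"\b", gold_summary)
        if match:  # keep only entities in the intersection of both sets
            first_mention[name] = match.start()
    return sorted(first_mention, key=first_mention.get)


# Entity set from the Figure 2 example; the summary is a toy stand-in (assumption).
entities = {"Larry", "Amanda", "Hannah", "Betty"}
summary = "Hannah needs Betty's number but Amanda doesn't have it. She should ask Larry."
print(occurrence_plan(entities, summary))  # ['Hannah', 'Betty', 'Amanda', 'Larry']
```

Entities that appear in the dialogue but not in the summary are dropped, which mirrors taking the intersection of the two sets before re-ordering.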

Inference: Comprehensive and Focus Planning Summarization Options
Once the model is trained on personal entity planning, one could customize the input conditional signal as a sequence of personal named entities based on downstream application needs. While our framework supports any combination and order of personal named entities that occurred in the given dialogue, here we focus on two conditional inputs during inference: (1) Comprehensive Planning, which includes all personal named entities in a source dialogue (they are ordered according to the occurrence order in the source) and aims to maximize information coverage. This type of summary supports general purpose use cases.
(2) Focus Planning only targets one specific personal entity in the dialogue. Focus planning could be viewed as a subset of comprehensive planning and can be useful in more targeted applications.
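The two inference-time plans can be sketched as below. This is an illustrative sketch under the assumption that the personal named entity mentions of the source dialogue are available as a list in order of appearance; the function names are hypothetical.

```python
def comprehensive_plan(source_mentions):
    """All personal named entities, de-duplicated, in order of first occurrence in the source."""
    return list(dict.fromkeys(source_mentions))


def focus_plan(source_mentions, target):
    """A single user-specified entity, which must occur in the dialogue."""
    if target not in source_mentions:
        raise ValueError(f"{target!r} does not occur in the dialogue")
    return [target]


# Toy mention sequence (assumption), with repeated mentions across turns.
mentions = ["Hannah", "Amanda", "Betty", "Hannah", "Larry", "Amanda"]
print(comprehensive_plan(mentions))   # ['Hannah', 'Amanda', 'Betty', 'Larry']
print(focus_plan(mentions, "Larry"))  # ['Larry']
```

Either list can then be converted to a conditional sequence in the same way as the occurrence plan used during training.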

Controllable Neural Generation
In our framework, a neural sequence-to-sequence network is used for conditional training and generation. As shown in Figure 3, the base architecture is a Transformer-based auto-regressive language model, since the Transformer (Vaswani et al., 2017) is widely adopted in various natural language processing tasks, and shows strong capabilities of contextual modeling (Devlin et al., 2019; Lewis et al., 2020). The input comprises a source dialogue with n tokens $D = \{w_0, w_1, ..., w_n\}$ and a pre-defined personal named entity planning with m tokens $C = \{c_0, c_1, ..., c_m\}$.
Encoder: The encoder consists of a stack of Transformer layers. Each layer has two sub-components: a multi-head layer with self-attention mechanism, and a position-wise feed-forward layer (Equation 1). A residual connection is employed between each pair of the two sub-components, followed by layer normalization (Equation 2).
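For reference, a standard Transformer encoder layer with residual connections and layer normalization (Vaswani et al., 2017) can be written as below. This is a sketch that reuses the component names of this section; the intermediate symbol $\tilde{h}^{l}$ and the exact split into two numbered equations are assumptions.

```latex
\tilde{h}^{l} = \mathrm{LayerNorm}\big(h^{l-1} + \mathrm{MHAtt}(h^{l-1})\big) \qquad (1)
h^{l} = \mathrm{LayerNorm}\big(\tilde{h}^{l} + \mathrm{FNN}(\tilde{h}^{l})\big) \qquad (2)
```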
where l represents the depth of the stacked layers, and $h^0$ is the embedded input sequence. MHAtt, FNN, LayerNorm are multi-head attention, feed-forward and layer normalization components, respectively. Moreover, the additional linguistic feature (e.g., coreference information) is added to the encoded representations.
Decoder: The decoder also consists of a stack of Transformer layers. In addition to the two sub-components in the encoding layers, the decoder inserts another component that performs multi-head attention over hidden representations from the last encoding layer. Then, the decoder generates tokens from left to right in an auto-regressive manner. The architecture and formula details are described in Vaswani et al. (2017).
During training (see Figure 3), the planning sequence C under Occurrence Planning is concatenated with the source dialogue content D, separated by a special token, to form the input. The segmentation tokens are pre-defined in different Transformer-based models, such as '[SEP]' in BERT and '</s>' in BART. The model learns to generate the ground truth $Y = \{y_0, y_1, ..., y_t\}$ (where t is the token number) by summarizing the information from the dialogue context conditioned on the planning sequence. The loss, which corresponds to maximizing the log-likelihood on the training data, is formulated as:
$$\mathcal{L} = -\sum_{i=0}^{t} \log p(y_i \mid y_{<i}, C, D)$$
During inference, we first specify one condition sequence based on the planning schemes described in Section 3.2. Specifically, one can assess the model's learning capability by generating summaries guided by Occurrence Planning. To simulate the real-world controllable generation scenario, Comprehensive Planning and Focus Planning can be applied. The model then creates a summary that is based on the specified condition and is coherent with the context of the input conversation.
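The input construction described above can be illustrated with a small sketch. The placement of the plan before the dialogue, the whitespace joining, and the helper name `build_model_input` are assumptions made for illustration; only the use of a model-specific separator such as '</s>' comes from the text.

```python
def build_model_input(plan, dialogue_turns, sep_token="</s>"):
    """Concatenate the planning sequence C and the dialogue D with a separator token."""
    plan_text = " ".join(plan)
    dialogue_text = " ".join(f"{speaker}: {utterance}" for speaker, utterance in dialogue_turns)
    return f"{plan_text} {sep_token} {dialogue_text}"


# Toy dialogue (assumption) with a two-entity plan.
turns = [("Hannah", "Do you have Betty's number?"), ("Amanda", "Sorry, I can't find it.")]
print(build_model_input(["Hannah", "Betty"], turns))
```

The same function serves training (with an occurrence plan) and inference (with a comprehensive or focus plan), since only the content of the plan changes.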

Improving Factual Correctness
While current neural abstractive systems are able to generate fluent summaries, factual inconsistency remains an unsolved problem. Neural models tend to produce statements that are not supported by the source content. These hallucinations are challenging to eradicate in neural modeling due to the implicit nature of learning representations. In document summarization, it has been demonstrated that a certain proportion of abstractive summaries contain hallucinated statements (Kryscinski et al., 2020), as is observed in dialogue summarization (Chen and Yang, 2020). Such hallucinations raise concerns about the usefulness and reliability of abstractive summarization, as summaries that perform well in traditional word-overlap metrics may fall short of human evaluation standards (Zhao et al., 2020).

Factual Inconsistency Detection
To evaluate and optimize the summarization quality regarding factual correctness, we first build a model to assess the accuracy of generated statements. Negative samples for classification are built via text manipulation, as is done in prior work (Zhao et al., 2020;Kryscinski et al., 2020). Since we focus on conditional personal named entities in this work, we aim to detect the inconsistency issues of person names between the source content and the generated summaries.
As shown in Figure 4, we construct a binary classifier that reads the dialogue and a summary. The classifier output evaluates whether the two input entries are factually consistent. A reference summary in the original dataset is labeled as 'correct'. To generate versions of this summary with the label 'incorrect', we adopt three strategies to build negative samples: (1) Swapping the positions of one pair of personal named entities in the gold summary. Entities that are connected by the word "and" or "or" in one sentence are excluded; (2) Replacing one name (e.g., John) in summaries with another randomly selected name of the same gender (e.g., Peter) in the source content; (3) Replacing one name with another from a person name collection built on the training data. With these samples, a 'BERT-base-uncased' (Devlin et al., 2019) model was fine-tuned to classify whether the summary has been altered. The factual error detector achieved a 91% F1 score on a holdout validation set. To identify all personal named entities in both the conversation and the summary, the Stanza Named Entity Recognition (NER) tagger (Qi et al., 2020) was used.
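The three negative-sample strategies can be sketched with simple string manipulation. This is an illustrative approximation: it assumes names have already been identified, leaves gender matching and the "and"/"or" exclusion to the caller, and uses hypothetical helper names.

```python
import random
import re


def _substitute(text, mapping):
    """Replace whole-word names according to mapping in a single pass."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def swap_pair(summary, name_a, name_b):
    """Strategy (1): exchange the positions of two names in the summary."""
    return _substitute(summary, {name_a: name_b, name_b: name_a})


def replace_name(summary, old_name, candidates, rng=random):
    """Strategies (2) and (3): replace one name with a randomly drawn candidate.
    Pass names from the source dialogue for (2), or from a name collection for (3)."""
    pool = [c for c in candidates if c != old_name]
    if not pool:
        raise ValueError("no replacement candidate available")
    return _substitute(summary, {old_name: rng.choice(pool)})


print(swap_pair("Larry called Betty at the park.", "Larry", "Betty"))
# Betty called Larry at the park.
```

Doing the swap in a single regular-expression pass avoids the usual pitfall of sequential replacement, where the second substitution would undo the first.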

Exploiting Coreference Information
In conversations, speakers refer to themselves and each other and mention other objects/persons, resulting in various coreference links and chains across dialogue turns and speakers. We empirically observed that a sizable number of errors stem from incorrect pronoun assignments in the generation process. Recent language models are also incapable of capturing coreference information without sufficient supervision (Dasigi et al., 2019). Thus, we exploit dialogue coreference resolution in a more explicit manner to enhance the model design.
To this end, we first use the AllenNLP toolkit (Gardner et al., 2017) for coreference resolution on the dialogue samples. With the analyzed coreference mentions and clusters, we build a graph by connecting all nodes in each cluster. Here, we add bi-directional edges between each word/span and its neighboring referring mentions. We then incorporate the coreference information into the Transformer-based sequence-to-sequence model. Given a graph with n nodes, we represent the connected structure with an adjacency matrix A where $A_{ij} = 1$ if node i and node j are connected. For feature integration: (1) to model the linked information with a graph-based method, a multi-layer Graph Convolutional Network (GCN) is applied (Kipf and Welling, 2017). As shown in Figure 3, we feed hidden states from the last layer of the language encoder to the graph modeling component, then implicit features are computed and exchanged among tokens in the same coreference cluster, and we add them to the contextualized representation; (2) we also conduct coreference information integration by adding self-attention layers and adopting head manipulation, which are parameter-efficient and provide the same performance.
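The graph construction and one graph convolution step can be sketched in plain Python. This sketch makes several simplifying assumptions: each mention is represented by a single token index rather than a span, edges link only neighboring mentions within a chain, and the layer uses the symmetric normalization with self-loops from Kipf and Welling (2017) followed by a ReLU. All function names are hypothetical.

```python
import math


def coref_adjacency(n_tokens, clusters):
    """Build A with A[i][j] = 1 for neighboring mentions of the same coreference chain."""
    adj = [[0.0] * n_tokens for _ in range(n_tokens)]
    for cluster in clusters:
        mentions = sorted(cluster)
        for i, j in zip(mentions, mentions[1:]):
            adj[i][j] = adj[j][i] = 1.0  # bi-directional edge
    return adj


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, hidden, weight):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    a_norm = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)] for i in range(n)]
    projected = matmul(matmul(a_norm, hidden), weight)
    return [[max(0.0, v) for v in row] for row in projected]


# Four tokens; tokens 0, 2 and 3 belong to one coreference chain (toy example).
adj = coref_adjacency(4, [[0, 2, 3]])
hidden = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
graph_features = gcn_layer(adj, hidden, identity)
```

In the framework described above, the resulting graph features would be added to the encoder's contextualized representation; a token with no coreference links only sees its own self-loop and therefore keeps its own (projected) features.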

Data Augmentation via Entity Exchange
In addition to the data synthesis strategies in Section 4.1, we further propose an entity-based data augmentation to make the model more robust, reducing incorrect correlations that the model might learn due to data sparsity or imbalanced classes. The augmented data is created in two steps: (1) a personal named entity pair with the same gender attribution is extracted; (2) we exchange them in both source content and reference summary to form new samples. For the data used in this paper, each conversation is independent of the others, and each interlocutor in a particular dialogue is neither an interlocutor nor a mentioned person in any other dialogue. Therefore, we postulate that this entity-based augmentation is helpful to reduce unnecessary inductive bias from the training data. In our experiment, the number of Data Augmentation (DA) samples is 4k.
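The entity exchange step can be sketched as follows. This is a minimal illustration that assumes the same-gender entity pair has already been selected; the function name `exchange_entities` and the toy dialogue are hypothetical.

```python
import re


def exchange_entities(dialogue, summary, name_a, name_b):
    """Swap two names consistently in both the dialogue and its reference summary."""
    mapping = {name_a: name_b, name_b: name_a}
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")

    def swap(text):
        return pattern.sub(lambda m: mapping[m.group(1)], text)

    return swap(dialogue), swap(summary)


dialogue = "John: Hi Tony, are you coming?\nTony: Yes, John."
summary = "John invited Tony."
new_dialogue, new_summary = exchange_entities(dialogue, summary, "John", "Tony")
print(new_summary)  # Tony invited John.
```

Because the exchange is applied to the source and the summary together, the new pair remains factually consistent, unlike the negative samples used for the inconsistency detector, where only the summary is altered.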

Dataset Description
We conduct experiments with the proposed framework on SAMSum (Gliwa et al., 2019), a dialogue summarization dataset. It contains multi-turn daily conversations with human-written summaries. The data statistics are shown in Table 1. [...] words, emoticons, and special tokens, and preprocess them using sub-word tokenization (Lewis et al., 2020). Since the positional embedding of the Transformer-based model supports an input length of 1,024, none of the samples are truncated.

Model Configurations
To leverage large-scale language models such as BERT (Devlin et al., 2019), which provide semantically rich contextualized representations that improve downstream tasks, we use the implementation of BART (Lewis et al., 2020), which is specially pre-trained for sequence-to-sequence language generation, to initialize the parameters of the Transformer layers in Section 3.3, and fine-tune it to boost the performance on our dialogue summarization task. The number of encoder layers, decoder layers, graph modeling layers, and the input and hidden dimension are 6/6/2/768 for 'BART-Base' and 12/12/3/1024 for 'BART-Large', respectively. The learning rate of the Transformer layers was set at 3e−5, and that of the graph layers was set at 1e−3. The AdamW optimizer (Loshchilov and Hutter, 2019) was used with a weight decay of 1e−3 and a linear learning rate scheduler. Batch size was set to 8. Dropout (Srivastava et al., 2014) with a rate of 0.1 was applied as in the original BART configuration. The backbone parameter size is 139M for 'BART-Base' and 406M for 'BART-Large'. For the data augmentation described in Section 4.3, we excluded samples that contain fewer than two personal named entities in their summaries. Best checkpoints were selected based on the validation ROUGE-2 score. A Tesla A100 with 40 GB of memory was used for training, and we used PyTorch 1.7.1 as the computational framework (Paszke et al., 2019).
In the result tables, BART w/o Cond. is the baseline without entity planning conditional training. CTRLsum is the generic controllable summarizer proposed by He et al. (2020), which we further fine-tuned on the dialogue corpus with our entity planning scheme.

Quantitative Evaluation
We first conducted two evaluations with automatic metrics to assess the summarizers.

ROUGE Evaluation
We adopt ROUGE-1, ROUGE-2, and ROUGE-L, as ROUGE (Lin, 2004) is customary in summarization tasks to assess the output quality against gold summaries by counting n-gram overlap. We employ the Py-rouge package to evaluate the models, following Gliwa et al. (2019) and Feng et al. (2021).
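To make the n-gram overlap computation concrete, the sketch below computes ROUGE-N precision, recall, and F1 for a single candidate-reference pair. It is a simplified stand-in, not the Py-rouge package: it assumes lowercased whitespace tokenization and omits stemming, stopword handling, and ROUGE-L. The function name `rouge_n` is hypothetical.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) based on clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped counts of shared n-grams
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


p, r, f = rouge_n("amanda called larry", "larry called betty at the park", n=1)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.333 0.444
```

The example also shows why overlap metrics alone do not capture factual consistency: the candidate shares two of its three unigrams with the reference while describing a different event.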
Matched Training and Testing Conditions: We obtained summaries by conditioning the output generation on the personal named entities in the order they occur in the gold summary (i.e., Occurrence Planning). Since Occurrence Planning is extracted from the gold summaries, it serves as the upper-bound performance for the proposed conditional generation. As Comprehensive Planning and Focus Planning are testing conditions mismatched with the training process, we use Occurrence Planning as a sanity check to ensure that the proposed model performs as expected in the idealized scenario where training and test conditions are matched: Table 2 shows that the conditional training in Section 3.2.1 is indeed effective. Moreover, the model with the 'BART-Large' backbone performs significantly better than that with 'BART-Base', thus we use it for the following generation evaluations. We also select a generic controllable summarizer, CTRLsum (He et al., 2020), for comparison.
As Comprehensive Planning covers the maximum number of personal entities in the dialogue, recall (a sensitivity measure) is more suitable for assessing its performance. Similarly, as Focus Planning only concerns a specific personal entity, precision (a specificity measure) is adopted. For evaluating Focus Planning, we randomly selected one speaker entity from each dialogue as the condition input. While the aim of conditional summary generation (be it Comprehensive Planning or Focus Planning) is not to generate a summary that emulates the gold summary, we nonetheless provide comparison results with the unconditional baseline 'BART w/o Cond.' for analysis purposes (see Appendix for complete ROUGE results and the generated summary examples).
Results in Tables 3-4 suggest: (1) Increasing the information coverage on personal named entities in dialogues improves general-purpose dialogue summarization; (2) Obtaining a reasonably accurate summary focused on a specified personal named entity is feasible; (3) Integrating coreference information and data augmentation improves performance consistently.

Factual Correctness Evaluation
We applied the factual consistency classifier built in Section 4.1 to assess the generated summaries using the accuracy metric (the proportion of samples that are predicted as true). As shown in Table 5, explicitly incorporating coreference information improves the accuracy of generated summaries guided by all conditional plans, and data augmentation brings further improvements. Results of Comprehensive Planning are close to the upper bound of Occurrence Planning. The difference is potentially due to the relatively longer generated summaries and the use of more novel words. Specifically, we observed that the novel word rates (See et al., 2017) of Ctrl+Coref under Occurrence and Comprehensive Planning are 0.28 and 0.33 respectively. The overall accuracy under Focus Planning is relatively lower, which is not unexpected, as more paraphrasing is needed for summarizing from a specified personal entity's perspective. Moreover, the fine-tuned CTRLsum performs similarly to the Ctrl-DiaSumm model, since both of them use 'BART-Large' as the language backbone. However, here we did not pre-train our models on out-of-domain summarization data.

Quality Scoring
We randomly sampled 50 dialogues with generated summaries for two linguistic evaluators to conduct quality scoring (Paulus et al., 2017). Since abstractive models fine-tuned on contextualized language backbones are able to generate fluent sentences (Lewis et al., 2020; Chen and Yang, 2020), we excluded fluency from the scoring criteria. Instead, factual consistency and informativeness were used to measure how accurate and comprehensive the extracted information is according to the source content. Summaries were scored on a scale of [−1, 0, 1], where −1 means a summary was not factually consistent or failed to extract relevant information, 1 means it could be regarded as a human-written output, and 0 means it extracted some relevant information or made minor mistakes. We averaged the normalized scores from evaluators. As shown in Table 6, Comprehensive Planning obtains slightly lower scores in consistency (related to ROUGE precision score) than the training scheme Occurrence Planning, but it achieves higher informativeness scores, which is consistent with the improvement on ROUGE recall scores in Table 3. Moreover, the proposed model (Ctrl+Coref+DA) outperforms the base model significantly under Focus Planning.

Error Analysis
Similar to previous work (Chen and Yang, 2020), we conducted error analysis by checking the following 4 error types: (1) Missing information: content mentioned in references is missing in generated summaries; (2) Wrong references: generated summaries contain information that is not faithful to the original dialogue, or associate actions with wrong named entities; (3) Incorrect reasoning: the model learned incorrect associations leading to wrong conclusions in the generated summary; (4) Improper gendered pronouns. Linguistic analysts were given 50 dialogues randomly chosen from the test set and their corresponding summaries from baselines and our models. They were asked to read the dialogue content and summaries and judge if the 4 error types occurred. For each evaluator, the sequence of presentation was randomized differently. As shown in Table 7, summaries under both planning schemes make significantly fewer errors on all fronts. Under Comprehensive Planning, models with coreference information and data augmentation (Ctrl+Coref+DA) outperform the base model, especially in consistency-related classes. Under Focus Planning, both models produce more factual errors due to more paraphrasing from various personal perspectives. This matches the result from the automatic factual consistency evaluation in Section 5.3.2, and the Ctrl+Coref+DA model achieves significant quality improvement.

Conclusion
In this work, we proposed a controllable neural framework for abstractive dialogue summarization. In particular, a set of personal named entities were used to condition summary generation. This framework could be efficiently tailored to different user preferences and application needs by modulating entity planning. Moreover, the experimental results demonstrated that the abstractive model could benefit from explicitly integrating coreference resolution information, achieving better performance on factual consistency and standard metrics of word overlap with gold summaries.