Reducing Sensitivity on Speaker Names for Text Generation from Dialogues

Changing speaker names consistently throughout a dialogue should not affect its meaning or the corresponding outputs of text generation from dialogues. However, pre-trained language models, which serve as the backbone of dialogue-processing tasks, have been shown to be sensitive to such nuances, which may result in unfairness in real-world applications. No comprehensive analysis of this problem has been conducted in the past. In this work, we propose to quantitatively measure a model's sensitivity on speaker names and comprehensively evaluate a number of known methods for reducing speaker name sensitivity, including a novel approach of our own. Extensive experiments on multiple datasets provide a benchmark for this problem and show the favorable performance of our approach in sensitivity reduction and generation quality.


Introduction
The safety and fairness of generations from dialogue models is a crucial concern in real applications. Previous work focuses on response generation in open-ended dialogue systems (Xu et al., 2020; Henderson et al., 2018), covering offensive content (Baheti et al., 2021), gender bias (Dinan et al., 2020) and other discriminatory behavior (Sheng et al., 2021; Smith and Williams, 2021). For other text generation tasks where the whole dialogue is provided and the output should not go beyond the dialogue, such as dialogue summarization (Gliwa et al., 2019) and dialogue reading comprehension, the fairness issue remains unexplored.
In these tasks, the input dialogues are self-contained, and the names of the speakers do not carry any connotation from outside of the dialogue. Therefore, changing the speaker names consistently in a dialogue should not affect the meaning of the dialogue or the desired outputs. This contrasts with response generation, where the dialogue is in progress and the output is expected to differ in style or content for different speakers. Taking dialogue summarization (Gliwa et al., 2019) as an example of text generation from dialogues, it focuses on generating concise "who-did-what" summaries in the third person. In Fig. 1, the two dialogues are identical except for the speaker names, so the two summaries are expected to be the same modulo the speaker names. Unfortunately, models following the pretrain-finetune paradigm are sensitive to trivial changes, which has been verified in other tasks. In relation extraction, spurious correlations between entity mentions and relations lead to entity bias (Zhang et al., 2017, 2018; Wang et al., 2022b). Other related work includes an analysis of robustness under entity renaming for machine reading comprehension models on narrative texts (Yan et al., 2022) and name biases in machine translation into inflected languages such as German (Wang et al., 2022a). Besides, Shwartz et al. (2020) claim that pre-trained language models do not treat given names as interchangeable or anonymous, showing unfairness in reading comprehension.
Dialogue understanding models are evidently sensitive to speaker names, as Fig. 1 shows. The model tends to generate different information given different speaker names, such as "don't want to go" and "doesn't like them". The incorrect content "... Betsy don't want to go" is generated with the first group of speakers, but not with the other group. According to our pilot experiment with the vanilla BART fine-tuned on SAMSum, around 74.00% of generations are changed by switching speaker names, and 69.82% of these changes are due to distinct contents. Such uneven performance creates unfairness among different speakers, especially in terms of information allocation. The model may also pick up latent properties of names (Romanov et al., 2019) and lead to discrimination, raising the importance of research on speaker name sensitivity.
Previous work has also touched on this problem. Different data pre-processing approaches are adopted during the construction of datasets to avoid using speaker names, such as "A" or "B" in Li et al. (2017). Khalifa et al. (2021) replace speaker names with more common and frequent names that the model may have seen during pre-training. Data augmentation by changing speaker names has also been adopted. However, all of these works only attempted to attack this problem subjectively, without quantitative analysis or fair comparisons.
In this work, we systematically analyze speaker name sensitivity in text generation from dialogues. We define speaker name sensitivity and divide existing approaches into offline and online ones. Then, we propose two novel insensitivity losses, which reduce the attention and hidden-state distances between copies of the same dialogue with different speaker names for transformer-based models during fine-tuning. These losses can be used in both kinds of approaches. Results on several tasks show that our losses reduce the sensitivity and yield better generations. In summary, our contributions are: • We are the first to investigate speaker name sensitivity in text generation from dialogues (Sec. 2.1), with all of the code and results open-sourced at https://github.com/JiaQiSJTU/SpeakerNameSensitivity.
• We introduce two novel insensitivity losses as auxiliary training objectives for reducing sensitivity during fine-tuning (Sec. 3).
• Experiments on different tasks provide a benchmark with comprehensive analysis of speaker name sensitivity, and show state-of-the-art performance of our approach (Sec. 5).

Speaker Name Sensitivity
Speaker name sensitivity is the degree of difference among a model's generations given identical dialogues that differ only in speaker names. We define it as follows.
Let d denote the input dialogue and c denote other input content, which can be empty for tasks like dialogue summarization, or a piece of text such as a question for reading comprehension. p refers to the set of speaker names in d. f is a one-to-one mapping from p to a set of names p′ drawn from a name pool P consisting of candidate names to be substituted into the samples. The names p′ are sampled under the uniform distribution, without loss of generality. The speaker name sensitivity SS of a generation model M(·) on this sample is:

SS_{d,c} = δ(M(d, c), M(Rep(d, c)))

where Rep(·) replaces names in the sample given f, i.e., from p to p′, and δ(·) quantifies the differences among generations. Then, the sensitivity SS of the model M(·) is the expectation of SS_{d,c} over all samples from the real-world distribution D:

SS = E_{(d,c)∼D}[SS_{d,c}]

In practice, a dialogue dataset is regarded as a sample from D for evaluation. Each sample in the dataset is provided with a reference output o for supervised training. We use D_tr, D_va and D_te to refer to the training, validation and test sets. See detailed implementations and metrics in Sec. 4.1.
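As a toy illustration, the per-sample definition above can be sketched as follows. Here `model`, `delta`, and the replacement routine `rep` are hypothetical stand-ins for an actual generation model, a difference measure, and the name substitution step Rep(·).

```python
import random

def rep(text, mapping):
    """Rep(.): substitute speaker names per the one-to-one mapping f.

    A naive string replacement; a real implementation would respect
    token boundaries and avoid chained substitutions.
    """
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

def sample_sensitivity(model, d, c, p, pool, delta, seed=0):
    """SS for one sample (d, c): compare the generation before and after
    consistently renaming the speakers p with names p' drawn from the pool."""
    rng = random.Random(seed)
    p_prime = rng.sample(sorted(pool - set(p)), k=len(p))
    f = dict(zip(p, p_prime))            # one-to-one mapping p -> p'
    f_inv = {v: k for k, v in f.items()}
    out = model(d, c)
    out_renamed = model(rep(d, f), rep(c, f))
    # change replaced names back so only genuine content differences count
    return delta(out, rep(out_renamed, f_inv))
```

A perfectly insensitive model yields SS = 0 on every sample; the dataset-level sensitivity is then the mean of `sample_sensitivity` over the test set.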

Existing Approaches
We investigate existing approaches that target reducing the sensitivity and classify them into offline and online ones: the former seek to reduce the sensitivity by learning better model parameters, while the latter pursue insensitivity by unifying or simplifying the input data. Online approaches therefore require data processing steps before inputting into the model and after inference at test time, and the speaker names in D_tr, D_va and D_te are all changed. The model needs fine-tuning under both kinds of approaches.
Offline approaches include: Embedding Layer (Emb): Similar to Gu et al. (2020) and He et al. (2021), an additional embedding layer can be adopted to represent whether the model should be sensitive to the corresponding tokens. Two embeddings are learned during fine-tuning.
Augmentation (Aug): Prior work proposed data augmentation by exchanging speaker names in training samples with names from D_tr, aiming to reduce unexpected inductive bias caused by speaker names, which is similar to our goal. The model is fine-tuned with the augmented training data, while D_va and D_te remain unchanged.
Online approaches are: ID: Some works (Cui et al., 2020; Li et al., 2017) replace speaker names with predefined IDs to avoid name bias. We use "Speaker[NUM]", similarly to Kim et al. (2019), which is close to words seen during pre-training and fits different numbers of speakers. "[NUM]" is the index of a speaker's first occurrence.
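A minimal sketch of this ID normalization, assuming a simple "Name: utterance" line format (the Speaker[NUM] convention follows the description above; the parsing is a simplification):

```python
import re

def to_speaker_ids(dialogue):
    """Replace speaker names with Speaker1, Speaker2, ... by first occurrence."""
    lines = [line.partition(": ") for line in dialogue.splitlines()]
    ids = {}
    for name, _, _ in lines:
        if name not in ids:
            ids[name] = f"Speaker{len(ids) + 1}"
    out = []
    for name, _, utt in lines:
        # also rewrite in-utterance mentions of the speakers
        for n, sid in ids.items():
            utt = re.sub(rf"\b{re.escape(n)}\b", sid, utt)
        out.append(f"{ids[name]}: {utt}")
    return "\n".join(out)
```

For example, "Amy: hi Bob" / "Bob: hey Amy" becomes "Speaker1: hi Speaker2" / "Speaker2: hey Speaker1".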
Frequent (Fre): This refers to the approach proposed in Khalifa et al. (2021). They use 100 frequent male and 100 frequent female names from an online list (https://www.ssa.gov/oact/babynames/decades/century.html) as the pool P for sampling replacements. This approach can be combined with Aug into FreAug.

Proposed Approach
We focus on the widely adopted encoder-decoder architecture of pre-trained generation models and design two auxiliary insensitivity losses to take full advantage of augmented data on top of Aug. Given the same dialogue sample with different speaker names, a model produces distinct generations due to different internal behaviors. Therefore, penalizing unexpected internal differences should help the model behave consistently and reduce the sensitivity.
With this intuition, we propose the cross-attention loss and the decoder-hidden-state loss; an illustration is in Appendix A. The former corresponds to the cross-attention distributions that help the decoder make a soft information selection among encoder hidden states at each step, and should be similar across different speaker names. The latter is based on the final decoder hidden states, which are expected to be the same under the default teacher-forcing training strategy, except for the speaker name tokens. We did not consider the encoder attentions since, according to our pilot analysis of the vanilla BART, the cross-attention distance of different predictions is around 1.5 times that of identical ones, whereas there are no such differences in the encoder attentions. Other intermediate hidden states are excluded since they are all affected by the different input embeddings of speaker names; only the final decoder hidden states are expected to be the same.

Cross-attention Insensitivity Loss
We denote a model's input and output lengths, i.e., the numbers of tokens, as d_in and d_out. During training, the cross attentions calculated for each output token are collected as CA ∈ R^{N×d_out×d_in}, where N is the number of heads of the multi-head attention mechanism, determined by the configuration of the pre-trained model. We apply average pooling over the d_out dimension to get the overall attention over the input tokens, \overline{CA} ∈ R^{N×d_in}.
Given an original sample {d_i, c_i, o_i}, we construct K − 1 augmented samples by replacing speaker names. The averaged attentions for all samples are {\overline{CA}_k}_{k=1}^K. Since each sample goes through the tokenizer before being input to the model, the input lengths {d_in^k}_{k=1}^K are not guaranteed to be identical, for two reasons. First, names may be tokenized into different numbers of tokens. For example, "John" and "Robinson" are tokenized into {"John"} and {"Rob", "inson"} by the BART tokenizer, so replacing "John" with "Robinson" in d_i increases the sequence length. Second, long inputs may be truncated at different tokens. So, we consider two corresponding functions for unification: • Sum(·) sums up the attention values of the tokens belonging to an occurrence of a speaker name.
• Pad(·) pads attentions to the same unified length d_in^u by concatenating zeros, meaning that this part of the content is missing.
Finally, the loss is calculated as:

L_ca = Σ_{k<k′} loss( Pad(Sum(\overline{CA}_k)), Pad(Sum(\overline{CA}_{k′})) )

where loss(·) measures the distance between a pair of attentions.
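A numpy sketch of this loss under the unification steps above. The span bookkeeping (`name_spans` as (start, end) token index pairs per name occurrence) is a hypothetical interface, and mean squared error stands in for loss(·):

```python
import numpy as np

def sum_names(attn, name_spans):
    """Sum(.): collapse each speaker-name occurrence (a token span) to one column.

    attn: [num_heads, d_in]; name_spans: sorted (start, end) index pairs.
    """
    cols, i, j = [], 0, 0
    while i < attn.shape[1]:
        if j < len(name_spans) and i == name_spans[j][0]:
            s, e = name_spans[j]
            cols.append(attn[:, s:e].sum(axis=1))   # merge multi-token names
            i, j = e, j + 1
        else:
            cols.append(attn[:, i])
            i += 1
    return np.stack(cols, axis=1)

def pad_to(attn, length):
    """Pad(.): right-pad with zeros so all copies share one unified length."""
    pad = np.zeros((attn.shape[0], length - attn.shape[1]))
    return np.concatenate([attn, pad], axis=1)

def cross_attention_loss(attns, spans):
    """L_ca: mean squared error between unified attention maps of the K copies."""
    unified = [sum_names(a, s) for a, s in zip(attns, spans)]
    d = max(u.shape[1] for u in unified)
    unified = [pad_to(u, d) for u in unified]
    pairs = [(a, b) for i, a in enumerate(unified) for b in unified[i + 1:]]
    return float(np.mean([np.mean((a - b) ** 2) for a, b in pairs]))
```

For instance, with K = 2 and a name tokenized into one piece in the first copy but two pieces in the second ("John" vs. "Rob" + "inson"), Sum(·) restores a common length before the comparison.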

Decoder-hidden-state Insensitivity Loss
Similarly, the hidden states of the decoder's final layer for all samples can be denoted as {DH_k}_{k=1}^K, where DH_k ∈ R^{d_out^k×H} and H represents the hidden size. Their lengths also vary due to the two cases above. We adopt two different functions: • Del(·) ignores the hidden states whose predicted tokens belong to a speaker name.
• Trunc(·) truncates the redundant hidden states at the end that have no paired counterparts.
The loss is defined as:

L_dh = Σ_{k<k′} loss( Trunc(Del(DH_k)), Trunc(Del(DH_{k′})) )

We adopt the mean squared error as loss(·) for both losses.
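Analogously, a numpy sketch of the decoder-hidden-state loss. Here `name_positions` (output token indices belonging to speaker names) is hypothetical bookkeeping that a real implementation would derive from the tokenized references:

```python
import numpy as np

def del_names(hidden, name_positions):
    """Del(.): drop states whose target tokens belong to a speaker name."""
    keep = [t for t in range(hidden.shape[0]) if t not in set(name_positions)]
    return hidden[keep]

def decoder_hidden_loss(hiddens, name_positions):
    """L_dh: MSE between the final decoder states of the K renamed copies.

    hiddens: list of [d_out_k, H] arrays with possibly different lengths.
    """
    cleaned = [del_names(h, p) for h, p in zip(hiddens, name_positions)]
    d = min(c.shape[0] for c in cleaned)   # Trunc(.): cut unpaired trailing states
    cleaned = [c[:d] for c in cleaned]
    pairs = [(a, b) for i, a in enumerate(cleaned) for b in cleaned[i + 1:]]
    return float(np.mean([np.mean((a - b) ** 2) for a, b in pairs]))
```

After Del(·) removes name positions and Trunc(·) aligns the lengths, the remaining states of the K copies are compared position by position.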

Learning Objective
L_ca and L_dh are added to the vanilla generation loss L_gen with hyper-parameters α and β:

L = L_gen + α L_ca + β L_dh

The insensitivity losses are only auxiliary fine-tuning objectives, leaving inference unchanged. They can be added on top of both Aug and FreAug, denoted as Ins and FreIns.
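In sketch form, the combined objective is a weighted sum; the default weights below mirror the α and β search space used in the experiments:

```python
def total_loss(l_gen, l_ca, l_dh, alpha=1.0, beta=10.0):
    """L = L_gen + alpha * L_ca + beta * L_dh.

    The auxiliary terms are used only during fine-tuning,
    so inference is unaffected.
    """
    return l_gen + alpha * l_ca + beta * l_dh
```

Setting alpha = beta = 0 recovers the vanilla generation objective.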

Experimental Setup
We define the evaluation metrics for sensitivity, introduce multiple text generation tasks with dialogue data and present implementation details.

Evaluation Metrics for Sensitivity
We uniformly sample names from P, which is specified later, to realize f without loss of generality, and re-sample a name if it appears in the conversation but not in p. We avoid changing names mentioned during the conversation in case they are grounded entities. Since it is impossible to enumerate all possible f, we substitute the names of samples in D_te for T = 5 times. Note that varying names in the test data is different from the augmentation approach: the additional test data is fixed once constructed, so that approaches can be compared by quantitatively measuring the sensitivity. We introduce three kinds of δ(·) with a task-specific evaluation metric Score(·), such as Rouge and BertScore for dialogue summarization, and measure the speaker name sensitivity of a model similarly to Prabhakaran et al. (2019). Pairwise Sensitivity (S-*) is defined as:

S-* = (1/N_te) Σ_{i=1}^{N_te} E_{t≠t′}[ 1 − Score(o_i^t, o_i^{t′}) ]

where o_i^t is the generation with replaced names changed back for evaluation, N_te is the number of samples in D_te, and E(·) is the mean operator.
Dialogue models are also expected to obtain the same scores under task-specific evaluation metrics compared with the reference o. So, we can also add o as an input of δ(·) in Eq. 1 and define two further metrics, Score Range (R-*):

R-* = (1/N_te) Σ_{i=1}^{N_te} [ max_t Score(o_i^t, o_i) − min_t Score(o_i^t, o_i) ]

and Score Deviation (D-*):

D-* = (1/N_te) Σ_{i=1}^{N_te} Std_t( Score(o_i^t, o_i) )

The sensitivity metrics here are the lower the better and are denoted by ↓ in the following sections.
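Given the task-metric scores of the T renamed copies of one sample, the three sensitivity measures can be sketched as follows; `sim` is a hypothetical task metric applied as a similarity between two generations:

```python
import numpy as np

def pairwise_sensitivity(outputs, sim):
    """S-*: average dissimilarity between pairs of renamed generations."""
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    return float(np.mean([1.0 - sim(a, b) for a, b in pairs]))

def score_range(scores):
    """R-*: spread of scores against the reference across the T copies."""
    return float(np.max(scores) - np.min(scores))

def score_deviation(scores):
    """D-*: standard deviation of scores against the reference."""
    return float(np.std(scores))
```

Dataset-level values average these per-sample quantities over the test set; all three are lower-is-better.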

Tasks and Datasets
We implement our experiments on the tasks below. The statistics are in Table 1, and we report the macro-average scores over samples for each metric. Dialogue Summarization outputs fluent and concise summaries covering the salient information in dialogues. We experiment with the SAMSum dataset (Gliwa et al., 2019), consisting of around 16k open-domain dialogues among two or more interlocutors. Rouge-2 F1 (Lin, 2004) and BertScore F1 (Zhang et al., 2019) are the task-specific evaluation metrics. We keep genders consistent when switching names, following Khalifa et al. (2021).
Question Generation is to generate a question given an input dialogue and its corresponding answer span. We use the Molweni dataset, made up of around 10k task-oriented dialogues sampled from the Ubuntu Chat Corpus. Similar to question generation work based on SQuAD 1.1, we extract (dialogue, answer, question) tuples from the original Molweni dataset and ignore unanswerable questions. Bleu (Papineni et al., 2002) and Rouge-L F1 are used for evaluation.
Reading Comprehension generates an answer given a dialogue and a question. We use the Molweni dataset and ignore unanswerable questions as well. Bleu and Rouge-L F1 are also used for evaluation.

Implementation Details
We use BART-large as our basic pre-trained model. We truncate inputs to the first 1024 tokens; the learning rate is 3e-5 with a weight decay of 0.01. The model is fine-tuned with a batch size of 32 for 10 epochs. We evaluate the performance on D_va after each epoch with Rouge-2 F1 or Bleu, and the checkpoint with the highest score on D_va is saved for testing. During inference, we decode with no_repeat_ngram_size=3, length_penalty=1.0 and num_beams=4. We search α and β in {1, 10, 20} empirically and report results with the best validation performance. Specifically, α equals 1; β equals 1 for reading comprehension and 10 for the other tasks. Our experiments are run on a single RTX 2080Ti with 11 GB of GPU memory. Considering the GPU memory footprint, we set K = 2, which is the same for Aug and FreAug for fair comparisons.
We test online approaches with their corresponding test sets. For offline approaches, we focus on two sources of P. One is in-distribution names, i.e., speaker names from the corresponding D_tr. The other is all-possible names, with more than 117 thousand names, which can reflect the models' performance in complicated real scenarios. For approaches with sampling operations, we construct data with 3 different random seeds. Results are averaged over the number of runs.

Results
We first show the performance of all approaches, followed by ablation studies and evaluations. Then, we take a closer look at offline approaches, which reveal the inherent capability of the models, with multi-faceted analysis. Hyper-parameter search and case studies are in Appendices C and E.

Performance of Offline Approaches
The performance on the original test sets is shown in Table 2. Emb only outperforms Vanilla on question generation, and Aug only makes small improvements over Vanilla on dialogue summarization. Our approach Ins makes consistent improvements, performing best among offline approaches. Results with sensitivity scores are in Table 3. Emb fails to generate more insensitive results, especially for question generation. Aug does not deliver promising improvements in output quality over Vanilla, but it reduces the sensitivity of models across different test sets and tasks. Ins leads to better results on randomly augmented training data with different random seeds, significantly outperforming Aug. In short, Ins achieves the best performance among offline approaches.
Comparing the results in Table 3 horizontally, in-distribution names perform better than all-possible names on dialogue summarization, whereas the results are the opposite for the other tasks. Speaker names in SAMSum are mostly real and popular names, while names in Molweni are online nicknames containing unknown words, such as "zykotick9". All-possible names contain a large proportion of real names and a small proportion of names never seen during pre-training, which can be regarded as nicknames. We can thus observe that the difficulty of modeling names for a model is "SAMSum in-distribution < all-possible < Molweni in-distribution". In other words, models perform better on more popular names, in accord with the success of Fre in Sec. 5.2.

Performance of Online Approaches
The results of online approaches are in Table 4. All speaker names are normalized into fixed code names in ID, so the test set for ID is unchanged for each sample and the sensitivity scores are exactly 0.0. Unfortunately, its quality scores lag behind Ins and even drop dramatically on dialogue summarization. Thus, ID is not recommended as a necessary data pre-processing step.
Fre makes some improvements on R2 for dialogue summarization compared with the vanilla model, which is consistent with the results in Khalifa et al. (2021), whereas the drops in BertScore were not mentioned in their work. The sensitivity scores are lower than those of the offline approaches in Table 3, including Vanilla. This shows that the advantages of Fre come not only from using a group of frequent names that are easier for a model to understand, but also from fine-tuning with this group of names. FreAug does not improve output quality consistently, but reduces the sensitivity scores. FreIns performs the most insensitively with better generation quality among online approaches.

Ablation Study
Ablation studies of our full approach Ins are in Table 5. Aug is regarded as an ablation representing the model trained without any auxiliary losses. Both insensitivity losses outperform Aug, with L_dh alone topping most metrics, showing that penalizing differences in the decoder hidden states has a more direct effect on the outputs. Combining both losses brings further performance gains.

Human Evaluation
Taking dialogue summarization as an example, we conducted a human evaluation to further verify the improvement in sensitivity. We sampled 200 pairs of generations for each offline approach and asked three proficient English speakers from Asia to label each case with one of 4 choices, selecting the primary factor that makes the generations distinct: Information difference means the outputs contain different information or keywords. Factual difference refers to different matchings between speakers and events.
Expression difference means the outputs have only minor differences, such as capitalization or different orders of juxtaposed names. Same represents identical outputs. The results are in Fig. 2 with a Kappa score of 0.64, indicating substantial agreement. Content distinction is the primary difference type. Ins generates less distinct content and more identical results, outperforming the baselines.

Sensitivity among Name Groups
We collect specific groups of names in terms of popularity and race, and show differences in quality on test sets constructed with the corresponding names. The sensitivity among different groups for each method is reflected by the vertical scattering of dots in Fig. 3. Name groups by popularity and usage: We define 4 groups. Frequent, including words frequently and solely used as human names, was introduced before. Polysemous represents words frequently used but not specialized as human names, such as June and Florida. Rare is names with low occurrence counts, like Paderau. Unknown names are similar to random strings from a model's perspective, since the model has never been exposed to them. The last three groups are collected by counting the occurrences of all-possible names in the pre-training corpus of BART. We select 200 names for each group (more details are in Appendix B).
According to Fig. 3a, models usually perform poorly on Polysemous, even worse than on Rare and Unknown: the daily meanings dominate the representations of these words and confuse the model. Frequent generally outperforms the other groups. We conclude that words frequently and uniquely used as names result in more specialized embeddings in pre-trained models and thus perform better. Moreover, comparing the sensitivity among different approaches, Ins outperforms the baselines in most cases, with the exception of Aug, which achieves more centralized dots only through performance reduction on the dominant groups or even all groups, showing that models tend to overfit augmented data without our losses. To recap, Ins yields consistent improvements over Vanilla across tasks compared with the other baselines.
Name groups by race: Names for different races are taken from Tzioumis (2018), assigning each name to the race with the highest probability. Four major groups are gathered: Non-Hispanic White; Hispanic or Latino; Non-Hispanic Black or African American; and Non-Hispanic Asian or Native Hawaiian or Other Pacific Islander. To avoid the influence of the varying numbers of names, we select the 50 most frequent names in each group and show the results in Fig. 3b. All of the approaches show discrimination against Asian names in dialogue summarization. Emb, Aug and Ins improve the insensitivity among different races compared with Vanilla, and Ins is better while guaranteeing quality. We leave special designs for demographic features to future work.

Sensitivity on an Individual Speaker
We can also change the name of only a single speaker each time to analyze fine-grained sensitivity. The results of offline approaches for dialogue summarization are shown in Table 6 (see more in Appendix D). The sensitivity scores are lower than the ones in Table 3. It may seem that the sensitivity of models is proportional to the amount of change in test samples, i.e., whether all speaker names are changed (change-all-name) or only one (change-one-name). However, this is not always true, and changing one name can be more sensitive than changing all names. Taking the results from Ins as an example, around 52.01% of samples have speakers whose change-one-name D-BertS is higher than the corresponding change-all-name one. For over 34.80% of samples, the change-one-name D-BertS averaged over speakers from the same dialogue is also higher than the change-all-name D-BertS. We further show the trends between speaker features and their sensitivity scores in Fig. 4. Names are more sensitive, and thus crucial, for speakers at the start of a dialogue or with more utterances, deserving attention in further improvements.

Related Work
Entity/Name Bias in Narrative Texts: Previous work on entity bias shows that pre-trained language models are sensitive to changes in narrative text. Some works on relation extraction (Zhang et al., 2017, 2018; Wang et al., 2022b) mask entities in the context to prevent learning spurious features between entities and relations. Yan et al. (2022) analyze the robustness of models under entity renaming for reading comprehension. These works consider different kinds of entities, such as persons and organizations. However, such entities have the potential to be grounded in real life (Smith and Williams, 2021), and background knowledge about them may be necessary for understanding. Besides, the context and the entities cannot always be well separated, especially for persons (Yan et al., 2022). Thus, masking and switching operations are not always suitable for these entities. In our work, we focus on speakers that are not grounded.
Names that are not grounded have also been studied. Information such as age, gender and race can be reflected by a given name to some extent (Girma, 2020), and models trained on statistical features may make wrong predictions about specific persons or introduce unexpected stereotypes (Bertrand and Mullainathan, 2004). Romanov et al. (2019) take occupation classification as an example and discourage the model from predicting an individual's occupation based on his/her name. Wang et al. (2022a) show that machine translation models perform poorly on female names when translating into languages with grammatical gender, and also exhibit sentiment bias caused by names containing sentiment-ambiguous words. Samples in all of these works contain only a single name each, whereas multiple speaker names are entangled in a single dialogue.
Fairness of Dialogue Models: Safety and fairness issues in generations from dialogue models are crucial for deployment in practice. Harmful differences in responses caused by different demographic personas have been observed in well-known dialogue systems (Sheng et al., 2021; Dinan et al., 2020), including offensiveness, gender bias, race discrimination, etc. These unfairness phenomena also exist in dialogue systems without considering personas, reflected in the politeness, sentiment, diversity and other aspects of a response. Recent work by Smith and Williams (2021) shows that dialogue models treat their conversation partner differently given different speaker names. Instead of analyzing differences in open-ended dialogue systems, we target text generation tasks given dialogues and show that sensitivity/unfairness also exists among speakers.

Conclusion
This paper focuses on speaker name sensitivity in text generation from dialogues. We provide a classification of previous approaches and propose insensitivity losses that reduce the sensitivity while achieving favorable generation quality. Fair comparisons and comprehensive analysis are conducted among different approaches by evaluating the sensitivity quantitatively. We hope for more approaches targeting dialogue sensitivity issues in the future.

Limitations
Our work has the following limitations. First, we cannot generalize our conclusions to other languages that differ dramatically from English, or to more complicated multi-lingual scenarios, without further experiments.
Second, we didn't consider any special designs on demographic features of names in our proposed approach. As shown in Sec. 5.5, discrimination does exist among different groups. Although Ins outperforms other baselines overall, there is still room to improve insensitivity among different groups for tasks with longer outputs containing multiple speaker names. We hypothesize that demographic features of names can be added through a more dedicated data augmentation strategy.
Third, our experimentation was restricted to the BART model. Among all the models that can be fine-tuned with our limited resources, including T5 and GPT-2, BART is the best-performing and the most popular, so we picked it as the target of this study, devoting the limited paper space to a more in-depth analysis of the problem across a range of tasks. Besides, it should be noted that speaker name sensitivity remains an issue for recent large pre-trained models, as shown by the dialogue summarization example with ChatGPT outputs in Fig. 5. The two summaries are expected to be the same, modulo speaker names. However, the third speaker (Sergio/Ashley) is not even mentioned in Summary-2.
We will try to address these limitations in the future.

Ethics Statement
All of the name lists adopted in this paper are borrowed from public websites (https://www.ssa.gov) and previous publications (Tzioumis, 2018; Khalifa et al., 2021). We considered only binary genders and four racial groups, which are clearly incomplete for depicting all humans. Our work mainly aims at drawing researchers' attention to the unfairness caused by speaker names in text generation tasks given dialogues. These demographic features were selected to shed light on this potential issue, and our method is not restricted to any specific demographic groups.

B Name Groups
To collect Polysemous, Rare and Unknown names, we counted the occurrences of all-possible names in the pre-training corpora, Wikipedia and BookCorpus. We denote the frequency of a name as f_exact or f_ner, depending on whether occurrences are counted by exact string matching or by named entity recognition, respectively. Rare contains names shown at least once, i.e., with the lowest non-zero f_exact. Unknown includes names with f_exact equaling 0. According to our observations, names with a larger f_exact are likely to be polysemous and not uniquely used as personal names. So, we design a metric to recognize such names as follows:

u = rank(f_exact) − rank(f_ner)

where rank(·) is the ranking of a name in the whole name list by frequency in descending order. A higher u indicates a higher level of uniqueness of a word as a name. The names with the lowest u scores are selected as Polysemous in Sec. 5.5. Examples of names in the different name groups are listed as follows:
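A sketch of this selection. The form u = rank(f_exact) − rank(f_ner) is our reading of the description above (an assumption; the exact formula may differ): words that are very frequent overall but comparatively rare as recognized person names receive a low u.

```python
def rank(freqs):
    """rank(.): 1-based position of each name sorted by frequency, descending."""
    order = sorted(freqs, key=freqs.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(order)}

def uniqueness(f_exact, f_ner):
    """u per name, sketched as rank(f_exact) - rank(f_ner).

    Low u flags polysemous words: high exact-match frequency,
    but relatively few occurrences recognized as person names.
    """
    r_exact, r_ner = rank(f_exact), rank(f_ner)
    return {name: r_exact[name] - r_ner[name] for name in f_exact}
```

For example, a word like "June" that ranks high by exact matches but much lower by NER counts gets a lower u than a dedicated name like "John".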

Figure 6: An illustration of the insensitivity losses, i.e., the cross-attention insensitivity loss and the decoder-hidden-state insensitivity loss. BOS and EOS are special tokens standing for the start and the end of the output.

C Hyper-parameter Search
We empirically searched the hyper-parameters α and β in {1, 10, 20}, i.e., 9 combinations, for Ins. Due to limited computation resources and the large search space, we trained the model with each combination once, selected the best 3 combinations, and repeated those experiments with different random seeds to determine the final choice of α and β according to the performance on D_va. Finally, we set (α, β) to (1, 10), (1, 10) and (1, 1) for dialogue summarization, question generation and reading comprehension respectively, and directly borrow these settings for FreIns. In Fig. 7, we show the performance of Ins under different combinations for dialogue summarization on the vanilla test set with a single run. All of the results outperform the baselines in Table 2, and the standard deviation of BertScore among combinations is only 0.14%, showing the stable improvements of Ins over the baselines.

D Additional Results of Sensitivity on an Individual Speaker
Results for sensitivity on an individual speaker on all three tasks are in Table 7 and Table 8. Both tables lead to the same observations and conclusions as discussed in Sec. 5.1 and Sec. 5.2, where Ins and FreIns perform best among offline and online approaches respectively.

Figure 8 (dialogue summarization case study): generated and reference summaries for two versions of the same dialogue with different speaker names.

Dialogue-1 Summary: Toyna will come for Easter. Marshe will invite Louise. Toyna will bring eggs.
Dialogue-2 Summary: Remeisha invited Ethie and Louise for Easter. Ethie will bring eggs and chocolat ones.

Dialogue-1 Summary: Toyna is off on Friday. Marshe will invite her and Louise for Easter. Toyna will bring eggs.
Dialogue-2 Summary: Remeisha will invite Ethie and Louise for Easter on Friday. Ethie will bring eggs.

Dialogue-1 Summary: Marshe will invite Toyna and Louise for Easter. Toyna will bring eggs.
Dialogue-2 Summary: Remeisha will invite Ethie and Louise for Easter. Ethie will bring eggs.

Reference Summary: Marshe(Remeisha) is inviting Toyna(Ethie) for Easter. Toyna(Ethie) will bring some chocolate eggs.

E Case study
We show cases for the different tasks in this section. The case for dialogue summarization is in Fig. 8. Vanilla extracts different information for the two sets of names: "She will bring eggs" and "Ethie is off on Friday". It also uses different expressions: "will come to ... for Easter" and "invited ... for Easter". Besides, "Louise" is only mentioned in the second summary. Emb shows both the information difference and the expression difference. Meanwhile, it outputs incorrect content in the second summary, where "chocolat ones" is used to describe "eggs" in the input dialogue. Aug outputs more information for the first set of names. Ins treats the two sets of names equally, with the same generations modulo the speaker names.
In the case of question generation in Fig. 9, all baselines generate "who gives Jernee suggestions?" for the second set of names, which is an inaccurate question with multiple candidate answers. Emb also generates a "Who" with the capitalized first