He Said, She Said: Style Transfer for Shifting the Perspective of Dialogues

In this work, we define a new style transfer task: perspective shift, which reframes a dialogue from informal first person to a formal third person rephrasing of the text. This task requires challenging coreference resolution, emotion attribution, and interpretation of informal text. We explore several baseline approaches and discuss further directions on this task when applied to short dialogues. As a sample application, we demonstrate that applying perspective shifting to a dialogue summarization dataset (SAMSum) substantially improves the zero-shot performance of extractive news summarization models on this data. Additionally, supervised extractive models perform better when trained on perspective shifted data than on the original dialogues. We release our code publicly.


Introduction
Style transfer models change surface attributes of text while preserving the content. Previous work on style transfer has focused on controlling the formality, authorial style, and sentiment of text (Jin et al., 2022). We propose a new style transfer task: perspective shift from dialogue to 3rd person conversational accounts ( §2). In this task, we seek to convert an informal 1st person transcription of a dialogue into a 3rd person rephrasing of the conversation, where each line captures the information of a single utterance with relevant contextualizing information added. Table 1 demonstrates an example conversation and its perspective shifted version.
This task is challenging because it requires the interpretation of many discourse phenomena. In dialogue, speakers commonly use 1st and 2nd person pronouns and casual speech. Speakers also convey their own emotions and opinions in their speech. Converting a multi-party conversation to a single-perspective rephrasing requires pronoun resolution, formalization, and attribution of emotion/stance markers to individuals. While coreference resolution, stance detection, and formalization are often treated as separate tasks, the signal for these objectives is commingled in the dialogues. A pipeline approach would discard information necessary for any one task in the completion of the other two.
We create a dataset for this task by annotating dialogues from the SAMSum corpus (Gliwa et al., 2019), a dialogue summarization corpus of synthetic text message conversations ( §3). For each conversation, annotators rephrase the utterances line-by-line into one or more sentences in 3rd person. Unlike a summary, which condenses information to highlight the most important points, the goal of this transformation is to capture as much of the information from the original utterance as possible in a more standardized form.
We fine-tune BART on this dataset as a supervised baseline under several different problem formulations, and we experiment with incorporating formality data into the training process ( §4). As a motivating use case, we demonstrate that extractive summarization over perspective-shifted dialogue is more fluent and has higher ROUGE scores than extractive summarization over the original dialogues ( §5). This trend holds for zero-shot performance of extractive summarization models trained on news corpora and for fully supervised training on model-generated perspective shift data.
Perspective shift can be a useful operation for extractive summarization when annotation time is limited; when additional data from out-of-domain is available; when the exact length and content of the summary is not known at annotation time; or when high faithfulness is important to the end task, but fluency is also a concern ( §5.3).


Task definition

Given a dialogue and a selected utterance, the goal of the task is to rewrite that utterance as a formal third person statement. Four operations are required to accomplish this change: coreference resolution, syntactic rewriting, formalization, and emotion attribution. Table 1 shows an example conversation and perspective shift, demonstrating each of these challenges.
First-person singular and second-person pronouns are usually easily resolved in a conversational context: first-person singular refers to the speaker, while second-person pronouns generally refer to the other conversational parties. Plural first-person pronouns can be less obvious to resolve. When a party in a conversation uses the pronoun "we," this plural may be referring to the other parties in the conversation, some but not all of the parties in the conversation, or a party not present in the conversation, e.g. in the utterance "I need to talk to my husband. We might have other plans." In our hand-annotated dataset, we resolve these pronouns wherever possible; if it is not clear what group the pronoun refers to, we resolve the pronoun as referring to "<the current speaker> and others," e.g. "Laura: we are busy" becomes "Laura and others are busy". Other entities in the text may also be difficult to resolve, such as those defined only at the beginning of the conversation, many turns prior to the current reference.
Syntactic rewriting is the problem of converting the syntax of the utterance to reflect 3rd rather than 1st person. This may involve re-conjugating verbs, e.g. converting "Sam: I am busy" to "Sam is busy." Formalization and emotion attribution are related problems, as much of the emotion and stance information in the text is contained in informal phrases, unconventional punctuation, and emojis (Tagg, 2016). Typical formalization eliminates these markers without replacement (Rao and Tetreault, 2018). However, this makes formalization a highly lossy conversion, which may be undesirable for downstream tasks. We aim to limit the information lost in the perspective shift operation by encoding the meanings of such informal language in the output. Often this takes the form of an adverb (e.g. "Sam angrily says") or a short descriptive sentence (e.g. "Cam is amused"). This requires interpretation of the informal elements of the text.
Clearly, this task is far more complex than simply swapping pronouns for speaker names. We curate a dataset for the perspective shift operation.

Dataset creation
The dataset is an annotated subset of the SAMSum (Gliwa et al., 2019) dataset for dialogue summarization. SAMSum is a dataset of simulated text message conversations, ranging from 3 to 30 lines in length and with between 2 and 20 speakers. The dataset consists of 314 conversations from the train set, 368 conversations from the validation set, and 151 conversations from the test set. We set aside the 151 conversations from test as a test split and use the other 682 conversations as training and validation data.
Annotators were instructed to convert each utterance individually to a formal 3rd person rephrasing, while preserving as much of the tone of the utterance as possible. Annotators were required to insert the speaker's name in each rewritten utterance and remove all 1st-person pronouns. Annotators were also asked to standardize grammar, remove questions, and add additional context (e.g. descriptive adverbs) to convey emotions previously expressed by emoticons. Further information about annotator selection and pay, as well as a full copy of the annotation instructions, is available in Appendix D.

Dataset statistics
The perspective shifted conversations differ from the original in several ways. The number of turns in each conversation is preserved, but the average turn length varies: for the perspective shifts, the mean number of words per turn is 11.0, while the mean for the original dialogues is 8.4. (Note that the simplest heuristic would increase each utterance's word count by 1, as the colon next to the speaker name is swapped out with the word "says").
The average word-wise edit distance between original and perspective-shifted utterances is 8.5 words. This is partially due to the insertion of a dialogue tag (e.g. "says") in each utterance, the removal of emojis (average 0.1 per utterance), and the resolving of first and second person pronouns (average 0.9 per utterance). The part-of-speech distribution of the conversations also changes, with a strong (65.8%) decrease in interjections and a slight (5.1%) decrease in adjectives and adverbs. However, in utterances that contain at least one emoji, the number of adjectives and adverbs present increases 12.8%. This is consistent with the annotation guidelines, which instruct annotators to capture the meaning of informal markers such as emoji with descriptors.
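The word-wise edit distance reported above is a word-level Levenshtein distance. As a minimal sketch of the computation (our own illustration, not the authors' released code):

```python
def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two utterances, counted in words."""
    aw, bw = a.split(), b.split()
    # dp[j] holds the distance between aw[:i] and bw[:j] for the current row i.
    dp = list(range(len(bw) + 1))
    for i in range(1, len(aw) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(bw) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # delete aw[i-1]
                        dp[j - 1] + 1,                      # insert bw[j-1]
                        prev + (aw[i - 1] != bw[j - 1]))    # substitute or match
            prev = cur
    return dp[-1]
```

Averaging this quantity over aligned utterance pairs yields the 8.5-word figure reported for the dataset.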

Formulation of the Prediction Problem
Methods We consider several formulations of the perspective shifting task as a prediction problem with different input and output styles. Below, the first three approaches formulate the problem as a line-by-line task: each input example consists of the full conversation with one utterance designated as the utterance to be perspective shifted. The fourth approach below formulates the problem as a conversation-level task in which the entire conversation is perspective shifted at once.
1. no context: The input to the model is the utterance u_t, and the output is the perspective shifted version, y_t.
2. left context only: The input is the dialogue up to and including utterance u_t, and the output is the perspective shifted version, y_t. A [SEP] token delimits the left context, u_1, ..., u_{t-1}, from the utterance u_t.
3. left and right context: The input is the full conversation, with [SEP] tokens around the utterance u_t, and the output is the perspective shifted version, y_t.

4. conversation-level: The input is a complete dialogue u_1, ..., u_T, and the output is a complete perspective shift y_1, ..., y_T.
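As a concrete illustration, the four input formats might be serialized as follows. The exact delimiter placement and the names `build_input` and `SEP` are our assumptions for illustration, not the authors' implementation:

```python
SEP = "[SEP]"

def build_input(utterances, t, mode):
    """Serialize a dialogue into a single model input string for the
    utterance at index t, under each of the four formulations."""
    if mode == "no_context":
        return utterances[t]
    if mode == "left_context":
        # left context u_1..u_{t-1}, then [SEP], then the target utterance
        return " ".join(utterances[:t] + [SEP, utterances[t]])
    if mode == "left_right_context":
        # full conversation, with [SEP] tokens marking the target utterance
        return " ".join(utterances[:t] + [SEP, utterances[t], SEP] + utterances[t + 1:])
    if mode == "conversation_level":
        return " ".join(utterances)
    raise ValueError(f"unknown mode: {mode}")
```

In the first three modes the target output is the single shifted utterance y_t; in the last mode it is the full shifted conversation.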
For each formulation, we finetune a BART-large (Lewis et al., 2020) model for 15 epochs, using early stopping, an effective batch size of 8, and a learning rate of 5e-5.
Results ROUGE 1/2/L scores and BARTScore for each model are listed in Table 2. The no context model treats this as a purely utterance-level task, but fully precludes the addition of context from other utterances. This means that second-person and first-person plural pronouns cannot be resolved clearly. While this model scores quite highly on all 4 metrics, we observe a high rate of named entity hallucination in the converted outputs. For instance, for the input utterance "Hannah: Hey, do you have Betty's number?", the no context model outputs "Hannah asks John if he has Betty's number." However, the other conversational partner in this dialogue is "Amanda," not "John." Because the gold perspective shifts were annotated with the full conversation available for reference, this model often hallucinates to fill in named entity slots that it does not have the context to resolve.
By contrast, the conversation-level model has the clear advantage of referencing the entire conversation at generation time. However, the model is not required to produce the same number of lines as the input and must learn this property during training. We conjecture that this is the reason for its relatively weak performance compared to the left and right context model. Additionally, if the model generates more or fewer lines than the input dialogue, this can be a confounding factor in the extractive summarization application we discuss in Section 5. If the model generates fewer lines than the input, it has performed some part of the summarization process by abstracting the input into a shorter output; if it generates more lines than the input, it has produced a harder problem for the extractive summarization system by creating more lines to choose the summary from. Because of this model's weaker performance and this confounding factor, we restrict our remaining experiments in this paper to models that perspective shift one utterance at a time.
The model with left context only mimics how a human might read the conversation for the first time, from top to bottom. This choice also imposes the constraint that the output has the same number of lines as the input, as desired. However, the dialogues frequently contain cataphora, especially at the start of conversations, where the first speaker may be addressing a second speaker who has not yet spoken. For instance, the utterance "Hannah: Hey, do you have Betty's number?" is the first line of its dialogue; a model with only left context cannot resolve the word "you" here any better than the no context model.
The left and right context model addresses this concern by providing the full conversation as input, but restricting the output generation to a perspective shift for a single (marked) utterance. This imposes the output length constraint without sacrificing contextual information. This model performs best on all 4 metrics. As the scores for the left and right context and no context models are relatively close, we conduct a human evaluation comparing these two cases. In our blind comparison of 22 conversations, the left and right context model was preferred over the no context model 86% of the time (2 annotators, Cohen's kappa 0.62).
The conversation-level model may be a good choice for some applications, where output length is less important to the downstream task. This model has a higher degree of abstractiveness, which can lead to increased fluency but also increased hallucination. For tasks where this is a concern, the left and right context model achieves reasonable fluency while adhering more closely to the task, as measured by the automatic metrics.

Formality and Perspective Shift
Approaches We observe that the perspective shifting task requires a high degree of formalization. To better understand the role of formalization in perspective shifting, we consider several models, ranging from simple rule-based approaches to those relying on an external formalization dataset. The external dataset we consider is the Grammarly Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault, 2018): a dataset of approximately 100,000 lines from Yahoo Answers and formal rephrasings of each line.
Our core method is the BART model trained under the left and right context formulation (PS ONLY).
We also consider a heuristic baseline (RULES-BASED HEURISTIC).
For each message, we prepend the speaker's name and the word "says" to the utterance. We replace each instance of the pronoun "I" in the message with the speaker's name. After observing that most messages are not well-punctuated, we also append a period to the end of each utterance. While this heuristic is simple and ignores many pronoun resolution conflicts, it has the clear advantage of being highly efficient.
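These three rules can be implemented in a few lines. The sketch below is our reconstruction from the description above; it does not handle contractions like "I've" or possessives, which a fuller heuristic might treat separately:

```python
import re

def heuristic_shift(line: str) -> str:
    """Rules-based heuristic: prepend '<speaker> says', replace standalone
    'I' with the speaker's name, and ensure final punctuation."""
    speaker, message = line.split(":", 1)
    speaker, message = speaker.strip(), message.strip()
    # Swap the first-person pronoun for the speaker's name.
    message = re.sub(r"\bI\b", speaker, message)
    # Most messages are not well-punctuated; append a period if needed.
    if not message.endswith((".", "!", "?")):
        message += "."
    return f"{speaker} says {message}"
```

Note that the output is often ungrammatical (e.g. "Sam says Sam am busy."), since the heuristic performs no verb re-conjugation; this is consistent with the weaknesses discussed in the results below.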
We incorporate the GYAFC corpus as part of our training regime by finetuning on the formalization task prior to finetuning on perspective shift (FORMALITY + PS). Finally, we perform an ablation by finetuning BART for formalization on the GYAFC corpus, then attempting zero-shot transfer to the perspective shifting task. As input for this model at test time, we provide either the original dialogues (FORMALITY ONLY) or the output of the rules-based heuristic (HEURISTIC + FORMALITY).

Results
We evaluate each approach on ROUGE 1/2/L (Lin, 2004) and BARTScore (Yuan et al., 2021). The scores are in Table 3, and example outputs are in Table 4.
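For intuition, ROUGE-1 F1 reduces to clipped unigram-overlap precision and recall. A simplified sketch follows (the reported scores use the standard ROUGE implementation, not this toy version):

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a candidate and a reference."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same precision/recall structure over bigrams and longest common subsequences, respectively.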
At first glance, perspective shift is a task closely related to formalization. However, the addition of formalization data leads to a slight decrease in model performance. This may be due to the formality data biasing the model toward minimal rephrasings, as there is generally relatively low edit distance between the informal and formal sentences in the formality corpus used (Rao and Tetreault, 2018). However, for high performance on perspective shift, the addition of clarifying words, emotion attributions, and pronoun substitutions is necessary; these are high-edit-distance operations that are not observed frequently in the formality data.
Formalization without any additional training for perspective shift is, as expected, far weaker than the perspective-shift-only model.
The rules-based heuristic appears competitive in ROUGE, but both the BARTScore scores and a manual inspection of the output reveal that this approach is lacking.
In the next section, we explore a downstream task: extractive summarization. For all extractive summarization experiments, we use model-generated perspective shift data from the perspective-shift-only model. We train a model on only the validation-set PS data to generate perspective shifts for the train set of SAMSum, and we train a model on only the train-set PS data to generate perspective shifts for the validation set of SAMSum.

Application: Extractive summarization
In the extractive summarization setting, phrases or sentences are taken directly from the input and composed into a summary. This is a clear failure case for dialogue, where sentences in the input are in first person and often pose questions or corrections to previous utterances; knowledge of other speakers in the dialogue can be necessary to contextualize the information. Summaries should present an overview of a conversation that incorporates global contextual information; generally, these summaries are also expected to be in third person.
Extraction over a perspective-shifted dialogue does not suffer from many of the same problems as extraction over an original dialogue. The text in a perspective shifted dialogue is in formal third person, which matches the desired style of the summary text. While individual sentences of the perspective shifted dialogue correspond directly to individual utterances in the dialogue, the coreference resolution involved in the perspective shift step means that these sentences are less interdependent than the dialogue turns. In many respects, perspective shift should make the task of dialogue summarization easier.

Oracle Extraction after Perspective Shift
Methods This intuitive result is confirmed by the performance of an oracle extractive model. Given both the input and the summary, the oracle model is tasked with choosing a combination of k utterances from the input to maximize ROUGE. Table 5 shows the performance of an oracle extractive model over the original SAMSum dialogues and the perspective shifted versions. For comparison, a simple extractive baseline (choosing the three longest utterances) and a strong abstractive model are also reported.
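The oracle amounts to brute-force search over utterance subsets. A sketch, where `score` is any summary metric such as ROUGE (`oracle_extract` is our illustrative name; in practice the search may need to be approximated for long dialogues, since it is exponential in k):

```python
from itertools import combinations

def oracle_extract(utterances, summary, k, score):
    """Choose the k utterances whose concatenation maximizes
    score(candidate, gold_summary), preserving dialogue order."""
    return list(max(combinations(utterances, k),
                    key=lambda combo: score(" ".join(combo), summary)))
```

Because the oracle sees the gold summary, its score is an upper bound on what any extractive system with the same candidate pool could achieve.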
Results Clearly, the potential (best-case) performance of a model over the perspective shifted dialogues is better; the oracle scores over perspective shifted dialogues even approach the scores of the abstractive model.

Zero Shot and Supervised Extractive Summarization
Table 5: Performance of oracle extractive models as compared to the best extractive baseline from the SAMSum paper (longest-3) and a competitive abstractive system (BART-large, averaged over 5 random restarts).

Train/Test Regimes A common summarization domain is news articles, due to the relatively wide availability of data. We use an extractive summarizer trained on the CNN/DM news summarization corpus to perform zero-shot summarization over the original SAMSum dialogues and over a perspective-shifted version of the dialogues. We also consider the fully supervised case; we train models using the PreSumm architecture for extraction over the original SAMSum dialogues and over the perspective-shifted dialogues.
Results Results across all models are in Table 6. The zero-shot model scores higher than the supervised model for SAMSum, which at first appears unintuitive. We credit this to two factors. First, the CNN/DM training set contains approximately 21 times as many training examples as the SAMSum train set, allowing the model increased generalizability to an unseen test set. Second, the summaries in the CNN/DM dataset are often several sentences long, while the summaries in the SAMSum dataset tend to be a single sentence. The CNN/DM model's bias toward longer summary length may artificially inflate ROUGE scores, as the model selects more utterances for the output. Despite these factors, the supervised model over perspective shifted data outperforms the zero-shot model over the same data. Perspective shift is useful as an operation to bring the dialogue domain closer to the news domain, which drastically improves zero-shot transfer.
The zero-shot model over perspective shifted data performs better than the fully supervised model trained over the original dialogues. In a low-data setting, where annotating the entire dataset for summarization may be cost-prohibitive, perspective shift can serve as an alternative annotation goal. The perspective shift model used to generate the test data in Table 6 was trained on 545 dialogues (with a validation set of 137 dialogues); by contrast, annotating the entire train and validation sets for summarization would require annotating 15,550 conversations, a more than 20-fold increase in annotation effort.

Analysis of Hallucination
One of the oft-cited benefits of extractive summarization is that models that copy text directly from the input are less likely to present factually incorrect summaries (Ladhak et al., 2022). Perspective shifting, however, introduces a rephrasing step into the summarization pipeline. A natural concern is the potential for "cascading errors," where errors in the perspective shifting process lead to hallucinatory extractive summaries. We randomly select 100 conversations and associated summaries from the perspective-shift-then-extract model and a standard abstractive finetuned BART model. We then ask 2 annotators to label each summary for faithfulness, ranking the summary -1 if it describes information that contradicts the conversation, 0 if it contains information that cannot be verified or falsified by the conversation, and 1 if all information stated in the summary is derived from the conversation. Cohen's kappa between these two annotators was 0.49, with annotators disagreeing on 12.6% of summaries. For cases where the annotator scores differ, we ask a 3rd annotator to label the conversation and choose the majority opinion. Results of this evaluation are in Table 7. While perspective shift introduces some hallucinations into the dataset, the rate of hallucination is far lower than for abstractive models.
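Annotator agreement here is measured with Cohen's kappa; for completeness, a minimal sketch of the standard computation (not tied to any particular library):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap from each annotator's label marginals.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[l] * cb[l] for l in ca.keys() | cb.keys()) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa of 0.49 with three-way labels, as reported above, indicates moderate agreement after correcting for chance.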
In the 100 randomly selected conversations, we observe 5 hallucinations introduced by the perspective shifting operation that influence the downstream summaries. In the same conversational sample, 22 summaries from the abstractive model contain hallucinations, commonly in the form of incorrectly attributing actions to entities or negating the implications of the original conversations. Here, we define a hallucination as a statement that is not verified by the source text. Some hallucinations directly contradict the source material (contradictions); there are 3 such contradictions in the extractive summaries and 18 in the abstractive summaries.

Fluency
Extractions from text message dialogues are not normally conducive to forming a fluent summary. Each message has its own speaker who may use first person pronouns. Additionally, messages often contain slang or emojis, which are not appropriate for a formal summary. Perspective shifted dialogues are more formally written and describe the conversation from a single frame of reference.
To compare the fluency of extraction from original dialogues and perspective shifted dialogues, we calculate the perplexity of the output summaries for each model. We measure perplexity using GPT-2 (Radford et al., 2019), which is not used to generate any of the outputs. The extractions from the perspective shift dialogues have an average perplexity of 31.07, while the extractions from the original dialogues have an average perplexity of 48.77. Example outputs from each model are in Appendix B.
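Given a language model's per-token log-probabilities for a summary, perplexity is the exponentiated negative mean. A sketch of the final step (the per-token scores themselves would come from GPT-2 via a library such as HuggingFace transformers, which we omit here):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum(log p_i)). Lower means the text is
    more predictable (roughly, more fluent) under the LM."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Under this measure, the gap between 31.07 and 48.77 reflects that GPT-2 finds the perspective-shifted extractions substantially more predictable than extractions from the raw dialogues.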
Similarly to extract-then-abstract systems, perspective shifting represents a compromise between the strong faithfulness of extraction and the improved fluency of abstraction.

Discussion
Another possible application of perspective shift for summarization is in query-specific summarization, where there is not a single canonical summary at training time. Instead, a relevant span is selected and summarized based on a user query. Query-specific summarization has been applied to dialogue-based domains, such as meeting summarization (Zhong et al., 2021). In these domains, we conjecture perspective shift may make the choice of an extractive summarization model feasible, allowing for greater interpretability and faithfulness of outputs.
Perspective shift also appears to be a less effortful task for annotators than summarization. We ask a crowdworker to perform perspective shift and summarization annotation for 5 hours each over different sets of dialogues. The annotator gave this unsolicited feedback: "[Summarization] is a completely different task in that it takes a lot more mental capacity, paraphrasing complete conversations into a concise synopsis. I need to take a break!" This annotator was able to summarize conversations at a faster hourly rate than perspective shifting, but reported that the perspective shift task was more enjoyable.
We discuss perspective shift for different dialogue subdomains briefly in Appendix C.

Related Work
Style Transfer The most similar style transfer task is formalization, which has attracted attention as a standardization strategy for noisy user-generated text. Formalization can be performed as a supervised learning task, and supervised approaches often use the parallel sentence pairs from the Grammarly Yahoo Answers Formality Corpus (Rao and Tetreault, 2018). More commonly, however, formalization is performed as a semi-supervised task (Chawla and Yang, 2020). Another related style transfer task is the 3rd to 1st person rephrasing task proposed by Granero Moya and Oikonomou Filandras (2021). This task is evaluated with exact-match accuracy, and their best model achieves 92.8% accuracy on the test set. We conjecture perspective shift is a more difficult task because of its many-to-one nature, as well as the additional emotion attribution and formalization required.
Speaking-style transformation Speaking-style transformation is a task which seeks to transform a literal transcription of spoken speech into one that omits disfluencies, filler words, repetitions, and other characteristics of speech that are undesirable in written text. This task attracted notice particularly in the statistical machine translation community (Neubig et al., 2020). It differs from perspective shift in several respects: the focus of speaking-style transformation is on removing disfluencies, whereas perspective shift aims to preserve information that may be conveyed by the informal style of the text; perspective shift requires complex coreference resolution and utterance contextualization, while speaking-style transformation leaves references unresolved; and perspective shift is applied to text post-hoc, while speaking-style transformation may be performed over transcripts or in an online setting, during speech transcription.

Dialogue summarization

Prior work (2021) extracts conversational structure from several views to feed into a multi-view decoder. Another approach to modeling the differences between dialogues and well-structured text is to use auxiliary tasks during training (Liu et al., 2021a). In work concurrent with this paper, Fang et al. (2022) propose a narrower utterance rewriting task for dialogue summarization, swapping some pronouns in the text for speaker names; however, this task does not allow for full rephrasings of the text or produce output in third person, making it unsuitable for extractive summarization.
Domain adaptation for summarization Another popular direction for dialogue summarization is domain adaptation to dialogue, primarily by pretraining models on additional dialogue data. Khalifa et al. (2021) pretrain BART on informal text before training on SAMSum, observing improvement when pretraining on dialogue corpora but not when training on Reddit comments. Yu et al. (2021) study the effectiveness of adding an additional phase of pretraining to improve domain adaptation, in which they either train on a news summarization task, continue pretraining (using the standard reconstruction loss) on an in-domain dataset, or continue pretraining on a smaller dataset of unlabeled input dialogues from the training set. Zou et al. (2021) pretrain an encoder on dialogue and a decoder on summary text separately before training the two together on a summarization objective. While these approaches improve performance on dialogue summarization, particularly in a lower-resource setting, they largely require pretraining at a large computational cost.
Conclusion

Perspective shift is a new, non-trivial style transfer task that requires incorporation of coreference resolution, formalization, and emotion attribution. This paper presents a preliminary dataset for this task that includes interpretation of the meanings of conventional text abbreviations, emojis, and emoticons. The baselines presented in this paper are sufficient for downstream performance on a summarization task, but may be further improved by modeling the unique challenges of this task.
In addition to being a challenging task, perspective shift is a useful operation for dialogue summarization. Perspective shift can act as a tool for domain adaptation by shifting dialogue into a form more similar to common summarization domains (e.g. news). For extractive systems, dialogue summarization is largely infeasible because outputs will not be fluent. Perspective shift allows for fluent extractive summaries. This differs from a more traditional extract-then-abstract approach because the "abstraction" (perspective shifting) step can benefit from the full document context. In a domain such as dialogue, where many utterances are strongly conditioned on the prior context of the conversation, this allows for more faithful rephrasings. When coupled with an extractive system, this perspective-shifting-based paradigm allows for the creation of more interpretable, less hallucinatory summarizations when compared to an abstractive model.
Other potential applications of perspective shift include direct application in abstractive summarization; in related tasks such as key point analysis (Bar-Haim et al., 2020), which often rely on dialogues as inputs; and for summarization of texts which contain partial dialogues, such as novels. The general strategy of transforming the input to adapt to a new domain rather than changing the model or pretraining paradigm is a promising direction because of the ease of annotation and relatively low computational cost.

Limitations
Perspective shift requires the modeling of informal language, a challenging task. The meaning of informal language can vary across communities (Jørgensen et al., 2015), age groups (Rickford and Price, 2013), and time (Jin et al., 2021), making generalization of these results more difficult. This is also an inherently lossy conversion; though we take steps to minimize the loss of emotion and stance information, the nuances of this information may still be discarded.
The perspective shift process also discards most of the discourse information available in the original dialogue. By performing perspective shift prior to dialogue summarization, we take a simplistic view of dialogue as a linear collection of first-person statements without considering underlying structure. While this approach proved effective, we believe that the best possible performance on this task may be constrained by this simplifying assumption.

References
Crystal asks if Irea really wants to do that. Irene: why nont? I'm his aunt! Irene asks why not, she's his aunt. Crystal: well yeah it's just such a drag Yeah, Crystal says, it is a drag. Irene: you were always a bore when shopping :P just let me take the little man Irene tells Crystal that she was always a bore when shopping, and that she should just let her take the little man. Irene: well have fun! Irene tells her to have fun. Crystal: ok Crystal agrees.  While we present the perspective shift task using text message conversations as an example, there are a wide variety of subdomains within dialogue. We apply the perspective shift operation to two other domains-roleplaying game transcripts and media interviews-using the model trained only on data from the text message conversation domain. While the model effectively perspective shifts most short utterances, the largest issues we observed in inspection of these outputs are as follows: 1. Long utterances: The perspective shift model performs poorly when utterances are very lengthy, as this is very uncommon in the SAMSum dataset (average utterance length: 8.4 words). This leads to repetition and denigrated performance, especially when several long utterances occur in sequence.
2. Domain differences in formatting: Differences such as multi-word speaker names or adding sound effects in parentheses are not captured effectively by the model, as they were not encountered at training time.
While we leave improving perspective shift over long utterances to future work, we provide examples of perspective shifts from two different domains to demonstrate these potential pitfalls for other researchers. These are model-generated perspective shifts, produced by the model trained only on perspective shift for SAMSum dialogues.
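Since the first pitfall correlates with utterance length, a simple screen before applying the model is to compare turn lengths against the SAMSum average (8.4 words). A minimal sketch follows; the 30-word threshold is an arbitrary illustrative choice, not a value from our experiments.

```python
def mean_utterance_length(turns):
    """Mean whitespace-token length over (speaker, text) turns."""
    lengths = [len(text.split()) for _, text in turns]
    return sum(lengths) / len(lengths)

def flag_long_utterances(turns, max_words=30):
    """Indices of turns long enough to risk repetitious model output."""
    return [i for i, (_, text) in enumerate(turns)
            if len(text.split()) > max_words]

turns = [("MATT", "Okay. Give me the specifications on that one."),
         ("TALIESIN", "Six seconds.")]
avg = mean_utterance_length(turns)  # (8 + 2) / 2 = 5.0 tokens
```

Dialogues with many flagged turns (e.g., most MediaSum transcripts) would likely need in-domain training data or chunking before perspective shifting.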

C.1 CRD3
CRD3 is a dataset of Dungeons & Dragons roleplaying game transcripts (Rameshkumar and Bailey, 2020). Dungeons & Dragons is a collaborative roleplaying game where multiple players describe the actions and dialogue of their characters as the team explores an open-ended world. While each session of the game consists of several thousand turns of dialogue, the CRD3 dataset sections the sessions into smaller chunks with aligned summaries. For brevity, we present only a chunk of a session in Table 12. The SAMSum perspective shift model serves as a reasonable baseline for this dataset, though in-domain data would likely further improve performance.

C.2 MediaSum
MediaSum is a dataset of NPR and CNN media interview transcripts (Zhu et al., 2021). The average turn length in this dataset is substantially longer: 37.5 words for the NPR transcripts and 53.1 words for the CNN transcripts. Correspondingly, the model-generated perspective shift is worse, with the model producing repetitious content. The model also performs poorly on multi-word speaker names, which are rare in SAMSum as well. A snippet of an interview appears in Table 13.

Table 12: Excerpt of a CRD3 transcript (Original) and its model-generated perspective shift.

Original: MATT: Okay. You take your first step and you watch something drift across the entrance from wall to wall. Some faint glowing figure-and is gone.
Perspective shifted: Matt says that when you take your first step and you watch something drift across the entrance from wall to wall, some faint glowing figure-and

Perspective shifted: Talesin tells her that he is going to do another one.

Original: MATT: Okay. Give me the specifications on that one.
Perspective shifted: Matt agrees and asks for the specifications.

Original: TALIESIN: All right. Eyes of The Grave. As an action you know the location of any undead within 60 feet of you that isn't behind total cover and isn't protected from divination magic until the end of your next turn.
Perspective shifted: Talyiesin agrees and mentions that it is the Eyes of the Grave, and that as an action, it can alert the location of any undead within 60 feet or less that isn't covered by total cover and isn't protected from divination magic until the end of the next turn.

Original: MATT: Okay, got you. How long does it last?
Perspective shifted: Matt tells her that he got her.

Original: TALIESIN: Six seconds.
Perspective shifted: Talesin says it takes six seconds.

Original: MATT: Okay. We'll say for the purposes of this, this is a reaction to seeing this figure pass by. You definitely get an undead sense from whatever this figure is, and then it merges with the wall and is gone. It appeared to be loosely humanoid in the brief glimpse you saw.
Perspective shifted: Matt tells her that they will say that for the purposes of this, it is a reaction to seeing the figure pass by, and that you definitely get an undead sense from whatever the figure is, and then it merges with the wall and is gone.

Original: TALIESIN: We have undead.
Perspective shifted: Talyiesin says they have undead.

Original: TRAVIS: No.
Perspective shifted: Trevor says no.

Original: SAM: It's a ghost ship!
Perspective shifted: Sam says it is a ghost ship.

Original: MARISHA: Is it going to be your crew?
Perspective shifted: Marisha asks if it is going to be Ducey's crew.

Original: TRAVIS: I don't know!
Perspective shifted: Trevor doesn't know.

Original: MARISHA: Wait, we're not on your ship anymore.
Perspective shifted: Marisha asks them to wait. She tells them that they are not on Ducey's ship anymore.

Original: TRAVIS: Now we're in the mouth of a beast! Ducey, what's going on, brother?
Perspective shifted: Trevor tells her that they are now in the mouth of a beast, and asks Ducey what is happening.

Original: TALIESIN: I don't know, I'm literally blind. I want to light this-
Perspective shifted: Talesin doesn't know, she is literally blind.

Table 13: Excerpt of a MediaSum interview (Original) and its model-generated perspective shift.

Original: MADELELEINE BRAND, host: OK, here's some good news on the jobs front for both men and women. A new survey out today from the employment firm Manpower finds that about a quarter of employers will add jobs this summer. That's for adults, but for teenagers this summer's job market is shaping up to be the weakest in more than 50 years.
Perspective shifted: MADELEINE says that there is some good news on the job market for both men and women. A new survey out today from the employment firm Manpower finds that about a quarter of employers will add jobs to the workforce this summer, but for teenagers, the summer job market is the weakest in more than 50 years.

Original: ALEX COHEN, host: So, how do you get your teenage kids not to spend the entire summer glued to the couch? You're about to get some tips from Michelle Singletary. She's Day to Day's personal finance contributor. Hi, Michelle!
Perspective shifted: Alexo Cohen, host, tells Michelle that the types of jobs that teens tend to go for in retail, fast food, and the like, that they still need people. He says that oftentimes, listen, you may not get a job at the beginning of the summer, but hold on because in late summer, when some college students are going back to school and perhaps some of the people who lost their

Original: MICHELLE SINGLETARY: Hi!
Perspective shifted: MICHelle says hello.

Original: ALEX COHEN, host: So why is the summer job market so hard for teens this year?
Perspective shifted: Alex Cooper says that the types of jobs that teens tend to go for in retail, fast food, and the like, they still need people. He tells Michelle that they've got a tough economy and that they have a lot of college graduates going into the job market and taking jobs that would traditionally go to teens, like in restaurants and retailers, but that for teenagers, the summer job market is the weakest in more than

Original: MICHELLE SINGLETARY: Lot of things going on right now. We've got a tough economy. We've got a lot of college graduates going into the market. We have people who are losing their jobs and taking jobs that would traditionally go to teens, like in restaurants and retailers. And we have a lot of older people holding on to their jobs and not retiring because they can't afford to retire. And that puts teens at the end of the line when it comes to these types of jobs.
Perspective shifted: MICHelle says that there are a lot of things going on in the job market right now, and that they have a tough economy. She says that they've got a number of college graduates going into the workforce and taking jobs that would traditionally go to teens, like in restaurants and the like, but that for teenagers, the summer job market is the weakest in more than 50 years.