Conversation Style Transfer using Few-Shot Learning

Conventional text style transfer approaches focus on sentence-level style transfer without considering contextual information, and the style is described with attributes (e.g., formality). When applying style transfer in conversations such as task-oriented dialogues, existing approaches suffer from these limitations as context can play an important role and the style attributes are often difficult to define in conversations. In this paper, we introduce conversation style transfer as a few-shot learning problem, where the model learns to perform style transfer by observing only a few example dialogues in the target style. We propose a novel in-context learning approach to solve the task with style-free dialogues as a pivot. Human evaluation shows that by incorporating multi-turn context, the model is able to match the target style while having better appropriateness and semantic correctness compared to utterance/sentence-level style transfer. Additionally, we show that conversation style transfer can also benefit downstream tasks. For example, in multi-domain intent classification tasks, the F1 scores improve after transferring the style of training data to match the style of the test data.


Introduction
Recent advances in neural dialogue models (Gao et al., 2018; Zhang et al., 2020; Ni et al., 2022) enabled the handling of complex conversational scenarios. However, one key challenge that still remains in conversational AI is to obtain the desired conversation style. Conversations are dynamic in nature, and the style requirement of utterances in a conversation depends on many factors including the domain (e.g., banking vs. restaurant), the situation (e.g., conversation with someone angry vs. depressed), and the speaker demographics (e.g., senior vs. youngster), among others, making style transfer of the whole conversation more challenging compared to style transfer of a single utterance. Existing studies on Text Style Transfer (TST) focused on transferring the style at sentence level from one known style to another (Pavlick and Tetreault, 2016; Rao and Tetreault, 2018; Niu et al., 2018; Wang et al., 2019; Briakou et al., 2021) by ignoring contextual information, such as the previous turn in a conversation. However, as demonstrated in Figure 1, the context plays an important role in defining conversation style.

* Work done during an internship at AWS AI Labs.

[Figure 1: An input utterance in chatbot agent style (Style A) and example dialogues in human agent style (Style B: conversational). Additional conversational context helps the style transfer model to yield a more appropriate response, as the dialogue contains useful information that can be leveraged during the generation process.]
In this paper, we explore a novel task: few-shot learning for conversation style transfer. Here, a style transfer model is expected to convert the style of an input conversation based on a few example conversations in the target style. This is in contrast with the common methodologies in TST, where the style is assumed to be describable with known and well-defined attributes (e.g., politeness, friendliness) (Zhang et al., 2018; Madaan et al., 2020; Reif et al., 2022). For conversations, defining such attributes is challenging due to their dynamic nature and domain dependency. Also, the style of a conversation may be a combination of many attributes. Examples from the TWCS dataset (Axelbrooke, 2017) in Table 1 show that the agent responses from the Chipotle and VirginTrains services are identified as having similar politeness scores by an off-the-shelf politeness classifier (Danescu-Niculescu-Mizil et al., 2013), but their actual styles are drastically different upon a closer look. The proposed few-shot conversation style transfer task addresses several key challenges. Firstly, the interpretation of style attributes of the source/target dialogues is no longer required; rather, the style is defined solely through a few example dialogues. Secondly, it does not require parallel data in the form of source-to-target pairs, which is expensive and difficult to collect. Finally, conversation style transfer is performed with only a few example dialogues in the target style. In this paper, we show that transferring the conversation style in such a setting helps downstream applications such as chatbot personalization and domain adaptation for training.
Tailored for the proposed few-shot learning problem, we propose a novel method based on incontext learning (Brown et al., 2020). We propose to perform source-to-target style transfer with stylefree dialogues as pivots. In this approach, we first prompt pre-trained large language models (LLMs) to perform style reduction on source dialogue, then use another set of prompts to rewrite the style-free dialogue to match the target style.
To accurately and efficiently evaluate the quality of conversation style transfer with different models, we conduct human evaluation on style strength, appropriateness, and semantic correctness. The appropriateness assessment is unique to conversation style transfer, which evaluates whether the transferred utterances are out-of-context. Appropriateness is unique and critical for TOD applications as inappropriate responses (as shown in Figure 1) can result in a degraded user experience. As supplementary metrics, we report automatic scores including classifier-based style strength and semantic similarity scores. We observe that utterance-level style transfer without contextual information can achieve the highest style strength scores, however, results in low appropriateness and semantic correctness. On the other hand, by including contextual information, although, with lower style strength, the transferred utterances are more appropriate and semantically correct.
Conversation style transfer can be applied in downstream tasks as a data augmentation or domain adaptation technique. We perform an extrinsic evaluation of style transfer in such a setting for intent classification task, where the training and test data for the task are from different style domains. We apply few-shot conversation style transfer on the training data to convert it to the test style before training. As a result, we observe improvement in intent classification F1 scores across three domains, demonstrating the usefulness of style transfer of conversations in such downstream applications.

Problem Formulation: Few-Shot Conversation Style Transfer
In this section, we propose the novel task of conversation style transfer based on a few non-parallel examples, without relying on style attribute definitions (the task is illustrated in Figure 1). Given a conversation in source style A and a few non-parallel example conversations in target style B, the task is to transfer the conversation from style A to style B. We address the following limitations of the state-of-the-art models in this task.
[Figure 2: The proposed two-step in-context learning based approach for conversation style transfer: (Step 1) The source conversation is stripped of its style and converted to a neutral, style-free form. (Step 2) The style-free conversation is converted to the target style. Both conversion steps are learned in context.]
Few-shot availability of the target style examples: Most of the existing works in style transfer operate under the assumption that a large amount of text is available in the target style to train a model (Niu et al., 2018; Wang et al., 2019; Zhang et al., 2018; Briakou et al., 2021; Madaan et al., 2020; Cheng et al., 2020; Reif et al., 2022). But this assumption may not hold in real-world settings. Hence, we limit target style data to a few dialogues.
Style transfer to arbitrary style: Existing works explicitly define style attributes (e.g., politeness) and transfer a text with a known style attribute to a style with another known attribute, such as impolite to polite (Madaan et al., 2020). But the style of a conversation can be difficult to define with a fixed set of attributes as shown in Table 1, and conversation style may be a combination of many attributes as conversations are dynamic. Hence, in this paper, we study the problem of style transfer of conversations where the style attributes of the source and the target styles are not necessarily known.

Non-parallel examples:
To train a model for transferring the style of a conversation from a source to a target style with a few examples, we would ideally want parallel conversations in the source and target styles (Reif et al., 2022; Suzgun et al., 2022). However, parallel data is difficult to obtain and to scale to many styles (including out-of-domain styles) due to the challenges of determining conversational style attributes and stylizing conversations. Hence, we assume access to a few examples in the source and the target styles that are not parallel.
Evaluation criteria: A successful conversation style transfer model is expected to produce dialogues matching the target style, while preserving the original semantics and appropriateness of the turns. So in this paper, we evaluate our models on the following metrics.
• Style Strength: Following previous studies (Reif et al., 2022; Han et al., 2022), we evaluate the target style strength of utterances produced by a style transfer model. The style strength scores are higher if the transferred utterances match the target style.
• Semantic Correctness: In the context of TOD, we define semantic correctness as the preservation of intents in the style transferred utterances.
• Appropriateness of Response: Appropriateness of response can be defined as the coherence of a response given the previous turn in a conversation. This is required in TODs to prevent the style transferred utterances in a dialogue from being out-of-context.
Positive and negative examples of these metrics are shown in Table 20 in Appendix K.

In-Context Learning for Conversation Style Transfer
In this section, we propose a novel in-context learning based method using large language models (LLMs) for few-shot conversation style transfer. The method is illustrated in Figure 2.

In-context learning with non-parallel examples in source and target styles
To tackle the unavailability of parallel conversations in source and target styles (as described in Section 2), we propose a cheaper alternative, which prompts the language models with dialogues in one style and their style-free versions. Previous work (Madaan et al., 2020) also showed the effectiveness of style transfer by first reducing the source text to a style-free form and then converting the style-free form to the target style (although they relied on a large amount of training data for the purpose). Inspired by these intuitions, we break down the task of style transfer into the following two steps.
1. Style Reduction: In this step, we use in-context learning with LLMs to reduce the source conversation to a style-free form. As a result, we need parallel examples only in the form (C_A, C′) for prompting the LLMs, where C_A is a conversation in source style A and C′ is the style-free form of C_A.
2. Transfer to the Target Style: In this step, we use another round of in-context learning to convert the style-free input conversation to the target style. This step also requires parallel examples only in the form (C′, C_B), where C_B is a conversation in target style B and C′ is the style-free form of C_B.
We use human supervision to construct the parallel (C_{A/B}, C′) examples, as it is easier for humans to rewrite a conversation in a style-free format, and it omits the requirement of having knowledge about the target style. Prompt structures and examples for in-context learning for the above two steps can be found in Appendix A.
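The two steps above can be sketched as a small pipeline. Here, `llm_complete` is a placeholder for any LLM completion call (e.g., to GPT-NeoX or Bloom), and the prompt labels are simplified assumptions rather than the exact templates from Appendix A:

```python
# Sketch of the two-step pivot pipeline. `llm_complete` and the prompt
# labels below are hypothetical stand-ins for the actual LLM call and
# the prompt templates used in the paper.

def build_prompt(pairs, source_text, src_label, tgt_label):
    """Few-shot prompt: each (input, output) example pair is rendered under
    src_label/tgt_label headers; the test input goes last, output left open."""
    parts = [f"{src_label}:\n{a}\n{tgt_label}:\n{b}\n" for a, b in pairs]
    parts.append(f"{src_label}:\n{source_text}\n{tgt_label}:\n")
    return "\n".join(parts)

def transfer_style(source_conv, reduction_pairs, target_pairs, llm_complete):
    # Step 1: style reduction, prompted with (styled, style-free) pairs.
    p1 = build_prompt(reduction_pairs, source_conv,
                      "Styled conversation", "Style-free conversation")
    style_free = llm_complete(p1)
    # Step 2: rewrite into the target style, prompted with
    # (style-free, target-styled) pairs.
    p2 = build_prompt(target_pairs, style_free,
                      "Style-free conversation", "Styled conversation")
    return llm_complete(p2)
```

Note that the two steps use disjoint example sets, which is why no source-to-target parallel data is ever needed.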

Dynamic Prompt Selection
To boost the efficiency of in-context learning in conversation style transfer, we propose a novel semantic-similarity-based dynamic prompt example selection procedure. Dynamic prompt selection has been shown to be effective in other tasks in prior work (Reif et al., 2022; Han et al., 2022). In our method, we first concatenate all of the utterances of a participant in a conversation sequentially. Then we use a sentence transformer (Reimers and Gurevych, 2019) designed for semantic search to encode the concatenated utterances into a semantic embedding. For each test conversation, we measure the cosine similarity between the embedding of the test conversation and those of all of the available few-shot training conversations. We select the top-k semantically similar few-shot examples for the test conversation during prompting. More semantically similar conversations appear later in the prompt, placing them closer to the test conversation. The effectiveness of this approach is examined by comparing it with a random prompt example selection method in Section 4.
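The selection logic can be sketched as follows. A real run would use sentence-transformer embeddings; here `embed` is a toy bag-of-words stand-in so the sketch is self-contained:

```python
# Sketch of dynamic prompt selection. `embed` is a toy bag-of-words
# stand-in for the sentence-transformer encoder used in the paper.
from collections import Counter
import math

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def select_examples(test_conv, candidates, k):
    """candidates: list of conversations, each a list of one participant's
    utterances. Returns the top-k examples, least similar first, so the
    most similar example sits closest to the test input in the prompt."""
    test_vec = embed(" ".join(test_conv))
    scored = [(cosine(test_vec, embed(" ".join(c))), c) for c in candidates]
    scored.sort(key=lambda x: x[0])     # ascending similarity
    return [c for _, c in scored[-k:]]  # top-k, most similar last
```

Ordering the selected examples so the most similar one appears last mirrors the placement described above.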

Baseline: Utterance level style transfer
Existing works study style transfer at the utterance level with in-context learning (Reif et al., 2022; Suzgun et al., 2022); we use utterance-level style transfer as a baseline. We transfer the style of the utterances of one party in a conversation utterance by utterance using the same procedure described above. For dynamic prompt selection, we measure semantic similarity between single utterances instead of concatenating all utterances of a party in a conversation. As existing models are either applicable at the utterance level only (Riley et al., 2021) or require a large amount of training data (Madaan et al., 2020), they are not applicable to conversation style transfer in the few-shot setting.

Experiments
In this section, we present the evaluation setup and results of the proposed approach on style transfer quality, including style strength, appropriateness, and semantic correctness. Then, we show the results of applying the approach to a downstream task, namely intent classification.

Setup
Dataset: Given our focus on task-oriented dialogues, we extract style transfer data from (1) the TWCS dataset (referred to as TWCS) (Axelbrooke, 2017), which contains real-life dialogues between human customer care agents and customers of different companies, and (2) cross-domain conversational data from the DSTC11 intent induction track (referred to as DSTC11), which contains human-to-human (human agents) and human-to-bot (bot agents) dialogues. We select human agent dialogues (referred to as H_1) and bot agent dialogues (referred to as B) from DSTC11, and human-to-human dialogues from Chipotle customer care from TWCS (referred to as H_2). We observed that the three styles, H_1, H_2, and B, are holistically different. Some observed properties of the human styles (H_1, H_2) are that they are engaging, conversational, and use diverse vocabularies. For example, human style H_1 is formal (uses formal words such as 'mister') while the other human style H_2 is casual and friendly (uses millennial phrases such as 'cool', 'asap'). Additionally, in H_2, human agents sign their names at the end of each response, implying a structural stylistic property of this human style. Some observed properties of the bot style are that it is crisp and to-the-point while not being informal. These observed properties are summarized with quantitative and qualitative analyses in Tables 2 and 3, and example conversations in these three styles are presented in Table 12 in the Appendix. These analyses support our observation that conversation styles are rather holistic and difficult to characterize using a fixed set of attributes.
In this paper, we study style transfer with the three complex styles stated above, where we are able to evaluate the style transfer performance on drastically different style pairs (e.g., human style H_1/H_2 to bot style B), as well as pairs with nuanced differences (e.g., human style H_1 to human style H_2). The style directions studied in this paper and the corresponding dataset statistics are shown in Table 4. For the number of contextual turns, we experiment with short segments (2 turns) and long segments (4-5 turns). For the number of examples, we select the hyperparameter based on the validation set. Note that when increasing the number of turns further, the maximum context length of the LMs is reached quickly; therefore, we leave in-context learning with the full dialogue context as future work. In Appendix A, we show example prompts for the baseline (utterance-level), short-segment, and long-segment settings.
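The short- and long-segment settings amount to slicing a dialogue into fixed-size turn windows. As a hedged sketch (the paper's exact segmentation may differ; this assumes non-overlapping windows over an alternating customer/agent turn list):

```python
# Hypothetical segmentation into the 2-turn (short) and 4-turn (long)
# windows used as style-transfer inputs; an assumption, not the paper's
# exact procedure.

def segment(turns, window):
    """Split a list of turns into consecutive windows of `window` turns."""
    return [turns[i:i + window] for i in range(0, len(turns), window)]

dialogue = ["C: hi", "A: hello", "C: refund?", "A: sure", "C: thanks", "A: bye"]
short = segment(dialogue, 2)   # 2-turn segments
long_ = segment(dialogue, 4)   # 4-turn segments (last one may be shorter)
```

Longer windows pack more tokens per example, which is why the prompt budget caps the long setting at 4-5 turns.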
Construction of Few-Shot Examples: As mentioned in Sec. 3.1, we construct a few (styled, style-free) conversation pairs for each style domain using human supervision. Compared to creating true parallel data between source and target styles, such an approach is easy to execute for humans and results in reusable examples. Humans were asked to reduce the style of the whole conversation. It took approximately 5 minutes for a human to rewrite a 10-12 turn conversation into a style-free form. As style reduction is a cheap one-time effort in our work, we leave automatic style reduction as future work.
Automatic Evaluation: The accuracies of the style classifiers for the three styles were 99.89%, 93.3%, and 100%, respectively. The details on these classifiers can be found in Appx. J. We treat the confidence scores of the classifiers as the style strength scores. For semantic similarity, we measure the cosine similarity between the SBERT embeddings (Reimers and Gurevych, 2019) of a source utterance and the corresponding style transferred utterance. For the evaluation of appropriateness, we rely only on human evaluation, as it is difficult to measure automatically.
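A minimal sketch of the two automatic metrics, assuming precomputed embedding vectors and a classifier that returns per-style probabilities (both are stand-ins; the paper uses SBERT embeddings and trained style classifiers):

```python
# Sketch of the automatic metrics. The embeddings and the probability
# dict are hypothetical inputs standing in for SBERT and the classifiers.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def style_strength(probs, target_style):
    """The classifier's confidence in the target style is used directly
    as the style strength score."""
    return probs[target_style]
```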
Human Evaluation: To obtain direct assessments of the style transfer quality of different models efficiently, we perform ranking-based human evaluation on style strength and appropriateness. To evaluate style strength, we present human evaluators with utterances in the target style to train them on the properties of the target style. Then we present them with a source utterance and the style transferred versions of it produced by the proposed models and the baseline. The model names were kept hidden from them and the order of the utterances was shuffled. We then asked the evaluators to rank all of the versions of the same utterance in descending order based on their style similarity with the reference utterances. For evaluating appropriateness, we present human evaluators with a source agent utterance and all versions of the style transferred utterances, along with the immediate previous customer turn as context, and ask them to rank the versions based on the appropriateness of the agent response. For evaluating semantic correctness, we present human evaluators with a source utterance and the corresponding style transferred utterances, and ask them, for each style transferred version, whether it is semantically similar, partially similar, or dissimilar to the source utterance. Each data point was evaluated by three human evaluators who are professional data linguists. We did not include data points where all of the models generated exactly the same response. We convert the evaluators' rankings to scores on a scale of up to 1, where a higher score means a higher rank (i.e., more appropriate or more similar in style). To aggregate scores, we average the ranking scores of the three evaluators. The pairwise comparison statistics among the models can be found in Appx. D.5. For semantic correctness, we select the label by majority voting. Details on human evaluation data statistics, evaluation interfaces, inter-annotator agreement, and rank-scaling can be found in Appx. D.
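The aggregation can be sketched as follows, under the assumption that a rank r among n systems maps to (n - r + 1) / n, so rank 1 yields a score of 1.0 (the exact rank-scaling is detailed in Appx. D and may differ):

```python
# Sketch of score aggregation. The rank-to-score mapping below is an
# assumed formula, not necessarily the one used in the paper.

def rank_to_score(rank, n_systems):
    """Map rank 1..n to a score in (0, 1]; rank 1 (best) -> 1.0."""
    return (n_systems - rank + 1) / n_systems

def aggregate_ranks(ranks, n_systems):
    """Average the rank-derived scores from the evaluators."""
    return sum(rank_to_score(r, n_systems) for r in ranks) / len(ranks)

def majority_vote(labels):
    """Semantic-correctness label chosen by majority among evaluators."""
    return max(set(labels), key=labels.count)
```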
Ablation Study: We compare dynamic prompt selection with random prompt selection (described in Sec. 3.2). In an ablation on the automatic style strength metric using GPT-NeoX, we found that dynamic prompt selection outperforms random prompt selection, as shown in Table 5.

Results
We show human evaluation results on utterance-level and segment-level style transfer in Table 6. Models were run on the test data (Table 4) using the best hyper-parameters and prompt selection method obtained in the ablation step. We first observe that the highest style strength rank score is achieved when performing utterance-level style transfer; however, this results in a lower appropriateness score. This observation shows that performing conversation style transfer without the dialogue context carries a significant risk of producing inappropriate agent utterances (i.e., utterances that do not fit into the context). We can also observe in Table 6 that the smaller LLM, GPT-NeoX, suffers more from generating inappropriate responses compared to the larger LLM, Bloom.
[Table 6: Human evaluation results for utterance-level (baseline) and conversation-level style transfer with the GPT-NeoX and Bloom LLMs using our model. The best average score over all style dimensions is marked in bold. Utterance-level style transfer achieves higher style strength, but conversation-level style transfer yields more appropriate and semantically correct responses. Statistics with SDs can be found in Appendix G.]

Next, we observe that if we increase the context (4-5 turns) in conversation style transfer, the style strength decreases but appropriateness is preserved. Interestingly, for the larger LLM Bloom, the semantic similarity decreases as the context increases. We found that Bloom sometimes generates new agent utterances different from the source utterances, or swaps the agent utterance with the customer utterance, when performing 4-5 turn conversation-level style transfer (examples are shown in Appendix I), resulting in semantically dissimilar utterances. Therefore, we conclude that the LLMs are still not successful in conditioning on larger contexts when performing style transfer, so a limited context consisting of 2 utterances is the optimal setting for style transfer in our study. Automatic evaluation results on the test set show the same pattern (Appx. G). Examples of style transferred conversations in all style directions by various versions of our model are shown in Appx. H. We present examples of errors by various versions of the models in Table 7.

Evaluation on Downstream Task
Downstream applications of conversation style transfer are understudied. In this paper, we apply conversation style transfer to intent classification. We evaluate the setting where we have abundant training data in one style and the test data is in a different style. We test our approach on three domains in the DSTC11 intent induction dataset: insurance, banking, and finance. Here, the training data is in human-to-human (h2h) style and the test data is in human-to-bot (h2b) style. We transfer the training data from h2h style to h2b style before training a RoBERTa-based intent classifier.
We ran an ablation (using the banking and finance data) with utterance-level style transfer and short-segment style transfer using GPT-NeoX, and observed that training data transferred to h2b style using utterance-level style transfer results in higher intent classification scores. We conjecture that this is because utterance-level style transfer has the strongest style strength score, which benefits the domain adaptation application. We report results with this method on all three domains in Table 8. The intent classification results show statistically significant improvement in insurance and banking, and non-significant improvement in finance, compared to the baseline where the training data has h2h style. Data statistics, experimental details, and the ablation study can be found in Appendix E.

Related Works
Style transfer in NLP has been studied in many variations. One line of research studied this problem as transferring to/from the style of popular novelists to/from modern English (e.g., Boyd et al.). Another line of research focused on defining style attributes and transferring text style from one attribute to another (e.g., positive/negative, informal/formal) (Pavlick and Tetreault, 2016; Rao and Tetreault, 2018; Niu et al., 2018; Wang et al., 2019; Briakou et al., 2021; Zhang et al., 2018; Madaan et al., 2020; Reif et al., 2022).
Existing style transfer approaches make different assumptions about data availability. Certain approaches assume a large amount of training data in the target style and use a sequence-to-sequence model (Rao and Tetreault, 2018; Niu et al., 2018; Riley et al., 2021). Another line of work studies stylizing responses in the voice of fictional characters, where the question is used as context; but the styles of the fictional characters are very evident and are characterized by special words and the other fictional characters involved in the novels or movies. In contrast, in this paper we study style transfer in Task-Oriented Dialogues where (1) the context is the previous turns among the speakers, (2) there are only a few examples of the target style available, and (3) the style attributes are unknown and the conversation style may be a combination of many style attributes.
Recent surveys have emphasized the application of text style transfer in domain adaptation (Jin et al., 2022). In this paper, we take a first step towards applying style transfer to adapt training data for the downstream task of intent classification.

Conclusion
In this paper, we study the novel problem of conversation style transfer using few-shot non-parallel examples. To solve this problem, we propose a novel in-context learning approach that transfers the style of a source conversation to the target style using style-free conversations as pivots; only a few non-parallel examples in the source and target styles are needed for the purpose. We perform human and automatic evaluations of the style transfer quality for task-oriented dialogues on style strength, appropriateness, and semantic correctness. Quantitative and qualitative evaluations showed that conversation style transfer yields more appropriate and semantically correct responses compared to utterance-level style transfer, which is crucial for applications such as chatbot personalization. Finally, using conversation style transfer for domain adaptation of training data for a downstream intent classification task improved the F1 score.

Ethics Statement
In this paper, we did not annotate any new dataset; rather, we ran our models on publicly available datasets. The DSTC11 dataset is licensed under the Apache-2.0 License and the TWCS dataset is licensed under CC BY-NC-SA 4.0, both of which allow non-commercial use and distribution. The dataset references are cited and we provide detailed statistics of the datasets used.
The examples shown in Table 1 are from real customer care agents of different companies and are taken from the TWCS dataset. These examples were selected only for studying the problem using real-life data; the authors of this paper have no connection to these companies. Note that the identities of the individual agents are hidden in the original dataset, so it does not contain any personally identifiable information.
We performed human evaluation on the models proposed in this paper. We made sure that the human evaluation interfaces do not impose any cognitive bias towards a specific model, for example by hiding model names and shuffling the order of model outputs. We provide inter-annotator agreement scores and describe the detailed human evaluation process in the paper and in the Appendix.
The model descriptions and all hyper-parameter details are provided in the paper. Hence, we believe our results are reproducible.
Any generated text reported as an example in this paper is the output of a machine learning model and does not represent the authors' or the funding agency's viewpoints in any way.
Language models are pre-trained on large amounts of human-generated text. Hence, they may inherently contain various social and human biases. These biases are not accounted for in our models, so we recommend further research on these biases before deploying the proposed models in real-life systems.

Limitations
Our work in this paper has the following limitations.
• We construct styled-to-style-free parallel conversations manually using human supervision. This may be expensive when there are a large number of style domains. An automatic method for style reduction would be ideal and is an interesting direction for future work.
• We used pre-trained Large Language Models for prompting. Studies (Blodgett et al., 2020; Brown et al., 2020) have discussed the inherent social and cultural biases of pre-trained Large Language Models: because they are trained on large amounts of human-generated text, they may inherently possess such biases. These kinds of biases are not taken into account in this paper.

A Prompting
A.1 Prompt Structure

The prompt structures for the various versions of our model for converting a source conversation to a style-free conversation are shown in Figure 3. The prompt structures for converting a style-free conversation to the target style are shown in Figure 4.

A.2 Prompt Example
Examples for all types of prompt structures (as shown in Figures 3 and 4) are provided in this appendix.

B Example Conversations from Various Domains
Example conversations for the chatbot style (referred to as B) and the two human styles H_1 and H_2 are shown in Table 12.

B.1 PMI-based Style Indicator Lemma Identification
For the identification of style indicator lemmas for each style domain, we use a Pointwise Mutual Information (PMI) (Church and Hanks, 1990) based approach. We first take all of the agent utterances from each style domain and lemmatize each word used by the agents using the spaCy Python library, ignoring all punctuation and stopwords. Then, for a lemma w, we calculate the pointwise mutual information I(w, t) with a style domain t using the following formula.
I(w, t) = log( P(w | t) / P(w) )

where P(w | t) is computed as count(w) / count(all lemmas) over the lemmas used in style t, and P(w) is computed by counting lemma w over the set of utterances in all styles. We then rank the lemmas for each style domain based on their PMI scores. To remove topic-specific lemmas and rarely used lemmas, we ignore lemmas that are used in more than 10% of the agent utterances in each style domain, and lemmas used less than 0.5%, 0.3%, and 0.3% of the time for styles H_1, B, and H_2, respectively. The top 300 high-PMI lemmas for each style domain are reported in Table 9. Hand-picked style indicator lemmas from this top-300 list are shown in Table 3.
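The PMI computation can be sketched as follows, with toy lemma lists standing in for the spaCy-lemmatized agent utterances (punctuation and stopwords already removed):

```python
# Runnable sketch of the PMI computation. The input dict is a hypothetical
# stand-in for the per-style lemmatized agent utterances.
import math
from collections import Counter

def pmi_scores(lemmas_by_style):
    """lemmas_by_style: {style: [lemma, ...]}. Returns {style: {lemma: PMI}}
    with I(w, t) = log(P(w | t) / P(w))."""
    all_lemmas = [w for ls in lemmas_by_style.values() for w in ls]
    total = Counter(all_lemmas)
    n_total = len(all_lemmas)
    scores = {}
    for style, lemmas in lemmas_by_style.items():
        counts = Counter(lemmas)
        n = len(lemmas)
        scores[style] = {
            w: math.log((c / n) / (total[w] / n_total))  # I(w, t)
            for w, c in counts.items()
        }
    return scores
```

Lemmas over-represented in one style relative to the whole corpus get positive PMI, which is what makes them style indicators.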

B.2 Construction of parallel style free conversations using human supervision
A human annotator was presented with 5-7 conversations from each of the style domains (B, H_1, H_2) and asked to rewrite those conversations in a style-free form. One parallel style-free example per style domain written by the human annotator is shown on the right-hand side of Figure 7. The human annotator is a researcher in NLP, and it took approximately 5 minutes for them to rewrite a 10-12 turn conversation in a style-free format. These style-free parallel conversations are used for in-context learning. The statistics of the few-shot examples per style domain are shown in Table 10.

C Ablation Study
We perform an ablation study to select the number of shots and to compare the effect of dynamic prompt selection. We experiment with 5-, 10-, and 20-shot training for utterance-level style transfer and 2-turn conversation-level style transfer. Because of the token limit in prompts, we experiment with 4- and 8-shot training for 4/5-turn conversation-level style transfer. Note that with 4/5-turn context, each training example contains many more tokens. When transferring to the second human style H_2, 20-shot training is not supported because of the prompt limit and the conversations in this style being more conversational and greater in length. We measure the effectiveness of the number of training examples and the prompt selection techniques by the automatically measured style strength of the target style after style transfer. We run this ablation study on the validation dataset shown in Table 4 and use GPT-NeoX as the language model, as it is cheaper to use compared to Bigscience-Bloom. The results are shown in Table 11. It can be seen that dynamic prompt selection outperforms random prompt selection in all cases. The optimal number of shots for utterance-level style transfer and 2-turn conversation-level style transfer is 10, and for 4/5-turn conversation-level style transfer it is 8.

D.1 Data Selection for Human Evaluation
Our goal with human evaluation is to compare different models. We used the test dataset described in Table 4 for human evaluation. Note that the same conversation segments are used to evaluate the various versions of our model and the baseline, using GPT-NeoX and Bloom as LLMs. We evaluate only agent responses, and we apply two filtering steps to these datasets before human evaluation.

Filtering Step 1: Style transfer at the 4/5-turns conversation level may produce a conversation that is non-parallel to the source conversation because the model reduces the number of turns. To match the non-parallel utterances with the source utterances, we rank the style-transferred utterances by their semantic similarity to the source utterances and pick the one with the highest similarity. We discard any style-transferred utterance whose highest semantic similarity is less than 0.2; manual inspection showed that these were unrelated utterances generated by the LLMs.

Filtering Step 2: We filtered out all agent responses where none of the models (including the baseline) changed the source agent utterance, or where all models produced the same style-transferred version.
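Filtering Step 1 above can be sketched as a simple alignment routine. This is a minimal illustration, assuming precomputed unit-length-normalizable sentence embeddings and cosine similarity; the embedding model itself is not specified here.

```python
import numpy as np

def align_utterances(src_vecs, tgt_vecs, threshold=0.2):
    """For each source utterance embedding, return the index of the most
    semantically similar style-transferred utterance, or None if even the
    best match falls below the similarity threshold (treated as an
    unrelated generation and discarded)."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T  # pairwise cosine similarities
    matches = []
    for row in sims:
        best = int(np.argmax(row))
        matches.append(best if row[best] >= threshold else None)
    return matches
```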

D.2 Human Evaluation Settings
Each data point was evaluated by three human evaluators. We worked with professional data linguists who are fluent in English. They were compensated on an hourly basis, in accordance with the standard compensation rate in the United States. They were first trained on the tasks: specifically, they were briefed on what we mean by style strength, appropriateness, and semantic correctness.
Worked-out examples were provided to them. The model names were hidden from the annotators, and the versions were presented in a randomly shuffled order for each example. For the ranking-based evaluation of style strength and appropriateness, the human evaluators were instructed to rank the style-transferred versions from the various models. For example, when evaluating among 3 models, a rank of 1 means the highest style strength or appropriateness and a rank of 3 means the lowest. The annotators were instructed to give two style-transferred versions the same rank if they were equal in style strength or appropriateness. For the evaluation of semantic correctness, the human evaluators were presented with the source utterance and the style-transferred versions of the source utterance from each of the models. We then asked them, for each style-transferred version, whether it is semantically similar, partially similar, or dissimilar to the source utterance. The annotation prompts for the style strength, appropriateness, and semantic correctness evaluation tasks are shown in Figures 8, 9, and 10, respectively.

Table 11: Ablation study for selecting the number of shots and the prompt selection method. Here, "N/S" means "Not Supported" because of the token limit in the prompt. GPT-NeoX was used as the Language Model in this ablation study. Dynamic prompt selection outperforms random prompt selection in all cases. The optimal numbers of shots for utterance-level, 2-turns conversation-level, and 4/5-turns conversation-level style transfer are 10, 10, and 8, respectively.

D.3.1 Style Strength and Appropriateness
For measuring the inter-annotator agreement in ranking evaluations for style strength and appropriateness, we use Spearman's Rank Correlation Coefficient (Zar, 2005). We take the average Spearman's Rank Correlation Coefficient between each pair of human annotators for each data point as an agreement measure. It ranges from -1 to +1 where -1 means absolute disagreement and +1 means absolute agreement.
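The per-data-point agreement measure described above can be sketched as follows. This is a minimal illustration using SciPy's `spearmanr`; the input format (one list of ranks per annotator) is an assumption.

```python
from itertools import combinations
from scipy.stats import spearmanr

def pairwise_agreement(rankings):
    """rankings: one list of ranks per annotator for the same data point.
    Returns the average Spearman's rank correlation coefficient over all
    pairs of annotators (-1 = absolute disagreement, +1 = absolute
    agreement)."""
    rhos = []
    for a, b in combinations(rankings, 2):
        rho, _ = spearmanr(a, b)
        rhos.append(rho)
    return sum(rhos) / len(rhos)
```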

D.3.3 Agreement Scores
The inter-annotator agreement for all of the tasks is shown in Table 13. Note that for the semantic correctness evaluation task, all data points are aggregated to measure the agreement score, as they represent categorical evaluation measures. That is not possible for the ranking-based evaluations of style strength and appropriateness, so we measure the agreement for each data point and average over all data points. Table 13 shows that in all cases we get strong agreement (> 0.70) among the annotators for the style strength and appropriateness evaluations. The only exception is the style strength evaluation task in the direction H 1 → H 2 using the GPT-NeoX model, where the agreement score is slightly lower (0.69). Our insight is that both of these are human styles and the difference between them is very subtle, so it is difficult even for humans to differentiate between them. This pattern is observed in the automatic evaluation as well.
For the semantic correctness evaluation task, we always get strong agreement among annotators (> 0.75).

D.4 Scaling Ranking Scores
In the style strength and appropriateness evaluation tasks, we use a ranking-based measure over the outputs from the various models. For example, when evaluating among 3 models, a rank of 1 means the highest style strength or appropriateness and a rank of 3 means the lowest. We scale these rank scores to the range between 0 and 1, where a higher score means higher style strength or appropriateness. The rankings were scaled for each data point using the following formula.
For each data point, let $k$ be the number of versions to be ranked and $r_i$ the rank of version $i$ ($i \in \{1, \dots, k\}$). The reverse rank score is $r_i^{\mathrm{rev}} = k - r_i + 1$, and the scaled rank score is
$$r_i^{\mathrm{scaled}} = \frac{r_i^{\mathrm{rev}} - \min_{j \in \{1, \dots, k\}} r_j^{\mathrm{rev}}}{\max_{j \in \{1, \dots, k\}} r_j^{\mathrm{rev}} - \min_{j \in \{1, \dots, k\}} r_j^{\mathrm{rev}}}.$$
We average all human evaluators' scaled ranking scores to get the final scaled ranking score for a data point.
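The scaling above translates directly into code. A minimal sketch, assuming not every version received the same rank (otherwise the denominator would be zero):

```python
def scale_ranks(ranks):
    """ranks: the rank of each version for one data point (1 = best).
    Returns scores in [0, 1], where higher means higher style strength
    or appropriateness."""
    k = len(ranks)
    rev = [k - r + 1 for r in ranks]  # reverse rank scores
    lo, hi = min(rev), max(rev)
    return [(v - lo) / (hi - lo) for v in rev]
```

For example, ranks (1, 2, 3) map to scaled scores (1.0, 0.5, 0.0).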

D.5 Pairwise Comparison Among Models
The pairwise comparisons among the various versions of the models for style strength and appropriateness are shown in Table 14. This table reports the percentage of times a model is ranked higher in the style strength and appropriateness evaluations.

E.1 Dataset

We take the three domains of the DSTC11 dataset, namely Insurance, Banking, and Finance, for this task. In this dataset, mostly the customer utterances are annotated for intents. We take the human-to-human conversations as training data and the human-to-bot conversations as test data. We consider intent classes having at least 20 training utterances for this study. We then randomly select 90% of the training data from each intent class as the training set and the remaining 10% as the validation set. The training, test, and validation data statistics for each of the domains are shown in Table 15.

E.2 Few-Shot Style Transfer of Training Data
In this dataset, mostly the customer utterances are annotated with intent classes, so we perform few-shot style transfer of the customer utterances only, using the same procedure we followed for agent-utterance style transfer. We found that customers are more conversational when talking to a human agent than when talking to a chatbot agent, so we use few-shot customer utterances from the human-to-bot conversations to transfer the style of customers in human-to-human conversations. We then use this style-transferred data to train an intent classifier. We use a 10-shot setting with dynamic prompt selection based on semantic similarity.
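Dynamic prompt selection based on semantic similarity can be sketched as follows. This is a minimal illustration over precomputed embeddings; the retrieval of the top-k most similar few-shot examples by cosine similarity is an assumption about the mechanism, and the embedding model is not specified here.

```python
import numpy as np

def select_prompts(query_vec, example_vecs, k=10):
    """Return the indices of the k few-shot examples whose embeddings
    are most cosine-similar to the query utterance embedding, for use
    as in-context examples in the prompt."""
    q = query_vec / np.linalg.norm(query_vec)
    e = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    sims = e @ q  # cosine similarity of each example to the query
    return np.argsort(-sims)[:k].tolist()
```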

E.3 Intent Classification Results
We compare the performance of the intent classifier when trained on human-to-human conversations vs. on human-to-human conversations that are transferred to the human-to-bot style. We ran an ablation where we experimented with utterance-level style transfer and 2-turns conversation-level style transfer, as these two methods yielded better style strength in our studies. We ran this ablation using only the Banking and Finance domains out of the three domains. The classification was run 10 times with 10 random seeds for each domain. A RoBERTa-based (Liu et al., 2019) text classifier was used to perform the intent classification task. We encoded each utterance using RoBERTa, where the embedding of the [CLS] token of the last layer was used as the representation of the utterance; this representation was used for intent classification. The average classification results are shown in Table 16.
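The classifier architecture described above can be sketched as a linear head over the [CLS] representation. This is a minimal sketch, assuming a 768-dimensional encoder output (as for roberta-base); the RoBERTa encoder that produces the [CLS] embedding is not shown.

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Linear intent classifier over the [CLS] token embedding taken
    from the last layer of a RoBERTa encoder (encoder not shown)."""

    def __init__(self, num_intents, hidden_dim=768):
        super().__init__()
        self.head = nn.Linear(hidden_dim, num_intents)

    def forward(self, cls_embedding):
        # cls_embedding: (batch, hidden_dim) utterance representations
        return self.head(cls_embedding)
```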
Overall, utterance-level style transfer yields the best intent classification results, as it achieves the best style strength of the test domain (human-to-bot style).

F LLM Hyperparameters and Infrastructure Used
We use top-k sampling with temperature t (Holtzman et al., 2019) as the decoding method for the large language models. k=1 and t=0.1 were set for all of our experiments. We ran all experiments using PyTorch. Both Bloom and GPT-NeoX were run on a computation node with 8 A100 GPUs.
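As a sketch, the stated decoding settings correspond to the following sampling arguments in HuggingFace-style `generate` calls; the exact API used in the experiments is an assumption.

```python
# Decoding configuration matching the stated settings: top-k sampling
# with k=1 and temperature 0.1 (with k=1 this is effectively greedy
# decoding, since only the single most likely token can be sampled).
generation_kwargs = {
    "do_sample": True,
    "top_k": 1,
    "temperature": 0.1,
}
```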

I Error Analysis for Bigscience-Bloom
Examples of some common types of errors observed in 4/5-turns conversation level style transfer using Bigscience-Bloom are shown in Table 19.

J Style Discriminator Models
We trained RoBERTa-based binary text classifiers to classify between the source and the target styles. Training data for these classifiers were obtained from the residual data after selecting the test and validation sets, as described in Table 4. We treat the confidence scores of these classifiers as the style strength scores. We balance the training data for both classes when training these classifiers. For training the classifiers to differentiate between styles (H 1 , B), (H 1 , H 2 ), and (H 2 , B), we randomly sampled 4,875, 1,792, and 1,792 agent utterances from each class, respectively. 10% of the data were held out as a validation set. We encoded each agent utterance using a RoBERTa model, where the embedding of the [CLS] token of the last layer was used as the representation of the utterance; this representation was used for classifying the style domain. We stopped training when the validation accuracy did not improve for two consecutive epochs. The validation accuracies of the classifiers differentiating between styles (H 1 , B), (H 1 , H 2 ), and (H 2 , B) were 99.89%, 93.3%, and 100%, respectively. Note that style H 2 has a unique property: each agent signs their name after their response, preceded by a hyphen. A classifier trained to identify style H 2 always yielded an accuracy of 100% because of this specific signature format; as a result, other stylistic properties, such as vocabulary usage, crispness, and conversational tone, were missed by the style classifier. So, for training the classifiers involving this style class, we removed these signatures as a preprocessing step.
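The signature-removal preprocessing step can be sketched with a small regular expression. This is a hypothetical illustration: the exact signature format (a hyphen followed by the agent's name at the end of the utterance) is inferred from the description above.

```python
import re

# Matches a trailing signature of the form " - Name" or " -Name" at the
# end of an agent utterance (hypothetical format based on the paper's
# description of style H2).
SIGNATURE_RE = re.compile(r"\s*-\s*[A-Z][a-z]+\s*$")

def strip_signature(utterance):
    """Remove the trailing agent signature before style classification."""
    return SIGNATURE_RE.sub("", utterance)
```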

K Style Evaluation Metrics
Examples of the three style transfer metrics (style strength, semantic correctness, and appropriateness) can be found in Table 20.

Table 18: Automatic style strength and semantic correctness evaluation results for utterance-level (baseline) and conversation-level style transfer with GPT-NeoX and Bigscience-Bloom LLMs using our model. Utterance-level style transfer achieves higher style strength, and conversation-level style transfer yields more semantically similar responses.