Boosting Distress Support Dialogue Responses with Motivational Interviewing Strategy

AI-driven chatbots have become an emerging solution for addressing psychological distress. Due to the lack of psychotherapeutic data, researchers use dialogues scraped from online peer support forums to train them. But since the responses on such platforms are not given by professionals, they contain both conforming and non-conforming responses. In this work, we attempt to recognize these conforming and non-conforming response types present in online distress-support dialogues using labels adapted from a well-established behavioral coding scheme named the Motivational Interviewing Treatment Integrity (MITI) code, and we show how some response types can be rephrased into a more MI-adherent form that can, in turn, enable chatbot responses to be more compliant with the MI strategy. As a proof of concept, we build several rephrasers by fine-tuning Blender and GPT-3 to rephrase MI non-adherent "Advise without permission" responses into "Advise with permission". We show how this can be achieved by constructing pseudo-parallel corpora, avoiding the cost of human labor. Through automatic and human evaluation, we show that, in the presence of less training data, techniques such as prompting and data augmentation can be used to produce substantially good rephrasings that reflect the intended style and preserve the content of the original text.


Introduction
Demands of the modern world are increasingly responsible for causing severe psychological distress in people. The World Health Organization estimates that psychological distress affects 29% of people in their lifetime (Steel et al., 2014). The shortage of mental health workers and the stigma associated with mental health further demotivate people from actively seeking help. With the expansion of the internet, many people resort to peer support platforms such as Reddit and TalkLife to vent their distress. The anonymity associated with these platforms makes it easier for people to discuss their concerns without being affected by the stigma. Distress consolation through AI-driven chatbots has also become an emerging solution (Fitzpatrick et al., 2017; Inkster et al., 2018; Mousavi et al., 2021). Due to the lack of large-scale psycho-therapeutic conversations, researchers use data scraped from online peer support forums to train such chatbots (Alambo et al., 2019; Welivita and Pu, 2022). High levels of perceived empathy and information richness make them good candidates for training (Nambisan, 2011; De Choudhury and De, 2014; Sharma et al., 2020a,b). But since peers are not professionals, the responses contained in such forums can sometimes be unfavourable for addressing distress (e.g., confrontations, judgments, orders, etc.), so using this data can carry severe risks. One solution is to identify the favourable and unfavourable response types that appear in distress support dialogues and develop automatic means of omitting or rephrasing the unfavourable ones. Figure 1 shows an example.
To analyze the types of responses in distress support dialogues, we use labels adapted from a well-established behavioral coding system named the Motivational Interviewing Treatment Integrity (MITI) code (Moyers et al., 2014). It is used in psychology to evaluate how well a mental health provider responds. Specific response types from the MITI code have been shown to increase the likelihood of positive health outcomes (Pérez-Rosas et al., 2018; Gaume et al., 2009). It defines favourable response types such as Questioning, Reflecting, and Advising with permission, and unfavourable response types such as Advising without permission, Confronting, and Self-Disclosing (extra-session). In our previous work, we developed a dataset called the MI dataset to gain a comparative understanding of the differences between online support provided by peers and by trained counselors. For this, we hired professional counselors to annotate responses given by peers and counselors with labels derived from the MITI code. During analysis, we observed that peers' responses tend to be more supportive and encouraging than counselors' (as evidenced by the increased percentage of Support and Affirm labels). But we also observed that important therapeutic techniques, such as asking more open questions than closed ones, reflections, giving information, advice with permission, and emphasizing the speaker's autonomy, were lacking in peers' responses and hence require further boosting. One of the major observations was that among the advice given by peers, 92.86% belonged to the category Advise without permission, which is MI non-adherent. This percentage was lower in counselor responses, but still accounted for 77.22% of the advice given by counselors.
In this work, we aim to detect such Advise without permission responses in distress support dialogues and build a rephraser that can rephrase them into Advise with permission, which is more MI-adherent. First, we detect such responses through a classifier trained on an augmented version of the MI dataset. Next, as we do not have human-written rephrasings of Advise without permission responses into Advise with permission, we use automatic methods such as template-based replacement and retrieval to construct a pseudo-parallel training corpus containing pairs of Advise without permission and Advise with permission sentences. Since rephrasing is labor-intensive compared to labeling, and in the distress consolation setting it requires professionally trained counselors, using our already labeled dataset to construct a pseudo-parallel corpus saved us both time and cost. We apply the same methods to the augmented version of the MI dataset to form a much larger pseudo-parallel training corpus and use these corpora to fine-tune BlenderBot (Roller et al., 2021) and GPT-3 (Brown et al., 2020). Some of the models we fine-tune incorporate different forms of prompting, with the aim of obtaining a better outcome with fewer training examples. We evaluate the rephrasers using automatic and human evaluation. The results mainly show that when the training dataset is small, prompting improves the performance of the rephrasers along the style transfer and semantic similarity dimensions. They also suggest that when the training dataset is large (in our case, through data augmentation), pseudo-parallel data generated through simpler methods such as template replacement produce better results.
Our contributions are four-fold. 1) We develop an MI classifier that can predict 15 different favourable and unfavourable response types derived from the MITI code. 2) We propose a methodology to rephrase responses detected as Advise without Permission into the more MI-adherent Advise with Permission. We show how this can be done in the absence of human-written rephrasings by developing pseudo-parallel corpora using different automatic methods. 3) We evaluate these rephrasers using automatic and human evaluation and show how prompting and data augmentation can improve the performance of the rephrasers when there is less training data. 4) Finally, we discuss how this method can be applied to boost chatbot responses, making them more compliant with the MI strategy. Our code and datasets can be found at https://github.com/anuradha1992/Boosting-with-MI-Strategy.

Related Work
Rephrasing responses recognized as Advise without Permission into Advise with Permission can be identified as a sub-task of Text Style Transfer (TST), in which the goal is to automatically control the style attributes (e.g., sentiment, politeness, humor, etc.) of text while preserving its content (Jin et al., 2022). The field of TST involves traditional linguistic approaches as well as deep learning approaches. Traditional approaches to TST rely on term replacement and templates (Mairesse and Walker, 2011; Sheikha and Inkpen, 2011). With the success of deep learning, various neural methods have recently been proposed for TST. Given datasets in which there are direct mappings between text of the source style and text of the target style, referred to as parallel corpora, standard sequence-to-sequence models are often directly applied (Rao and Tetreault, 2018; Shang et al., 2019; Xu et al., 2019). But parallel corpora are challenging to find because developing such data often requires costly human labor. Thus, TST on non-parallel corpora has become an emerging area of research (Li et al., 2018; Jin et al., 2019; Liu et al., 2022).
Parallel and non-parallel datasets have been proposed for common sub-tasks of TST such as sentiment (Shen et al., 2017), topic (Huang et al., 2020), formality (Rao and Tetreault, 2018), politeness (Madaan et al., 2020), and humor (Gan et al., 2017) transfer. But to the best of our knowledge, this is the first attempt at introducing a new sub-task and releasing a non-parallel corpus for style transfer between MI non-adherent Advise without Permission and MI-adherent Advise with Permission responses. This task is more challenging than the other sub-tasks because it requires the expertise of professional counselors to generate training data. In this work, we release a non-parallel corpus annotated by professional counselors that can be utilized for this task. We also show how automatic methods can be applied to create pseudo-parallel corpora from this dataset, which can be used to train neural models for this task.

Datasets
For this work, we used dialogues curated from two online support platforms. The first is CounselChat (counselchat.com), on which verified counselors respond to distress-related posts. The publicly available CounselChat dataset contains 2,129 post-response pairs spanning 31 distress-related topics. We also curated dialogues from a carefully selected set of 8 subreddits popular among Reddit users for venting their distress: mentalhealthsupport, offmychest, sad, suicidewatch, anxietyhelp, depression, depressed, and depression_help.
This dataset, which we call RED (Reddit Emotional Distress), contains 1,275,486 dyadic conversations with an average of 2.66 turns per dialogue.
In our previous work, we recruited professional counselors to annotate a subset of 1,000 dialogues each from the CounselChat and RED datasets with labels adapted from the MITI code 2.0 (Moyers et al., 2003) and 4.2.1 (Moyers et al., 2014). We call this the MI dataset. We used 15 labels for annotation, which are elaborated in the appendices. Among them, we are interested in the labels Advise with Permission and Advise without Permission, which are considered MI-adherent and MI non-adherent response types, respectively. The MI dataset contains 16,811 annotated responses, of which 2.87% (484) are labeled as Advise with Permission and 13.5% (2,285) as Advise without Permission.
To further augment the MI dataset, we used automatic labeling to extend the 15 labels to unlabeled dialogue responses from the CounselChat and RED datasets. We used two automatic methods for this purpose: 1) N-gram-based matching; and 2) similarity-based retrieval.
N-gram-Based Matching: By tokenizing the responses in the MI dataset and computing frequencies, we discovered the most frequent N-grams (four-grams and five-grams) occurring with each of the 15 labels. Examples are shown in the appendices. Next, we searched for these indicative N-grams (first five-grams and then four-grams) among individual sentences appearing in dialogue responses of the unlabeled CounselChat and RED datasets. If an indicative N-gram was found in a sentence, we labeled that sentence with the label that N-gram is indicative of. Sentences with overlapping labels were discarded due to ambiguity. In this way, we were able to automatically label 1,918 and 340,361 sentences in the CounselChat and RED datasets, respectively.
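This labeling rule can be sketched as follows. The indicative N-grams below are illustrative stand-ins (the actual mined lists appear in the appendices), and tokenization is simplified to whitespace splitting:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as space-joined strings."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def label_by_ngrams(sentence, indicative):
    """Label a sentence with the MITI label whose indicative five-gram
    (or, failing that, four-gram) it contains.

    `indicative` maps (n, ngram_string) -> label. Sentences matching
    N-grams of more than one label are discarded as ambiguous (None).
    """
    tokens = sentence.lower().split()
    for n in (5, 4):  # five-grams take precedence over four-grams
        hits = {indicative[(n, g)] for g in ngrams(tokens, n) if (n, g) in indicative}
        if len(hits) == 1:   # unambiguous match
            return hits.pop()
        if len(hits) > 1:    # overlapping labels: discard
            return None
    return None
```

In the actual pipeline this is applied sentence-by-sentence over the unlabeled CounselChat and RED responses.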
Similarity-Based Retrieval: For each unlabeled sentence among the responses in the CounselChat and RED datasets, we computed the cosine similarity with each of the labeled sentences in the MI dataset. Next, for each unlabeled sentence, we retrieved the labeled sentences whose cosine similarity was higher than a certain threshold (the thresholds differed for each of the 15 labels and were selected after manually inspecting randomly selected pairs of unlabeled and labeled sentences corresponding to different labels). We then used a majority voting scheme to select the label to associate with the unlabeled sentence. When we encountered ties, we computed the average similarity across the clusters of retrieved sentences with the tied labels and selected the label with the maximum average similarity. Using this method, we were able to automatically annotate 2,881 and 1,196,012 sentences in the CounselChat and RED datasets, respectively.
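The voting and tie-breaking logic above can be sketched as follows; the per-label thresholds and similarity scores are assumed to be computed beforehand (in our case, from sentence-embedding cosine similarities):

```python
from collections import Counter

def label_by_retrieval(sims, thresholds):
    """Assign a label to an unlabeled sentence by majority vote.

    `sims` is a list of (label, cosine_similarity) pairs against labeled
    MI-dataset sentences; `thresholds` maps each label to its per-label
    similarity cutoff. Ties are broken by the highest average similarity
    among the retrieved sentences of each tied label.
    """
    retrieved = [(lbl, s) for lbl, s in sims if s >= thresholds[lbl]]
    if not retrieved:
        return None
    counts = Counter(lbl for lbl, _ in retrieved).most_common()
    best = [lbl for lbl, c in counts if c == counts[0][1]]
    if len(best) == 1:
        return best[0]

    # Tie: pick the label whose retrieved cluster has the highest mean similarity
    def mean_sim(lbl):
        vals = [s for l, s in retrieved if l == lbl]
        return sum(vals) / len(vals)

    return max(best, key=mean_sim)
```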
Using the union and the intersection of the labels obtained from N-gram-based matching and similarity-based retrieval, and combining them with the gold labels from the MI dataset, we created two augmented-labeled MI datasets with 1,378,469 and 84,052 labeled sentences, respectively. For simplicity, we will refer to them as the MI Augmented (Union) and MI Augmented (Intersection) datasets.

MI Classifier
We developed a classifier to automatically classify responses in distress-support dialogues into one of the 15 labels mentioned above. This is an important step that precedes rephrasing, since the unfavourable response types must first be identified. The classifier consists of a representation network that uses the BERT architecture (Devlin et al., 2019), an attention layer that aggregates the hidden states at each time step, a hidden layer, and a softmax layer. We used the BERT-base architecture with 12 layers, 768 dimensions, 12 heads, and 110M parameters as the representation network. It was initialized with weights from RoBERTa (Liu et al., 2019). We trained three classifiers. The first was trained on the smaller human-annotated MI dataset (MI Gold), taking 80% of the data for training and leaving 10% each for validation and testing. The other two were trained on the MI Augmented (Union) and MI Augmented (Intersection) datasets, leaving out the data used for validation and testing in the first case. In all cases, the optimal model was chosen based on the average cross-entropy loss between the ground-truth and predicted labels on the human-annotated validation set.
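The head on top of the encoder can be sketched as follows. This is a simplified stand-in (NumPy instead of a deep learning framework, toy dimensions, and randomly initialized weights in place of learned ones); it only illustrates how the attention layer pools the per-token hidden states before the hidden and softmax layers:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def classify(hidden_states, w_att, W_hid, b_hid, W_out, b_out):
    """Classification head over BERT encoder outputs.

    hidden_states: (seq_len, d) per-token hidden states from the encoder.
    An attention layer aggregates all time steps into one vector, which
    passes through a hidden layer and a softmax over the 15 MITI labels.
    """
    scores = hidden_states @ w_att       # (seq_len,) attention logits
    alpha = softmax(scores)              # attention weights over time steps
    pooled = alpha @ hidden_states       # (d,) weighted sum of hidden states
    h = np.tanh(pooled @ W_hid + b_hid)  # hidden layer
    return softmax(h @ W_out + b_out)    # (15,) label distribution
```

In the real model, all of these weights are trained jointly with the BERT encoder by minimizing cross-entropy against the annotated labels.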
The classifiers trained on the MI Gold, MI Augmented (Intersection), and MI Augmented (Union) datasets reported accuracies of 68.31%, 67.13%, and 73.44% on the MI Gold test set, respectively. The corresponding accuracies on the MI Gold validation set were 67.08%, 64.07%, and 72.67%. Accordingly, the labels collected through the union of N-gram matching and cosine-similarity-based retrieval improved the accuracy of the classifier by a relative 8.33% and 7.5% on the validation and test sets, respectively, compared to training on the gold-labeled MI dataset.

MI Rephraser
After identifying the favourable and unfavourable response types, we can choose to omit the unfavourable responses or, if possible, rephrase them into a more MI-adherent form. A label pair to which this rephrasing strategy can be applied directly is Advise without Permission and Advise with Permission. Through N-gram analysis, we discovered N-gram patterns indicative of Advise without Permission (e.g., You should, You need to, You mustn't) and of Advise with Permission (e.g., It may be helpful to, I wonder if you can, You may want to consider). These can be identified as style attributes that vary between responses identified as Advise without Permission and Advise with Permission. Thus, given a response identified as Advise without Permission, the goal of the rephraser is to rephrase the response to be indicative of Advise with Permission, without changing its semantic content.
As mentioned in Section 2, this can be identified as a sub-task of Text Style Transfer (TST). TST is formally defined as follows: given a text x carrying a source attribute value a, model p(x′ | a′, x) to generate a target utterance x′ carrying the target discourse style attribute a′. In our case, x corresponds to the response identified as Advise without Permission, a corresponds to Advise without Permission, and a′ corresponds to Advise with Permission.

Pseudo-Parallel Corpora
As discussed in Section 2, the most recent methods for TST involve data-driven deep learning models. The prerequisite for using such models is the existence of style-specific corpora for each style of interest, either parallel or non-parallel. With the human-annotated MI dataset, we are in possession of a non-parallel corpus containing 2,285 Advise without Permission and 484 Advise with Permission responses. With the MI Augmented (Union) dataset, we have 199,885 Advise without Permission and 3,541 Advise with Permission responses. Since creating parallel corpora consumes human labor and cost, we decided to use the above data to create pseudo-parallel corpora containing pairs of Advise without Permission and Advise with Permission responses to train our rephrasers. We used two automatic methods to create these pseudo-parallel corpora: 1) template-based replacement; and 2) retrieval.

Template-Based Replacement Method
We used frequency-based N-gram analysis accompanied by human inspection to determine the linguistic templates that represent Advise with Permission and Advise without Permission responses. Table 1 shows some templates discovered for Advise without Permission (left) and Advise with Permission (right). In template-based replacement, if the algorithm detects any linguistic template on the left in a response labeled as Advise without Permission, it randomly selects a template from the right to replace it with, giving a pair of Advise without Permission and Advise with Permission responses that contain the same semantic content but differ in style. We constructed two pseudo-parallel corpora by applying this method to the MI Gold and MI Augmented (Union) datasets, which contained 2,285 and 199,885 responses labeled as Advise without Permission, respectively. They yielded 240 and 38,559 response pairs, respectively.
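The replacement rule can be sketched as follows, using a small illustrative subset of templates in place of the full mined lists:

```python
import random

# Illustrative subsets of the mined templates
WITHOUT = ["you should", "you need to", "you mustn't"]          # Advise without Permission
WITH = ["it may be helpful to", "you may want to consider"]     # Advise with Permission

def make_pair(sentence, rng=random):
    """If the sentence contains an Advise-without-Permission template,
    replace it with a randomly chosen Advise-with-Permission template,
    yielding a pseudo-parallel (source, target) pair; otherwise None."""
    low = sentence.lower()
    for tpl in WITHOUT:
        if tpl in low:
            i = low.index(tpl)
            target = sentence[:i] + rng.choice(WITH) + sentence[i + len(tpl):]
            return sentence, target
    return None
```

Because the rest of the sentence is copied verbatim, both members of the pair share the same semantic content and differ only in the style-bearing template.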

Retrieval Method
Given the non-parallel corpus containing Advise without Permission and Advise with Permission responses, we computed the semantic similarity between the two types of responses and retrieved the response pairs whose similarity was above a certain threshold. We used Sentence-BERT (Reimers and Gurevych, 2019) to generate embeddings of the responses and compared them using cosine similarity. After manually inspecting a random subset of response pairs over a range of similarity thresholds, we chose 0.7 as the final threshold for determining semantically similar response pairs. As with template-based replacement, we applied this method to the gold-labeled and augmented-labeled MI datasets to construct two pseudo-parallel corpora, obtaining 104 and 54,956 response pairs, respectively. For simplicity, we will refer to the corpus constructed from the gold-labeled MI dataset as the pseudo-parallel (PP) corpus and the corpus constructed from the augmented-labeled MI dataset as the pseudo-parallel augmented (PPA) corpus. We used 80% of the data from each corpus for training our rephrasers and 10% each for validation and testing. In Section 7, we gauge the quality of the above corpora using human ratings.
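The pairing step can be sketched as below. The embeddings are assumed to be precomputed (by Sentence-BERT in the actual pipeline); here they are plain arrays, and the 0.7 cutoff is the threshold chosen above:

```python
import numpy as np

def retrieve_pairs(src_emb, tgt_emb, threshold=0.7):
    """Pair Advise-without-Permission sentences (rows of src_emb) with
    Advise-with-Permission sentences (rows of tgt_emb) whose cosine
    similarity exceeds the threshold. Returns (src_idx, tgt_idx, sim)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T  # (n_src, n_tgt) cosine similarity matrix
    return [(int(i), int(j), float(sims[i, j]))
            for i, j in zip(*np.where(sims > threshold))]
```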

Rephrasing Models
Using the above corpora, we fine-tuned two pretrained language generation architectures: Blender (Roller et al., 2021) and GPT-3 (Brown et al., 2020). Blender is a standard Seq2Seq transformer-based dialogue model. We used the 90M-parameter version of Blender. Though it is a dialogue generation model, we used it mainly because it is pretrained on Reddit discussions containing ≈1.5B comments and is thus already familiar with the language constructs used in peer support. GPT-3 is a language model that uses a standard transformer network with 175 billion parameters. We used the smallest but fastest version of GPT-3, Ada, to build our rephrasers. The main reason for using GPT-3 is that it has demonstrated strong few-shot learning capability on many text-based tasks. Both Blender and GPT-3 were fine-tuned on the template-based, retrieval-based, and combined PP and PPA corpora.
Prior work has shown that large language models can perform various tasks given a clever prompt prepended to the input (Brown et al., 2020). So, we developed two variations of the Blender and GPT-3 models by appending a generic prompt or an N-gram-based prompt to the input text of the training data. In generic prompting, we simply appended the label Advise with permission: to the end of the input text. In N-gram prompting, we detected whether any N-gram indicative of Advise with permission appeared in the output text; if so, we appended it to the end of the input text. Table 2 shows training examples with generic and N-gram-based prompts.
Altogether, we developed 10 different rephrasing models by fine-tuning Blender and GPT-3 on the template-based, retrieval-based, and combined pseudo-parallel corpora, and on the combined corpora with generic and N-gram-based prompting.

Training example with generic prompting:
Input: try to learn from your mistakes and meet some new people. Advise with permission:
Output: It may be important to try to learn from your mistakes and meet some new people.

Training example with N-gram-based prompting:
Input: try to learn from your mistakes and meet some new people. It may be important to:
Output: It may be important to try to learn from your mistakes and meet some new people.
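The construction of these two prompt variants can be sketched as follows; the list of indicator N-grams is an illustrative stand-in for the mined Advise-with-Permission N-grams:

```python
def add_prompt(source, target, mode, with_perm_ngrams):
    """Build a fine-tuning input with generic or N-gram-based prompting.

    Generic prompting appends the fixed label 'Advise with permission:'.
    N-gram prompting appends an Advise-with-Permission indicator N-gram
    found in the target text, cueing the model toward that phrasing.
    If no indicator is found, the source is left unprompted.
    """
    if mode == "generic":
        return f"{source} Advise with permission:"
    if mode == "ngram":
        for g in with_perm_ngrams:
            if g.lower() in target.lower():
                return f"{source} {g}:"
    return source
```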

Automatic Evaluation
A successful style-transferred output should demonstrate the correct target style while preserving the semantic content of the original text (Jin et al., 2022; Fu et al., 2018). We refer to the first criterion as style transfer strength and the second as semantic similarity. Automatic metrics used to evaluate text generation, such as BLEU (Papineni et al., 2002), ROUGE (Lin and Och, 2004), METEOR (Banerjee and Lavie, 2005), Word Mover's Distance (WMD) (Kusner et al., 2015), character N-gram F-score (chrF) (Popović, 2015), BERTScore (Zhang et al., 2019), and cosine similarity based on sentence embeddings (Reimers and Gurevych, 2019), are used in the literature to evaluate the semantic similarity between the original and the rephrased text. The part-of-speech distance (Tian et al., 2018), a metric specific to TST, is also used to measure semantic similarity. Mir et al. (2019) suggest deleting all attribute-related expressions in the text when applying these metrics to evaluate the output of TST tasks. Thus, before evaluation, we removed the style-specific phrases discovered during N-gram analysis from the input and output text.
To evaluate style transfer strength, most works use a style classifier to predict whether the output conforms to the target style (Hu et al., 2017; Li et al., 2018; Prabhumoye et al., 2018). We used the MI classifier trained on the MI Augmented (Union) dataset to compute style transfer strength, calculated as the percentage of samples classified as Advise with Permission out of all test samples.
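This metric reduces to a simple percentage over the classifier's predictions on the rephrased test samples:

```python
def style_transfer_strength(predicted_labels):
    """Style transfer strength: the percentage of rephrased test samples
    that the MI classifier predicts as Advise with Permission."""
    hits = sum(1 for lbl in predicted_labels if lbl == "Advise with Permission")
    return 100.0 * hits / len(predicted_labels)
```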
Table 3 shows the results of the automatic evaluation of the rephrasers on the combined PP test set, which contains data from both the template-based and retrieval-based PP test sets. Accordingly, GPT-3-based rephrasers outperform Blender-based rephrasers 85% of the time across the metrics. It can also be observed that data augmentation improves the scores across most metrics irrespective of the backbone model used. Combining the pseudo-parallel corpora obtained from the template-based and retrieval-based methods improved the scores of Blender-based rephrasers across most automatic metrics. But GPT-3-based rephrasers trained only on template-based pseudo-parallel data achieve better scores across almost all the metrics compared to those trained on the retrieval-based and combined corpora.
Blender-based rephrasers that incorporated generic prompting ranked best across most metrics among all Blender-based rephrasers. With the smaller PP training corpus, the GPT-3-based rephraser that incorporated generic prompting ranked best across most metrics. But with the larger PPA training corpus, the GPT-3-based rephraser trained on the simple template-replaced pseudo-parallel corpus ranked best across most automatic metrics.

Human Evaluation
As in the automatic evaluation, we used two human evaluation criteria to rate the rephrased sentences: first, how close the rephrased sentence is to Advise with permission (style transfer strength); and second, to what extent the rephrased sentence preserves the context/meaning of the original sentence (semantic similarity).
We used the UpWork crowdsourcing platform (www.upwork.com) and recruited four professional counselors to rate the rephrased sentences. Given the original Advise without Permission sentence and a list of rephrased sentences generated by the 10 different rephrasers, we asked the counselors two questions: 1) Is the rephrased sentence indicative of Advise with permission?; and 2) Does the rephrased sentence preserve the original context? The counselors answered these questions by indicating a rating on a Likert scale ranging from 0 (Not at all) to 4 (Yes, it is). Along
with the rephrased sentences, we also presented the corresponding Advise with permission sentence obtained from the pseudo-parallel corpora, in order to gauge the quality of the corpora used for training. The sentences to be rated were presented in random order to reduce bias.
As the combined PP test corpus developed from the MI Gold dataset is small (only 34 samples), we also had the human workers rate 200 randomly selected samples from the combined PPA test corpus developed from the augmented MI dataset, to verify the trend of results reported on the PP test corpus. We bundled 9 randomly selected test cases into one batch and allocated two workers to rate each batch. Results were calculated from the average rating given by the two workers. Following Adiwardana et al. (2020), we also averaged the style transfer strength and semantic similarity ratings to obtain a single score. We computed the inter-rater agreement based on weighted Kappa with Fleiss-Cohen weights (Wan et al., 2015); the scores were 0.5870 (moderate agreement) for style transfer strength and 0.6933 (substantial agreement) for semantic similarity.
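For reference, weighted Kappa with Fleiss-Cohen (quadratic) weights for two raters on the 0-4 Likert scale can be computed as sketched below; this is a textbook formulation, not the exact script used in the study:

```python
import numpy as np

def weighted_kappa(r1, r2, n_cat=5):
    """Cohen's kappa with Fleiss-Cohen (quadratic) weights for two raters
    scoring items on an ordinal scale 0..n_cat-1."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    obs = np.zeros((n_cat, n_cat))
    for a, b in zip(r1, r2):        # observed joint rating frequencies
        obs[a, b] += 1
    obs /= obs.sum()
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))  # chance agreement
    i, j = np.indices((n_cat, n_cat))
    w = 1 - ((i - j) ** 2) / (n_cat - 1) ** 2         # Fleiss-Cohen weights
    po, pe = (w * obs).sum(), (w * exp).sum()
    return (po - pe) / (1 - pe)
```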
Table 4 shows the results of the human evaluation. According to the results, GPT-3-based rephrasers win over Blender-based rephrasers 70% and 85% of the time along the style transfer and semantic similarity dimensions, respectively. With the smaller PP training corpus, using generic prompting during training increases the scores in most cases. But with the larger PPA corpus, simply training the rephrasers on template-replaced pseudo-parallel pairs gives the best results irrespective of the underlying backbone model.
The average ratings for style transfer strength and semantic similarity for sentence pairs in the PP test corpus were 3.21 and 3.16, respectively. The sentence pairs in the PPA test corpus scored 3.12 and 2.69 along the same two dimensions. With the average ratings close to 3, and most of them above 3, this suggests the training corpora used are of substantial quality.

Discussion
In this paper, we presented an example of how distress-consoling responses can be boosted with the MI strategy. For this, we first developed a classifier that can identify favourable and unfavourable response types as defined by the MITI code. We then narrowed our focus to the MI non-adherent response type Advise without Permission and developed several rephrasers that can rephrase Advise without Permission responses into the MI-adherent response type Advise with Permission. As curating human-written rephrasings is costly, we used template-based replacement and retrieval methods to create pseudo-parallel corpora from the gold-labeled and augmented-labeled MI datasets, which contain responses from the Reddit and CounselChat platforms. We used this data to train several Blender- and GPT-3-based rephrasers. We also used generic and N-gram-based prompts to see whether prompting can improve the rephrasers' performance.
Automatic as well as human evaluation results suggested that fine-tuning GPT-3 gives better results in rephrasing Advise without permission responses into Advise with permission. The data augmentation techniques we used, expanding the MITI labels through N-gram-based matching and similarity-based retrieval, improved the performance of the MI classifier as well as of the Blender- and GPT-3-based rephrasers. The results also suggested that when the training datasets are small, generic prompting can enable the rephrasing models to produce better results along the style transfer and semantic similarity dimensions. But with large datasets (in our case, obtained through data augmentation), pseudo-parallel data generated through simpler methods such as template-based replacement can enable the models to generate substantially good rephrasings that are closer to the required style and semantically similar to the original sentence.
In the future, we hope to develop a chatbot that can respond to psychological distress using the RED dataset, which contains dialogues curated from several mental health-related subreddits. We then hope to improve the responses generated by this chatbot by applying MI boosting at two different levels: the data level and the model level. In data-level boosting, we hope to apply the MI classifier to automatically label the responses in the training data itself. By doing so, we will be able to rephrase MI non-adherent responses such as Advise without Permission into more MI-adherent responses and omit the other unfavourable responses from the training data. The MI-boosted training data can then be used to train the chatbot. In model-level boosting, a similar methodology can be applied at the stage where the chatbot decodes responses (e.g., beam search). Not only generative chatbots but also retrieval-based chatbots could benefit from this methodology.

Limitations
Certain parts of our proposed methodology, for example, template-based replacement and N-gram-based prompting, are applicable only when style-specific linguistic attributes can be identified between the source and the target text. Due to the cost of human labor and the lack of publicly available client-therapist dialogues, the sample size drawn in this study is small and may thus affect the conclusions drawn. Our methods have been tested only for the English language, but we believe similar methods could be applied to other languages, given non-parallel corpora tagged with Advise without Permission and Advise with Permission labels. The rephrasing methods described in this paper are tested on short sentences with a maximum length of 98 tokens; the scalability of these methods to long text thus remains to be tested.
When testing the rephrasers, there are some combinations that could be tried beyond those already tested. For example, more models could be fine-tuned and tested separately on the template-replaced and retrieval-based PP and PPA corpora while incorporating generic and N-gram prompting. In this work, we combined these two types of corpora before attempting prompting, since we observed better performance on Blender when the corpora were combined.
In order to have more data, we combined the Advise with Permission and Advise without Permission responses present in the CounselChat and RED datasets. But studies show that there are differences in the language used by counselors and peers (Lahnala et al., 2021; Mousavi et al., 2021). So, there can be linguistic differences between the same type of response in the CounselChat and RED datasets. Future work should attempt to identify these differences and ideally rephrase the responses given by peers to reflect the language of the counselors.

Ethics Statement
Data Curation: Only publicly available data from the Reddit and CounselChat websites were used in this work. Analysis of posts on websites such as Reddit is considered "fair play" since individuals are anonymous and users are aware that their responses remain archived on the site unless explicitly deleted.
It is also stated in Reddit's privacy policy that it allows third parties to access public Reddit content. Also, Reddit's data is already widely available in larger dumps such as Pushshift (Baumgartner et al., 2020). Even though the policies allow it, it should be thoroughly noted that this data contains sensitive information. Thus, we adhere to the guidelines suggested by Benton et al. (2017) for working with social media data in health research, and share only anonymized and paraphrased excerpts from the dataset so that it is not possible to recover usernames through a web search with the verbatim post text. In addition, references to usernames as well as URLs are removed from dialogue content for de-identification.
Human Evaluation: The human raters recruited from the crowdsourcing platform UpWork were all trained in the practice of counseling. Since the methods were tested on English-only text, we recruited workers who had professional competency in the English language. We paid them $10 for evaluating each batch of rephrased sentences, which required on average ≈30 minutes to complete. Thus, the amount paid to the human raters was ≈2.75 times the US minimum wage of $7.25 per hour. We also paid an extra $2 bonus per batch to workers who obtained an above-average agreement with the other worker who rated the same batch.
Chatbots for Distress Consolation: One of the main applications of the proposed methodology is boosting chatbot responses for distress consolation with the motivational interviewing strategy. Using chatbots for distress consolation or other mental health interventions has raised ethical concerns among many (Lanteigne, 2019; Montemayor et al., 2021; Tatman, 2022). However, chatbots that intervene in mental health-related matters have already been developed and have been quite popular for a while. Some examples are SimSensei (DeVault et al., 2014), Dipsy (Xie, 2017), Woebot (woebothealth.com), and Wysa (www.wysa.io). Czerwinski et al. (2021) state, "About 1 billion people globally are affected by mental disorders; a scalable solution such as an AI therapist could be a huge boon." The current technology for developing such chatbots relies heavily on deep learning and pre-trained language models. But due to the inherently unpredictable nature of these models, they pose a threat of delivering unfavourable responses when used for distress consolation. We believe the methodology we suggest in this work can help them become more reliable and fail-safe by adhering to the motivational interviewing strategy, a guiding style of communication heavily practiced in psychotherapy. However, since the unfavourable response detection and rephrasing methods still rely on neural network models, the artifacts produced in this paper should be used for research purposes only, and real-world deployment should be done under human supervision.

A.1 The RED (Reddit Emotional Distress) Dataset
The RED dataset is curated from 8 carefully selected mental health-related subreddits. According to the latest statistics, 61% of Reddit users are male. Of the users, 48% are from the United States. People aged 18-29 make up Reddit's largest user base (64%). The second biggest age group is 30-49 (29%). Only 7% of Reddit users are over 50. It should be noted that these demographic biases can subtly skew our data and models away from representing average human behavior. The data we curated were English-only, and they may perpetuate an English bias in NLP systems.

A.2 The MI Dataset
Altogether, 15 labels adapted from the MITI code 2.0 (Moyers et al., 2003) and 4.2.1 (Moyers et al., 2014) were used for annotation. They included Closed Question, Open Question, Simple Reflection, Complex Reflection, and Give Information, which are generally considered favourable. They also included labels recognized specifically as MI adherent, which are Advise with Permission, Affirm, Emphasize Autonomy, and Support. Another four labels are recognized as MI non-adherent: Advise without Permission, Confront, Direct, and Warn. We also included two other labels, Self-Disclose and Other, which are not part of the MITI code. The label Self-Disclose was included because, in peer support conversations, peers mostly share their lived experiences. Though it is believed that self-disclosure contributes to building rapport between the speaker and listener, as suggested by R. Schwartz (2021), this type of disclosure must be used wisely and with caution, since it can also be counterproductive, distorting the client's transference. Thus, it is important to be able to recognize this response type.
Table 5 shows the full list of labels we adapted from the MITI code along with descriptions and examples. Table 6 shows the statistics of the annotated responses in the MI dataset corresponding to each label.

A.3 Data Augmentation: N-gram Based Matching
We denote examples of the most frequent N-grams corresponding to each label in Table 7. For simplicity, we list only some of them along with their corresponding frequencies. For data augmentation, we used all four-grams and five-grams that had a frequency above 5. Table 8 shows the statistics of the labels extended through N-gram-based matching in the CC and RED datasets. We also encountered 518 and 53,196 sentences in the CounselChat and RED datasets, respectively, that had overlapping labels; these were discarded due to ambiguity.
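The N-gram-based matching step can be sketched as below, using the thresholds stated above (four-grams and five-grams with frequency above 5). The function names are illustrative; as in the paper, a sentence matching n-grams of more than one label is discarded as ambiguous.

```python
from collections import Counter, defaultdict

def extract_frequent_ngrams(labeled_sentences, n_values=(4, 5), min_freq=5):
    """Collect, per label, the n-grams occurring more than min_freq times
    in the human-annotated sentences."""
    counts = defaultdict(Counter)
    for tokens, label in labeled_sentences:
        for n in n_values:
            for i in range(len(tokens) - n + 1):
                counts[label][tuple(tokens[i:i + n])] += 1
    return {label: {ng for ng, c in ctr.items() if c > min_freq}
            for label, ctr in counts.items()}

def label_by_ngrams(tokens, ngrams_by_label, n_values=(4, 5)):
    """Return the single label whose frequent n-grams the sentence contains,
    or None when zero or multiple labels match (ambiguous)."""
    sent_ngrams = {tuple(tokens[i:i + n])
                   for n in n_values for i in range(len(tokens) - n + 1)}
    matched = {label for label, ngrams in ngrams_by_label.items()
               if sent_ngrams & ngrams}
    return matched.pop() if len(matched) == 1 else None
```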

A.4 Data Augmentation: Similarity Based Retrieval
To derive semantically meaningful sentence embeddings that can be compared using cosine similarity, we used Sentence-BERT (SBERT) proposed by Reimers and Gurevych (2019), which uses siamese and triplet network structures to compute sentence embeddings. Among the several models the authors have proposed, we used the roberta-base-nli-stsb-mean-tokens model, fine-tuned on the NLI (Bowman et al., 2015) and STS benchmark (STSb) (Cer et al., 2017) datasets, since it reports a high Spearman's rank correlation of 84.79 ± 0.38 between the cosine similarity of the sentence embeddings and the gold labels on the STS benchmark test set, outperforming the existing state of the art. It is also more efficient to use than roberta-large.
As described in Section 3, we used majority voting, followed by computing the average similarity of retrieved sentences with the same label (in case of ties), to choose the final label for an unlabeled sentence. In Figure 2, we show an example elaborating this procedure.
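The retrieval-and-vote procedure can be sketched as follows, assuming sentence embeddings have already been computed (e.g. with SBERT); `retrieve_label` and the neighbourhood size `k` are illustrative names, not the paper's exact implementation.

```python
from collections import defaultdict
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_label(query_emb, labeled_embs, labels, k=5):
    """Label an unlabeled sentence by majority vote over its k most
    similar labeled sentences; ties are broken by the higher average
    similarity of the tied label's retrieved sentences."""
    sims = [cosine_sim(query_emb, e) for e in labeled_embs]
    top = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)[:k]
    by_label = defaultdict(list)
    for i in top:
        by_label[labels[i]].append(sims[i])
    best_count = max(len(v) for v in by_label.values())
    tied = {lab: v for lab, v in by_label.items() if len(v) == best_count}
    return max(tied, key=lambda lab: sum(tied[lab]) / len(tied[lab]))
```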
Table 8 shows the statistics of the labels extended through similarity-based retrieval in CC and RED datasets.

A.5 Augmented MI Datasets
Table 9 shows the statistics corresponding to each label in the MI Augmented (Union) and MI Augmented (Intersection) datasets, developed by taking the union and the intersection of the sentences automatically annotated by the N-gram-based matching and similarity-based retrieval methods.

B MI Classifier
We used the same hyper-parameter setting used in RoBERTa (Liu et al., 2019) when training the MI classifier. We used the Adam optimizer with β1 of 0.9, β2 of 0.98, an ϵ value of 1 × 10−6, and a learning rate of 2 × 10−5. A dropout of 0.1 was used on all layers and attention weights, along with a GELU activation function (Hendrycks and Gimpel, 2016). We limited the maximum number of input tokens to 100 and used a batch size of 32. All models were trained for 20 epochs. In all cases, the optimal epoch was selected based on the average cross-entropy loss calculated between the ground-truth and predicted labels of the human-annotated (MI Gold) validation set. All experiments were conducted on a machine with 2x12 cores @ 2.5 GHz, 256 GB RAM, 2x200 GB SSD, and 4 NVIDIA Titan X Pascal GPUs. Experiments were also conducted using GPT3 as the pre-trained language model; however, RoBERTa outperformed GPT3 in this classification task.

Figure 3 shows the architectural diagram of the MI classifier used for annotation. Table 10 shows the performance scores of the MI classifier when trained on the gold-labeled and augmented MI datasets.

C MI Rephraser
C.1 Construction of Pseudo-Parallel Corpora
Table 11 denotes the full list of templates corresponding to Advise without Permission and Advise with Permission responses that were used in the process of creating pseudo-parallel corpora using the template-based replacement method.
In Figure 4, we visualize the process of creating Pseudo-Parallel (PP) and Pseudo-Parallel Augmented (PPA) corpora along with statistics corresponding to each dataset.
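As an illustration of the template-based replacement used to build the PP corpus, the sketch below pairs hypothetical Advise-without-Permission patterns with MI-adherent rewrites; the actual template pairs are those listed in Table 11, and the regexes here are assumptions for illustration only.

```python
import re

# Hypothetical template pairs: an MI non-adherent "Advise without
# Permission" pattern mapped to an "Advise with Permission" rewrite.
TEMPLATE_PAIRS = [
    (re.compile(r"^You need to (.+)$", re.IGNORECASE), r"You may want to \1"),
    (re.compile(r"^You should (.+)$", re.IGNORECASE), r"Perhaps you can \1"),
]

def template_rephrase(sentence):
    """Return a pseudo-parallel MI-adherent rewrite of the sentence,
    or None if no template matches."""
    for pattern, replacement in TEMPLATE_PAIRS:
        if pattern.match(sentence):
            return pattern.sub(replacement, sentence)
    return None
```

Running every Advise-without-Permission sentence through such pairs yields source-target pairs without any human rewriting, which is how the pseudo-parallel corpus avoids annotation cost.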

C.2 Rephrasing Models
For developing rephrasing models, we used the 90M-parameter version of Blender (Roller et al., 2021). It contains an 8-layer encoder, an 8-layer decoder with 512-dimensional embeddings, and 16 attention heads. It has a maximum input length of 1024 tokens. All code for fine-tuning is available in ParlAI (Miller et al., 2017). All models were fine-tuned for 200 epochs, with a batch size of 8 and a learning rate of 1 × 10−6. For other hyper-parameters, we used the default values defined in the documentation at https://parl.ai/projects/recipes. Fine-tuning was conducted on a machine with 2x12 cores @ 2.5 GHz, 256 GB RAM, 2x200 GB SSD, and 4 NVIDIA Titan X Pascal GPUs.
We also used the GPT3 pre-trained language model, which has 175 billion parameters. The smallest and fastest version of GPT3, Ada, was used in our experiments. Fine-tuning of the GPT3 models was done through the paid API provided by OpenAI (www.openai.com), following the API guide at https://beta.openai.com/docs/guides/fine-tuning. We used the default set of hyper-parameters for fine-tuning all GPT3-based models; these hyper-parameters have been tested to work well across a range of use cases. All models were fine-tuned for 4 epochs, with a batch size of ≈0.2% of the number of examples in the training set (capped at 256) and a learning rate of 0.05.
Table 12 shows some examples of sentences rephrased by the different rephraser models we fine-tuned. The ROUGE score and the BERTScore were computed using the rouge and bert_score Python libraries, respectively. The POS distance was calculated as described in the work by Tian et al. (2018), following the code released by the authors on GitHub. For computing the Word Mover's Distance (WMD), we used Gensim's implementation. We used sentence embeddings generated with Sentence-BERT (Reimers and Gurevych, 2019) to compute the cosine similarity between the original and rephrased text. Among the models the authors have proposed, we used the roberta-base-nli-stsb-mean-tokens model, fine-tuned on the NLI (Bowman et al., 2015) and STS benchmark (STSb) (Cer et al., 2017) datasets, to generate the embeddings.
All the automatic evaluation scores are reported for a single run.
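The ROUGE scores above were computed with the rouge library; purely as an illustration of what the metric measures, a plain-Python ROUGE-1 between an original sentence and its rephrasing can be written as:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall, and F1 from n-gram overlap counts."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```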

F Other Remarks
In the human evaluation results, we observed that in 97.5% of the cases, the average scores obtained for style transfer strength are better than the average scores obtained for semantic similarity. This observation is invariant to the type of backbone model used in training. This implies that the template-based and retrieval-based methods used in creating pseudo-parallel data to train the rephrasers make it easier for the rephrasers to generate rephrased sentences that reflect a particular style (in this case, Advise with Permission) than to preserve the semantic meaning of the original sentence. This is a matter to be further investigated. To improve the scores on semantic similarity, future work can explore ways to take into account the context that precedes the sentence to be rephrased. In this way, though the rephrased version may not reflect exactly what was in the original sentence, it might still generate rephrasings relevant to the preceding context. It should be noted that the application of this work is not limited to improving chatbot responses for distress consolation. It could also be applied to the development of intelligent writing assistants that suggest better responses when peers untrained in the practice of counseling attempt to respond to distress-related posts on peer support platforms such as Reddit.

G Distribution and Use of Artifacts
The artifacts produced, including the datasets and the models, will be released under the CC BY-NC-SA 3.0 license (https://creativecommons.org/licenses/by-nc-sa/3.0), providing only non-commercial access to users. We use artifacts such as the CounselChat dataset and pre-trained language architectures such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), Blender (Roller et al., 2021), and GPT3 (Brown et al., 2020) for research purposes only, which does not violate their intended use.

Figure 1 :
Figure 1: Example of detecting unfavourable and favourable response types in distress support dialogues and boosting the responses by omitting unfavourable responses or rephrasing them into more favourable ones.

- You can (verb)
- It may be helpful to (verb)
- You could (verb)
- You may want to (verb)
- You need to (verb)
- I encourage you to (verb)
- You should (verb)
- Perhaps you can (verb)
- (Verb) ..., if you would like.

Figure 2 :
Figure 2: An example of automatically labeling an unlabeled sentence by computing the cosine similarity with labeled sentences. The label is chosen based on majority voting. But this example shows a tie. Thus, we compute the average similarity of the sentence clusters that hold a tie and select the label of the sentence cluster with the maximum average similarity.

Figure 3 :
Figure 3: The architecture of the MI classifier.

Figures 5
Figures 5, 6, and 7 show the user interfaces developed for the human evaluation task. The first shows the task description, the second shows the self-evaluation practice task designed to familiarize the counselors with the rating task, and the last shows the actual human evaluation task itself.

Figure 6 :
Figure 6: Self-evaluating practice task offered to the counselors to get familiarized with the rating task.

Figure 7 :
Figure 7: The human evaluation task interface.

Table 1 :
Examples of templates corresponding to Advise without Permission and Advise with Permission responses. The full list is included in the appendices.

Table 2 :
Examples with generic and N-gram prompts.

Table 3 :
Automatic evaluation results on the PP test set. Under each method (Template, Retrieval, etc.), the score of the best-performing rephraser is in bold. The best score obtained by each of the BB- and GPT3-based rephrasers along each criterion is highlighted in green. Out of them, the best overall score is highlighted in darker green.

Table 4 :
Results of human evaluation. Under each methodology (Template, Retrieval, etc.).
Do you think this is an advantage? Did you use heroin this week?
2. Open Question: Questions that allow a wide range of possible answers. It may seek information, invite the speaker's perspective, or encourage self-exploration.

Table 5 :
The set of labels adapted from the MITI code that the MI classifier is able to recognize.

Table 6 :
Statistics of human annotated MITI labels in CounselChat (CC) and RED datasets.

Table 11 :
The full list of templates corresponding to Advise without Permission and Advise with Permission responses.

Table 7 :
Examples of the most frequent four-grams and five-grams corresponding to each label. Their frequencies are denoted within brackets.

Table 8 :
Statistics of the labels extended through N-gram-based matching and similarity-based retrieval in the CC and RED datasets.

Table 9 :
Statistics of the annotated responses in MI Augmented (Intersection) and MI Augmented (Union) datasets.